Dataset Information

Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold.


ABSTRACT: Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position-specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But these models are subject to contamination: once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
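
The query-seeding step described in the abstract can be illustrated with a minimal sketch. This is not the psisearch implementation; the function name `query_seeded_row`, the `-` gap character, and the toy sequences are assumptions made for illustration only. Given one gapped pairwise alignment, the function builds the subject's row of a query-based MSA: columns where the query has a gap (subject insertions) are dropped, and columns where the subject has a gap are filled with the original query residue.

```python
def query_seeded_row(aligned_query: str, aligned_subject: str, gap: str = "-") -> str:
    """Return the subject's row in a query-based MSA, seeding subject gaps
    with the corresponding query residues (illustrative sketch only)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")

    row = []
    for q_res, s_res in zip(aligned_query, aligned_subject):
        if q_res == gap:
            # Subject insertion relative to the query: a query-based MSA
            # has no column for it, so the residue is skipped.
            continue
        # Where the subject is gapped, insert the original query residue;
        # otherwise keep the aligned subject residue.
        row.append(q_res if s_res == gap else s_res)
    return "".join(row)


# Hypothetical toy alignment (not taken from the paper).
seeded = query_seeded_row("MKT-AYIA", "MRTGA--A")
print(seeded)
```

The sketch covers only the aligned region of a single pairwise alignment. How the resulting rows are weighted and turned into a PSSM is left to the search program and is not shown here.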

SUBMITTER: Pearson WR 

PROVIDER: S-EPMC5605230 | biostudies-literature | 2017 Apr

REPOSITORIES: biostudies-literature

Publications

Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold.

Pearson WR, Li W, Lopez R

Nucleic Acids Research, 2017-04-01; 7


Similar Datasets

| S-EPMC3125773 | biostudies-literature
| S-EPMC5006591 | biostudies-literature
| S-EPMC4521371 | biostudies-literature
| S-EPMC4830209 | biostudies-literature
| S-EPMC7352980 | biostudies-literature
| S-EPMC2853128 | biostudies-literature
2014-11-03 | E-GEOD-57175 | biostudies-arrayexpress
| S-EPMC6272706 | biostudies-literature
| S-EPMC4382900 | biostudies-literature
| S-EPMC3984865 | biostudies-other