Dataset Information

The whole alignment and nothing but the alignment: the problem of spurious alignment flanks.


ABSTRACT: Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On the one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs or amino acids in length. For example, in the UCSC genome database, spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.

SUBMITTER: Frith MC 

PROVIDER: S-EPMC2566872 | biostudies-literature | 2008 Oct

REPOSITORIES: biostudies-literature

Publications

The whole alignment and nothing but the alignment: the problem of spurious alignment flanks.

Frith MC, Park Y, Sheetlin SL, Spouge JL

Nucleic Acids Research, 2008-09-16, issue 18


Similar Datasets

| S-EPMC2588114 | biostudies-literature
| S-EPMC1734709 | biostudies-other
| S-EPMC2709253 | biostudies-literature
| S-EPMC6014708 | biostudies-literature
| S-EPMC3031037 | biostudies-literature
| S-EPMC3142524 | biostudies-literature
| S-EPMC4248324 | biostudies-literature
| S-EPMC3549825 | biostudies-literature
| S-EPMC4969266 | biostudies-literature
| S-EPMC4155257 | biostudies-other