Dataset Information

A simple method to control over-alignment in the MAFFT multiple sequence alignment program.


ABSTRACT: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing as low-quality or noisy sequences accumulate in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the number of correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. The new feature is available in MAFFT versions 7.263 and higher at http://mafft.cbrc.jp/alignment/software/. Contact: katoh@ifrec.osaka-u.ac.jp. Supplementary data are available at Bioinformatics online.
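
The idea of a pair-dependent scoring matrix can be illustrated with a small sketch. This is not MAFFT's implementation: the k-mer similarity measure, the linear penalty, the toy four-letter alphabet, and the names `kmer_similarity`, `pair_specific_matrix`, and `max_penalty` are all assumptions made for illustration. It only shows how a less similar pair could be given a stricter matrix, so that unrelated segments tend to score below zero and stay unaligned.

```python
from itertools import product


def kmer_similarity(a: str, b: str, k: int = 2) -> float:
    """Crude global similarity: Jaccard index of the two k-mer sets.

    Assumption for illustration only; not the measure used by MAFFT.
    """
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def pair_specific_matrix(base, similarity, max_penalty=2.0):
    """Return a copy of `base` with every score lowered by a penalty
    that grows linearly as the pair's global similarity falls."""
    penalty = max_penalty * (1.0 - similarity)
    return {pair: score - penalty for pair, score in base.items()}


# Toy base matrix over a hypothetical four-letter alphabet.
ALPHABET = "ACGT"
BASE = {(x, y): (2.0 if x == y else -1.0)
        for x, y in product(ALPHABET, repeat=2)}

# A closely related pair keeps scores near the base matrix ...
close = pair_specific_matrix(BASE, kmer_similarity("ACGTACGT", "ACGTACGA"))
# ... while a pair with no shared k-mers gets the full penalty.
distant = pair_specific_matrix(BASE, kmer_similarity("ACGTACGT", "TTTTGGGG"))
```

In a progressive aligner, each pairwise or group-to-group dynamic programming step would then use the matrix derived for that particular pair instead of one matrix shared by the whole alignment.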

SUBMITTER: Katoh K 

PROVIDER: S-EPMC4920119 | biostudies-literature | 2016 Jul

REPOSITORIES: biostudies-literature

Publications

A simple method to control over-alignment in the MAFFT multiple sequence alignment program.

Katoh K, Standley DM

Bioinformatics (Oxford, England), 2016-02-26, issue 13


Similar Datasets

| S-EPMC2905546 | biostudies-literature
| S-EPMC548345 | biostudies-literature
| S-EPMC3603318 | biostudies-literature
| S-EPMC5079479 | biostudies-literature
| S-EPMC10148686 | biostudies-literature
| S-EPMC6041967 | biostudies-literature
| S-EPMC2228335 | biostudies-literature
| S-EPMC2387179 | biostudies-literature
| S-EPMC21241 | biostudies-other
| S-EPMC3160389 | biostudies-literature