Dataset Information

Sequence embedding for fast construction of guide trees for multiple sequence alignment.

ABSTRACT:

Background

The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
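To make the quadratic growth concrete, the short sketch below counts the pairwise distances and estimates the size of a full distance matrix for a few values of N. The 8-byte entry size and the full square (rather than triangular) matrix layout are assumptions chosen for illustration; they are not taken from the paper.

```python
def pairwise_count(n):
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def matrix_bytes(n, bytes_per_entry=8):
    """Size of a full n x n distance matrix.

    bytes_per_entry=8 is an assumption (double-precision floats).
    """
    return n * n * bytes_per_entry


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        gigabytes = matrix_bytes(n) / 1e9
        print(f"N={n}: {pairwise_count(n)} pairs, about {gigabytes:g} GB")
```

Under these assumptions, going from 10,000 to 100,000 sequences multiplies both the number of distance calculations and the matrix size by roughly 100.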

Results

In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
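The general idea of such an embedding can be sketched as follows: each sequence is represented by a short vector of its distances to a small set of reference sequences, and sequences are then compared through those vectors instead of through direct pairwise distance calculations. This is a minimal illustration only, not the authors' implementation. The random choice of reference sequences, the k-mer based `kmer_distance` function, and all names here are hypothetical stand-ins.

```python
import math
import random


def kmer_set(seq, k=2):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Hypothetical stand-in for an expensive sequence distance:
    1 minus the Jaccard similarity of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(seqs, n_refs, rng):
    """Map each sequence to the vector of its distances to n_refs
    reference sequences (picked at random here, as an assumption)."""
    refs = rng.sample(seqs, n_refs)
    return [[kmer_distance(s, r) for r in refs] for s in seqs]


def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTTCGTAC"]
vectors = embed(seqs, n_refs=2, rng=random.Random(0))

if __name__ == "__main__":
    for s, v in zip(seqs, vectors):
        print(s, [round(x, 3) for x in v])
```

With N sequences and a fixed number of reference sequences, only N times that number of expensive distance calculations are needed to build the vectors. Any later comparison between two sequences uses the inexpensive vector distance.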

Conclusions

We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.

SUBMITTER: Blackshields G

PROVIDER: S-EPMC2893182 | biostudies-literature | 2010 May

REPOSITORIES: biostudies-literature

Publications

Sequence embedding for fast construction of guide trees for multiple sequence alignment.

Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G. Higgins

Algorithms for Molecular Biology (AMB), 14 May 2010

PMID: 20470396

Similar Datasets

Fast alignment of fragmentation trees.
| S-EPMC3371839 | biostudies-literature

Simple chained guide trees give high-quality protein multiple sequence alignments.
| S-EPMC4115562 | biostudies-literature

FAMSA: Fast and accurate multiple sequence alignment of huge protein families.
| S-EPMC5037421 | biostudies-literature

Fast and robust multiple sequence alignment with phylogeny-aware gap placement.
| S-EPMC3495709 | biostudies-literature

QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.
| S-EPMC3934876 | biostudies-literature

Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks.
| S-EPMC4299236 | biostudies-literature

FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.
| S-EPMC10809904 | biostudies-literature

Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees.
| S-EPMC5079479 | biostudies-literature

The construction and use of log-odds substitution scores for multiple sequence alignment.
| S-EPMC2904766 | biostudies-literature

FOGSAA: Fast Optimal Global Sequence Alignment Algorithm.
| S-EPMC3638164 | biostudies-literature