Dataset Information

Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

ABSTRACT:

Motivation

The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible, for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical.

Results

This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structure-based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences.
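The role of Karlin-Altschul statistics in the detection and classification step can be illustrated with a minimal sketch. It is not taken from the MAPGAPS source code. It applies the standard formulas E = K·m·n·exp(−λS) and S' = (λS − ln K) / ln 2 to raw alignment scores, then assigns a sequence to its best-scoring profile if the hit is significant. The function names, the profile labels, the E-value cutoff and the values of λ and K are all assumptions chosen for illustration.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits with score >= raw_score:
    E = K * m * n * exp(-lam * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def classify(scores_by_profile, query_len, db_len, lam, k, max_e=0.01):
    """Assign a sequence to its best-scoring profile (hypothetical helper).

    scores_by_profile maps a profile label to the raw score of the sequence
    against that profile. Returns (label, E) if the best hit has E <= max_e,
    otherwise None, meaning the sequence is not detected as a family member.
    """
    label, score = max(scores_by_profile.items(), key=lambda kv: kv[1])
    e = e_value(score, query_len, db_len, lam, k)
    return (label, e) if e <= max_e else None


# Illustrative parameters only; real values depend on the scoring system.
LAMBDA, K = 0.267, 0.041

# Hypothetical raw scores of one sequence against two subgroup profiles.
hit = classify({"subgroup_A": 30, "subgroup_B": 120},
               query_len=200, db_len=10**9, lam=LAMBDA, k=K)
print(hit)
```

Because each sequence is assigned to the profile it matches best, that profile can also serve as the alignment template for the sequence, which is the dual use of profiles described in the abstract.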

Availability

A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Neuwald AF

PROVIDER: S-EPMC2732367 | biostudies-literature | 2009 Aug

REPOSITORIES: biostudies-literature

Publications

Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

Neuwald, Andrew F.

Bioinformatics (Oxford, England), 2009-06-08, issue 15

PMID: 19505947

Similar Datasets

DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences. | S-EPMC2957682 | biostudies-literature

INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences. | S-EPMC3333187 | biostudies-literature

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points. | S-EPMC6330006 | biostudies-literature

UPP2: fast and accurate alignment of datasets with fragmentary sequences. | S-EPMC9846425 | biostudies-literature

Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. | S-EPMC1955456 | biostudies-literature

Accurate detection of m6A RNA modifications in native RNA sequences [Yeast] | 2019-09-09 | GSE126213 | GEO

PatMaN: rapid alignment of short sequences to large databases. | S-EPMC2718670 | biostudies-literature

INSIDER: alignment-free detection of foreign DNA sequences. | S-EPMC8273350 | biostudies-literature

Accurate detection of m6A RNA modifications in native RNA sequences [Curlcake constructs] | 2019-07-10 | GSE124309 | GEO

Centrifuge: rapid and sensitive classification of metagenomic sequences. | S-EPMC5131823 | biostudies-literature