Dataset Information

Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.

ABSTRACT: A major computational challenge in the genomic era is annotating structure and function for the vast quantities of sequence information now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data formats (e.g., alignment information, physicochemical information, genomic information). Indeed, we have demonstrated that Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments remains informative even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded and unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method while retaining similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.
We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.

SUBMITTER: Hong Y

PROVIDER: S-EPMC2962639 | biostudies-literature | 2010 Oct

REPOSITORIES: biostudies-literature

Publications

Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.

Hong Y, Kang J, Lee D, van Rossum DB

PLoS ONE, 2010 Oct 22, issue 10

PMID: 21042584

Similar Datasets

PGAGP: Predicting pathogenic genes based on adaptive network embedding algorithm.
| S-EPMC9895109 | biostudies-literature

BLVector: Fast BLAST-Like Algorithm for Manycore CPU With Vectorization.
| S-EPMC7884812 | biostudies-literature

Sequence embedding for fast construction of guide trees for multiple sequence alignment.
| S-EPMC2893182 | biostudies-literature

Fast surface reconstruction algorithm with adaptive step size.
| S-EPMC11771931 | biostudies-literature

FOGSAA: Fast Optimal Global Sequence Alignment Algorithm.
| S-EPMC3638164 | biostudies-literature

Survey of Protein Sequence Embedding Models.
| S-EPMC9963412 | biostudies-literature

Self-adaptive multiscaling algorithm for efficient simulations of many-protein systems in crowded conditions.
| S-EPMC6752035 | biostudies-literature

Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance.
| S-EPMC5455086 | biostudies-literature

An efficient functional magnetic resonance imaging data reduction strategy using neighborhood preserving embedding algorithm.
| S-EPMC8886658 | biostudies-literature

R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm.
| S-EPMC3999979 | biostudies-literature