Dataset Information

The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis.

ABSTRACT:

Motivation

The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).
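The arithmetic behind the figures above can be checked with a short worked example. This is only an illustrative sketch: it assumes that "F-Score" means the F1 score (harmonic mean of precision and recall), and it uses a hypothetical 200-residue protein in which 10% of residues are functional. The 10% figure is an assumption for illustration, not a value from the abstract.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical protein (assumed values, not from the source).
total_residues = 200
functional_residues = 20            # assumed 10% of the protein

# Scenario from the abstract: predict ~25% of all residues
# and recover half of the functional site residues.
predicted = total_residues // 4     # 50 residues predicted
tp = functional_residues // 2       # 10 functional residues found
fp = predicted - tp                 # 40 false positives
fn = functional_residues - tp       # 10 functional residues missed

score = f_score(tp, fp, fn)
print(f"precision={tp / predicted:.2f} recall={tp / functional_residues:.2f} F={score:.2f}")
```

Under these assumptions precision is 0.2 and recall is 0.5, so the F-Score lands close to the ∼0.3 quoted above; a different assumed fraction of functional residues would shift the result.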

Results

The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
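The idea of sampling homolog subsets and scoring each resulting MSA can be sketched as follows. This is a minimal illustration, not the authors' algorithm or SAMMI: the entropy-based conservation score, the 25% prediction cutoff, the toy sequences, and all function names are assumptions made for the example. The selection step uses annotated binding sites, so as written it estimates an upper bound on performance rather than acting as a predictor.

```python
import math
import random
from collections import Counter


def column_conservation(msa):
    """Per-column score: 1 - Shannon entropy / log2(21); gaps count as a symbol."""
    scores = []
    for j in range(len(msa[0])):
        counts = Counter(seq[j] for seq in msa)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / math.log2(21))
    return scores


def predict_sites(msa, top_fraction=0.25):
    """Predict the most conserved columns (assumed cutoff: top 25%)."""
    scores = column_conservation(msa)
    k = max(1, round(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return set(ranked[:k])


def f1(predicted, true_sites):
    """F1 score between predicted and annotated site columns."""
    tp = len(predicted & true_sites)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true_sites)
    return 2 * precision * recall / (precision + recall)


def best_sampled_msa(query, homologs, true_sites, subset_size, n_samples, seed=0):
    """Randomly sample homolog subsets and keep the one with the highest F1."""
    rng = random.Random(seed)
    best_score, best_subset = -1.0, None
    for _ in range(n_samples):
        subset = rng.sample(homologs, subset_size)
        score = f1(predict_sites([query] + subset), true_sites)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


# Toy, pre-aligned sequences (hypothetical data, 8 columns each).
query = "ACDEFGHK"
homologs = ["ACDEFGHK", "TCDQFGRK", "SCNEFAHR", "GCDEFGHK", "ACEDFSHK", "VCDKFGQK"]
true_sites = {1, 4}  # assumed binding-site columns (0-based)

best_score, best_subset = best_sampled_msa(query, homologs, true_sites, 3, 20)
print(best_score, best_subset)
```

Comparing the best sampled score with the score of a single random subset gives a rough sense of how much the choice of homologs alone can change the outcome on a given protein.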

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Gil N

PROVIDER: S-EPMC6298051 | biostudies-literature | 2019 Jan

REPOSITORIES: biostudies-literature

Publications

The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis.

Gil N, Fiser A

Bioinformatics (Oxford, England), 2019 Jan 1, issue 1

PMID: 29947739

Similar Datasets

Evolutionary profiles from the QR factorization of multiple sequence alignments.
| S-EPMC554820 | biostudies-literature

Dramatic impact of metric choice on biogeographical regionalization.
| S-EPMC7195599 | biostudies-literature

Progressive multiple sequence alignments from triplets.
| S-EPMC1948021 | biostudies-literature

H2r: identification of evolutionary important residues by means of an entropy based analysis of multiple sequence alignments.
| S-EPMC2323388 | biostudies-literature

Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments.
| S-EPMC7297217 | biostudies-literature

A minimum reporting standard for multiple sequence alignments.
| S-EPMC7671350 | biostudies-literature

Refining multiple sequence alignments with conserved core regions.
| S-EPMC1463900 | biostudies-literature

OD-seq: outlier detection in multiple sequence alignments.
| S-EPMC4548304 | biostudies-literature

MSAViewer: interactive JavaScript visualization of multiple sequence alignments.
| S-EPMC5181560 | biostudies-literature

Remarkable Evolutionary Conservation of Antiobesity ADIPOSE/WDTC1 Homologs in Animals and Plants.
| S-EPMC5586369 | biostudies-literature