Dataset Information

An analysis of single amino acid repeats as use case for application specific background models.

ABSTRACT:

Background

Sequence analysis aims to identify biologically relevant signals against a backdrop of functionally meaningless variation. It is increasingly recognized that the quality of the background model directly affects the performance of analyses. State-of-the-art approaches rely on classical sequence models that are adapted to the studied dataset. Although they perform well in the analysis of globular protein domains, these models break down in regions of stronger compositional bias or low complexity. While these regions are typically filtered out, there is increasing anecdotal evidence that they have functional roles. This motivates an exploration of more complex sequence models and application-specific approaches for the investigation of biased regions.

Results

Traditional Markov chains and application-specific regression models are compared using the example of predicting runs of single amino acids, a particularly simple class of biased regions. Cross-validation experiments reveal that the alternative regression models capture the multivariate trends well, despite their low dimensionality and in contrast even to higher-order Markov predictors. We show how the significance of unusual observations can be computed for such empirical models. The power of a dedicated model in the detection of biologically interesting signals is then demonstrated in an analysis identifying the unexpected enrichment of contiguous leucine repeats in signal peptides. Considering different reference sets, we show how the question examined actually defines what constitutes the 'background'. Results can thus be highly sensitive to the choice of appropriate model training sets. Conversely, the choice of reference data determines the questions that can be investigated in an analysis.
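To make the notion of a background model for single amino acid runs concrete, the following is a minimal Python sketch. It is not the regression approach evaluated in the paper; it only shows how runs can be located and how two classical backgrounds (a zeroth-order composition model and a first-order Markov chain) assign a probability to a run starting at a given position. The function names and the toy sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def find_runs(seq, min_len=3):
    """Return (residue, start, length) for each run of identical residues
    that is at least min_len long. Start positions are 0-based."""
    runs = []
    pos = 0
    for residue, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((residue, pos, length))
        pos += length
    return runs


def iid_run_probability(seq, residue, length):
    """Probability that `length` consecutive positions all hold `residue`
    under a zeroth-order model whose composition is estimated from seq."""
    freq = Counter(seq)[residue] / len(seq)
    return freq ** length


def markov1_run_probability(seq, residue, length):
    """Same event under a first-order Markov chain estimated from seq:
    P(residue) * P(residue | residue) ** (length - 1)."""
    freq = Counter(seq)[residue] / len(seq)
    pairs = Counter(zip(seq, seq[1:]))
    from_residue = sum(n for (a, _), n in pairs.items() if a == residue)
    self_transition = pairs[(residue, residue)] / from_residue if from_residue else 0.0
    return freq * self_transition ** (length - 1)


# Hypothetical toy sequence, used both to fit and to query the models.
toy = "MLLLLAKLLA"
for residue, start, length in find_runs(toy):
    p0 = iid_run_probability(toy, residue, length)
    p1 = markov1_run_probability(toy, residue, length)
    print(residue, start, length, round(p0, 4), round(p1, 4))
```

In practice the background would be estimated from a separate reference set rather than from the sequence being scored. As the abstract stresses, the choice of that reference set determines which question the comparison actually answers.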

Conclusions

Using a specific case of studying biased regions as an example, we have demonstrated that the construction of application-specific background models is both necessary and feasible in a challenging sequence analysis situation.

SUBMITTER: Labaj PP

PROVIDER: S-EPMC3124433 | biostudies-literature | 2011 May

REPOSITORIES: biostudies-literature

Publications

An analysis of single amino acid repeats as use case for application specific background models.

Paweł P. Łabaj, Peter Sykacek, David P. Kreil

BMC Bioinformatics, 2011-05-19

PMID: 21595908

Similar Datasets

COPASAAR--a database for proteomic analysis of single amino acid repeats. | S-EPMC1199582 | biostudies-literature

Single Amino Acid Repeats in the Proteome World: Structural, Functional, and Evolutionary Insights. | S-EPMC5125637 | biostudies-literature

Search for Highly Divergent Tandem Repeats in Amino Acid Sequences. | S-EPMC8269118 | biostudies-literature

Dishevelled-3 C-terminal His single amino acid repeats are obligate for Wnt5a activation of non-canonical signaling. | S-EPMC3003240 | biostudies-literature

Comprehensive analysis of tandem amino acid repeats from ten angiosperm genomes. | S-EPMC3283746 | biostudies-literature

Mutation patterns of amino acid tandem repeats in the human proteome. | S-EPMC1557989 | biostudies-literature

The origin of conserved protein domains and amino acid repeats via adaptive competition for control over amino acid residues. | S-EPMC3368225 | biostudies-literature

Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins. | S-EPMC2718493 | biostudies-literature

Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. | S-EPMC3402919 | biostudies-literature