Dataset Information

Globally, unrelated protein sequences appear random.

ABSTRACT:

Motivation

To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models.
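The comparison described above can be illustrated with a minimal sketch. It assumes the simplest random model is a zero-order Markov chain, in which each residue is drawn independently from the observed amino acid composition. It counts overlapping word occurrences rather than the non-overlapping clumps used in the study, and the function names (`count_words`, `mc0_expected`, `fold_enrichment`) are hypothetical, not taken from the authors' software.

```python
from collections import Counter
from math import prod


def count_words(seqs, k):
    """Count every overlapping k-residue word in a list of sequences.

    Simplification: the study counts word clumps (independent words);
    this sketch counts all overlapping occurrences instead.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def mc0_expected(seqs, k):
    """Return a function giving a word's expected count under a
    zero-order model (residues independent, observed composition)."""
    residues = Counter("".join(seqs))
    total = sum(residues.values())
    freq = {a: n / total for a, n in residues.items()}
    # Number of word start positions across all sequences.
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)

    def expected(word):
        return positions * prod(freq.get(a, 0.0) for a in word)

    return expected


def fold_enrichment(seqs, k):
    """Observed / expected ratio for every word seen in the sequences."""
    observed = count_words(seqs, k)
    expected = mc0_expected(seqs, k)
    return {w: n / expected(w) for w, n in observed.items()}
```

Under these assumptions, a word with a ratio of 2 or more would count as 2-fold overrepresented relative to the composition-only model.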

Results

While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in alpha-helical secondary structures (but not beta-strands). Five-residue consensus exceptional words are enriched for alpha-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for alpha-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
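A window-shuffled random model can be sketched as follows. The abstract does not give the window length or the exact procedure, so this is only one plausible reading: residues are permuted within consecutive fixed-size windows, which preserves local composition while destroying word order. The function name `window_shuffle` and the default window of 10 residues are assumptions made for illustration.

```python
import random


def window_shuffle(seq, window=10, rng=None):
    """Shuffle residues within consecutive windows of a sequence.

    Assumption: non-overlapping windows of a fixed size; the window
    length used in the study is not stated in the abstract.
    """
    rng = rng or random.Random()
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # permute residues inside this window only
        out.append("".join(chunk))
    return "".join(out)
```

Counting words in sequences shuffled this way gives a baseline that retains local compositional bias, which is consistent with fewer words appearing overrepresented against it than against the composition-only model.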

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Lavelle DT

PROVIDER: S-EPMC2852211 | biostudies-literature | 2010 Feb

REPOSITORIES: biostudies-literature


Publications

Globally, unrelated protein sequences appear random.

Lavelle DT, Pearson WR

Bioinformatics (Oxford, England), 2009-11-30, issue 3


PMID: 19948773

Similar Datasets

Project description: The study of protein-protein interactions (PPIs) is important for understanding cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs, which can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the scale of an entire proteome. Although much progress has been made in this direction, the problem is still far from solved, and more effective approaches are required to overcome the limitations of current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme captures multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as the classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at a precision of 98.91%. Extensive experiments are performed to compare the method with existing sequence-based methods. Experimental results show that the predictor also outperforms several other state-of-the-art predictors on the H. pylori dataset. These results can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach is promising for predicting PPIs and can be a useful tool for future proteomic studies.

Project description: Background: Malaria caused by zoonotic Plasmodium knowlesi is an emerging threat in Eastern Malaysia. Despite demonstrated vector competency, it is unknown whether human-to-human (H-H) transmission is occurring naturally. We sought evidence of drug selection pressure from the antimalarial sulfadoxine-pyrimethamine (SP) as a potential marker of H-H transmission. Methods: The P. knowlesi dihydrofolate-reductase (pkdhfr) gene was sequenced from 449 P. knowlesi malaria cases from Sabah (Malaysian Borneo), and genotypes were evaluated for association with clinical and epidemiological factors. Homology modelling using the pvdhfr template was used to assess the effect of pkdhfr mutations on the pyrimethamine binding pocket. Results: Fourteen non-synonymous mutations were detected, with the most common being T91P (10.2%) and R34L (10.0%), resulting in 21 different genotypes, including the wild-type, 14 single mutants, and six double mutants. One third of the P. knowlesi infections were with pkdhfr mutants; 145 (32%) patients had single mutants and 14 (3%) had double mutants. In contrast, among the 47 P. falciparum isolates sequenced, three pfdhfr genotypes were found, with the double mutant 108N+59R being fixed and the triple mutants 108N+59R+51I and 108N+59R+164L occurring with frequencies of 4% and 8%, respectively. Two non-random spatio-temporal clusters were identified with pkdhfr genotypes. There was no association between pkdhfr mutations and hyperparasitaemia or malaria severity, both hypothesized to be indicators of H-H transmission. The orthologous loci associated with resistance in P. falciparum were not mutated in pkdhfr. Subsequent homology modelling of pkdhfr revealed gene loci 13, 53, 120, and 173 as being critical for pyrimethamine binding; however, there were no mutations at these sites among the 449 P. knowlesi isolates. Conclusion: Although moderate diversity was observed in pkdhfr in Sabah, there was no evidence this reflected selective antifolate drug pressure in humans.
