Dataset Information

Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs.

ABSTRACT: BACKGROUND: One of the strategies for protein function annotation is to search for particular structural motifs that are known to be shared by proteins with a given function. RESULTS: Here, we present a systematic extraction of seven-residue structural motifs from protein loops and explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which simplifies protein structures into one-dimensional sequences, and on advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for motifs in protein loops that are significantly over-represented in SCOP superfamilies. We discovered two types of such motifs: (i) ubiquitous motifs, shared by several superfamilies, and (ii) superfamily-specific motifs, over-represented in a few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they include well-described motifs such as turn, niche, or nest motifs. A comparison between superfamily-specific motifs and the biological annotations of Swiss-Prot reveals that some of them correspond to functional sites involved in binding small ligands, such as ATP/GTP, NAD(P) and SAH/SAM. CONCLUSIONS: Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for the prediction of functional sites and the annotation of uncharacterized proteins.
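The idea of a motif being "over-represented in a superfamily" can be illustrated with a simple enrichment test. The sketch below uses a hypergeometric upper tail as a stand-in; it is not the pattern statistics used in the paper, which are adapted to short sequences, and the function name and arguments are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n words without replacement from N words,
    of which K are occurrences of the motif of interest.

    Stand-in enrichment test (assumption), not the authors' statistic:
      k = occurrences of the motif inside one superfamily
      n = total words observed in that superfamily
      K = occurrences of the motif in the whole loop bank
      N = total words in the whole loop bank
    """
    total = comb(N, n)
    upper = min(n, K)
    # comb() returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total

# Toy numbers: a motif seen 2 times overall among 4 words,
# with both occurrences falling in a superfamily of 2 words.
p_value = hypergeom_upper_tail(k=2, n=2, K=2, N=4)
```

A small p-value would flag the motif as enriched in that superfamily. In practice a correction for multiple testing would be needed, since many motifs and superfamilies are tested at once.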

SUBMITTER: Regad L

PROVIDER: S-EPMC3158783 | biostudies-literature | 2011

REPOSITORIES: biostudies-literature


Publications

Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs.

Regad L, Martin J, Camproux AC

BMC Bioinformatics, 2011-06-20


PMID: 21689388

Similar Datasets

Project description:BACKGROUND: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied, whereas loops are difficult to analyze because they are highly variable in terms of sequence and structure. Due to data sparsity, long loops have rarely been systematically studied. RESULTS: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSD of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (less than 12 residues) and long loops. 
Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. CONCLUSIONS: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might help reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
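The word-extraction step described above can be sketched as a sliding window over structural-letter strings. This is a minimal illustration, not the authors' code. It assumes that consecutive four-residue letters overlap by three residues, so that a seven-residue fragment corresponds to a word of four letters. The function names and the toy letter strings are hypothetical.

```python
from collections import Counter

# Assumption: four overlapping four-residue letters span seven residues.
WORD_LEN = 4

def count_structural_words(loops, word_len=WORD_LEN):
    """Count every overlapping word of `word_len` letters across loop strings."""
    counts = Counter()
    for loop in loops:
        # Loops shorter than one word yield an empty range and are skipped.
        for start in range(len(loop) - word_len + 1):
            counts[loop[start:start + word_len]] += 1
    return counts

def recurrent_words(counts, min_count=30):
    """Keep words observed more than `min_count` times (30 in the description)."""
    return {word: n for word, n in counts.items() if n > min_count}

# Toy structural-letter strings standing in for encoded loops.
toy_loops = ["ABCDAB", "BCDA", "ABC"]
toy_counts = count_structural_words(toy_loops)
toy_recurrent = recurrent_words(toy_counts, min_count=1)
```

Because words are counted per window rather than per loop, short and long loops contribute to the same table, which is what allows loops of all sizes to be analyzed together.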

Project description:Understanding protein-protein interactions (PPIs) at the molecular level may lead to innovations in medicine and biochemistry. The assumption that there are certain "hot spots" on protein surfaces that mediate their interactions with other proteins has led to a search for specific sequences involved in protein-protein contacts. In this work, we analyze sequential amino acid motifs, both at the single-motif and at the motif-motif level, across a large and diverse dataset of biologically relevant protein-protein interfaces retrieved from the PDB, comparing their presence at interfaces and surfaces in a statistically rigorous manner. At the single-motif level, our results indicate statistically significant over-presence of hydrophobic and in particular aromatic residues and under-presence of charged residues at protein-protein interfaces. Certain PPI-mediating motifs reported in the literature (e.g., the Tyrosine-based Motif YxxΦ and the PDZ-Binding Motif X-S/T-X-V/I) were confirmed to have a significant presence at interfaces. In addition, of the multiple PPI-mediating motifs reported in the ELM database that are present in our dataset, half were confirmed to have a statistically significant presence at interfaces whereas the others were not. At the single residue, motif-motif level, Cysteine-Cysteine contacts were found to be the most abundant ones, followed by interactions involving aromatic/hydrophobic residues. Top-ranking, longer motif-motif pairs show a predominance of Leucine and aromatic residues. Finally, preliminary energy calculations (using the MM/GBSA procedure) indicate a partial correlation between the probability that a motif pair is part of a protein-protein interface and the strength of the interactions between the motifs. In conclusion, this study points to specific characteristics of motifs that have a higher probability of mediating protein-protein interactions. 
Prominent motifs identified in this study may be used in the future as possible components in protein engineering.

Project description:BACKGROUND: A large number of PROSITE patterns select false positives and/or miss known true positives. It is possible that, at least in some cases, the weak specificity and/or sensitivity of a pattern is due to the fact that one or more functional and/or structural key residues are not represented in the pattern. Multiple sequence alignments are commonly used to build functional sequence patterns. If residues structurally conserved in proteins sharing a function cannot be aligned in a multiple sequence alignment, they are likely to be missed in a standard pattern construction procedure. RESULTS: Here we present a new procedure aimed at improving the sensitivity and/or specificity of poorly performing patterns. The procedure can be summarised as follows: (1) residues structurally conserved in different proteins that are true positives for a pattern are identified by means of a computational technique and by visual inspection; (2) the sequence positions of the structurally conserved residues falling outside the pattern are used to build extended sequence patterns; (3) the extended patterns are optimised on the SWISS-PROT database for their sensitivity and specificity. The method was applied to eight PROSITE patterns. Whenever structurally conserved residues are found in the surface region close to the pattern (seven out of eight cases), the addition of information inferred from structural analysis is shown to improve pattern selectivity and in some cases selectivity and sensitivity as well. In some of the cases considered the procedure allowed the identification of functionally interesting residues, whose biological role is also discussed. CONCLUSIONS: Our method can be applied to any type of functional motif or pattern (not only PROSITE ones) which is not able to select all and only the true positive hits and for which at least two true positive structures are available. 
The computational technique for the identification of structurally conserved residues is already available on request and will soon be accessible on our web server. The procedure is intended for use by pattern database curators and by scientists interested in a specific protein family for which no specific or selective patterns are yet available.

Project description:Influenza A viruses (IAV) are responsible for recurrent influenza epidemics and occasional devastating pandemics in humans and animals. They belong to the Orthomyxoviridae family and their genome consists of eight (-) sense viral RNA (vRNA) segments of different lengths coding for at least 11 viral proteins. A heterotrimeric polymerase complex is bound to the promoter consisting of the 13 5'-terminal and 12 3'-terminal nucleotides of each vRNA, while internal parts of the vRNAs are associated with multiple copies of the viral nucleoprotein (NP), thus forming ribonucleoproteins (vRNP). Transcription and replication of vRNAs result in viral mRNAs (vmRNAs) and complementary RNAs (cRNAs), respectively. Complementary RNAs are the exact positive copies of vRNAs; they also form ribonucleoproteins (cRNPs) and are intermediate templates in the vRNA amplification process. On the contrary, vmRNAs have a 5' cap snatched from cellular mRNAs and a 3' polyA tail, both gained by the viral polymerase complex. Hence, unlike vRNAs and cRNAs, vmRNAs do not have a terminal promoter able to recruit the viral polymerase. Furthermore, synthesis of at least two viral proteins requires vmRNA splicing. Except for extensive analysis of the viral promoter structure and function and a few, mostly bioinformatics, studies addressing the vRNA and vmRNA structure, structural studies of the influenza A vRNAs, cRNAs, and vmRNAs are still in their infancy. The recent crystal structures of the influenza polymerase heterotrimeric complex drastically improved our understanding of the replication and transcription processes. The vRNA structure has been mainly studied in vitro using RNA probing, but its structure has been very recently studied within native vRNPs using crosslinking and RNA probing coupled to next generation RNA sequencing. 
Concerning vmRNAs, most studies focused on the segment M and NS splice sites and several structures initially predicted by bioinformatics analysis have now been validated experimentally and their role in the viral life cycle demonstrated. This review aims to compile the structural motifs found in the different RNA classes (vRNA, cRNA, and vmRNA) of influenza viruses and their function in the viral replication cycle.

Project description:BACKGROUND: Bacterial populations are highly successful at colonizing new habitats and adapting to changing environmental conditions, partly due to their capacity to evolve novel virulence and metabolic pathways in response to stress conditions and to shuffle them by horizontal gene transfer (HGT). A common theme in the evolution of new functions consists of gene duplication followed by functional divergence. UlaG, a unique manganese-dependent metallo-β-lactamase (MBL) enzyme involved in L-ascorbate metabolism by commensal and symbiotic enterobacteria, provides a model for the study of the emergence of new catalytic activities from the modification of an ancient fold. Furthermore, UlaG is the founding member of the so-called UlaG-like (UlaGL) protein family, a recently established and poorly characterized family comprising divalent (and perhaps trivalent) metal-binding MBLs that catalyze transformations on phosphorylated sugars and nucleotides. RESULTS: Here we combined protein structure-guided and sequence-only molecular phylogenetic analyses to dissect the molecular evolution of UlaG and to study its phylogenomic distribution, its relatedness with present-day UlaGL protein sequences and functional conservation. Phylogenetic analyses indicate that UlaGL sequences are present in Bacteria and Archaea, with bona fide orthologs found mainly in mammalian and plant-associated Gram-negative and Gram-positive bacteria. The incongruence between the UlaGL tree and known species trees indicates exchange by HGT and suggests that the UlaGL-encoding genes provided a growth advantage under changing conditions. Our search for more distantly related protein sequences aided by structural homology has uncovered that UlaGL sequences have a common evolutionary origin with present-day RNA processing and metabolizing MBL enzymes widespread in Bacteria, Archaea, and Eukarya. 
This observation suggests an ancient origin for the UlaGL family within the broader trunk of the MBL superfamily by duplication, neofunctionalization and fixation. CONCLUSIONS: Our results suggest that the forerunner of UlaG was present as an RNA metabolizing enzyme in the last common ancestor, and that the modern descendants of that ancestral gene have a wide phylogenetic distribution and functional roles. We propose that the UlaGL family evolved new metabolic roles among bacterial and possibly archaeal phyla in the setting of a close association with metazoans, such as in the mammalian gastrointestinal tract or in animal and plant pathogens, as well as in environmental settings. Accordingly, the major evolutionary forces shaping the UlaGL family include vertical inheritance and lineage-specific duplication and acquisition of novel metabolic functions, followed by HGT and numerous lineage-specific gene loss events.
