Dataset Information

Improved mutation tagging with gene identifiers applied to membrane protein stability prediction.

ABSTRACT:

Background

The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets.

Results

We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Vice versa, we could show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, of which none had been annotated in the UniProt or PDB database. In 71% reported phenotypes were in compliance with the model predictions, supporting a relation between mutations and stability issues in membrane proteins.

Conclusion

We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.

SUBMITTER: Winnenburg R

PROVIDER: S-EPMC2745585 | biostudies-literature | 2009 Aug

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Improved mutation tagging with gene identifiers applied to membrane protein stability prediction.

Winnenburg Rainer R Plake Conrad C Schroeder Michael M

BMC bioinformatics 20090827

<h4>Background</h4>The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets.<h4>Results</h4>We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchma ...[more]

PMID: 19758467

Similar Datasets

Project description:BackgroundOur objective was to assess whether modifications to a customized targeted RNA sequencing (RNAseq) assay to include unique molecular identifiers (UMIs) that collapse read counts to their source mRNA counts would improve quantification of transcripts from formalin-fixed paraffin-embedded (FFPE) tumor tissue samples. The assay (SET4) includes signatures that measure hormone receptor and PI3-kinase related transcriptional activity (SETER/PR and PI3Kges), and measures expression of selected activating point mutations and key breast cancer genes.MethodsModifications included steps to introduce eight nucleotides-long UMIs during reverse transcription (RT) in bulk solution, followed by polymerase chain reaction (PCR) of labeled cDNA in droplets, with optimization of the polymerase enzyme and reaction conditions. We used Lin's concordance correlation coefficient (CCC) to measure concordance, including precision (Rho) and accuracy (Bias), and nonparametric tests (Wilcoxon, Levene's) to compare the modified (NEW) SET4 assay to the original (OLD) SET4 assay and to whole transcriptome RNAseq using RNA from matched fresh frozen (FF) and FFPE samples from 12 primary breast cancers.ResultsThe modified (NEW) SET4 assay measured single transcripts (p< 0.001) and SETER/PR (p=0.002) more reproducibly in technical replicates from FFPE samples. The modified SET4 assay was more precise for measuring single transcripts (Rho 0.966 vs 0.888, p< 0.01) but not multigene expression signatures SETER/PR (Rho 0.985 vs 0.968) or PI3Kges (Rho 0.985 vs 0.946) in FFPE, compared to FF samples. It was also more precise than wtRNAseq of FFPE for measuring transcripts (Rho 0.986 vs 0.934, p< 0.001) and SETER/PR (Rho 0.993 vs 0.915, p=0.004), but not PI3Kges (Rho 0.988 vs 0.945, p=0.051). Accuracy (Bias) was comparable between protocols. Two samples carried a PIK3CA mutation, and measurements of transcribed mutant allele fraction was similar in FF and FFPE samples and appeared more precise with the modified SET4 assay. Amplification efficiency (reads per UMI) was consistent in FF and FFPE samples, and close to the theoretically expected value, when the library size exceeded 400,000 aligned reads.ConclusionsModifications to the targeted RNAseq protocol for SET4 assay significantly increased the precision of UMI-based and reads-based measurements of individual transcripts, multi-gene signatures, and mutant transcript fraction, particularly with FFPE samples.

Project description:MotivationMutations in human genome are mainly through single nucleotide polymorphism, some of which can affect stability and function of proteins, causing human diseases. Several methods have been proposed to predict the effect of mutations on protein stability; but most require features from experimental structure. Given the fast progress in protein structure prediction, this work explores the possibility to improve the mutation-induced stability change prediction using low-resolution structure modeling.ResultsWe developed a new method (STRUM) for predicting stability change caused by single-point mutations. Starting from wild-type sequences, 3D models are constructed by the iterative threading assembly refinement (I-TASSER) simulations, where physics- and knowledge-based energy functions are derived on the I-TASSER models and used to train STRUM models through gradient boosting regression. STRUM was assessed by 5-fold cross validation on 3421 experimentally determined mutations from 150 proteins. The Pearson correlation coefficient (PCC) between predicted and measured changes of Gibbs free-energy gap, ΔΔG, upon mutation reaches 0.79 with a root-mean-square error 1.2 kcal/mol in the mutation-based cross-validations. The PCC reduces if separating training and test mutations from non-homologous proteins, which reflects inherent correlations in the current mutation sample. Nevertheless, the results significantly outperform other state-of-the-art methods, including those built on experimental protein structures. Detailed analyses show that the most sensitive features in STRUM are the physics-based energy terms on I-TASSER models and the conservation scores from multiple-threading template alignments. However, the ΔΔG prediction accuracy has only a marginal dependence on the accuracy of protein structure models as long as the global fold is correct. These data demonstrate the feasibility to use low-resolution structure modeling for high-accuracy stability change prediction upon point mutations.Availability and implementationhttp://zhanglab.ccmb.med.umich.edu/STRUM/ CONTACT: qiang@suda.edu.cn and zhng@umich.eduSupplementary informationSupplementary data are available at Bioinformatics online.

Dataset Information

Improved mutation tagging with gene identifiers applied to membrane protein stability prediction.

Background

Results

Conclusion

Publications

Improved mutation tagging with gene identifiers applied to membrane protein stability prediction.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets