Dataset Information

Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

ABSTRACT: In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignments from which coevolution-based features are derived; these features are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66.
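The "top L/5 long-range precision" metric quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function name, the input layout (a dict of scored residue pairs and a set of true contacts), and the default sequence-separation cutoff of 24 residues for "long-range" are assumptions chosen for illustration; the abstract itself does not state the cutoff.

```python
def top_l5_long_range_precision(scores, true_contacts, seq_len, min_sep=24):
    """Precision of the top L/5 long-range predicted contacts.

    scores        -- dict mapping (i, j) residue index pairs to a predicted
                     contact score (higher means more confident)
    true_contacts -- set of (i, j) pairs that are contacts in the native structure
    seq_len       -- L, the length of the protein or domain
    min_sep       -- minimum |i - j| for a pair to count as long-range
                     (assumed default; not taken from the abstract)
    """
    # Keep only pairs that are far enough apart in sequence.
    long_range = [
        (score, pair)
        for pair, score in scores.items()
        if abs(pair[1] - pair[0]) >= min_sep
    ]
    # Rank by predicted score, most confident first.
    long_range.sort(key=lambda item: item[0], reverse=True)

    # Evaluate the top L/5 predictions (at least one).
    k = max(1, seq_len // 5)
    top = long_range[:k]
    if not top:
        return 0.0

    hits = sum(1 for _, pair in top if pair in true_contacts)
    return hits / len(top)


# Toy example with a short sequence and a reduced separation cutoff.
toy_scores = {(0, 9): 0.9, (1, 8): 0.8, (2, 7): 0.1, (0, 1): 0.99}
toy_truth = {(0, 9), (2, 7)}
toy_precision = top_l5_long_range_precision(toy_scores, toy_truth, seq_len=10, min_sep=5)
```

In the toy example, L = 10 gives two evaluated predictions, (0, 9) and (1, 8), of which one is a true contact. The short-range pair (0, 1) is excluded despite its high score.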

SUBMITTER: Adhikari B

PROVIDER: S-EPMC5820155 | biostudies-literature | 2018 Mar

REPOSITORIES: biostudies-literature


Publications

Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

Adhikari B, Hou J, Cheng J

Proteins, 2017-10-31


PMID: 29047157

Similar Datasets

Project description: Background: While the conserved positions of a multiple sequence alignment (MSA) are clearly of interest, non-conserved positions can also be important because, for example, destabilizing effects at one position can be compensated by stabilizing effects at another position. Different methods have been developed to recognize the evolutionary relationship between amino acid sites, and to disentangle functional/structural dependencies from historical/phylogenetic ones. Methodology/principal findings: We have used two complementary approaches to test the efficacy of these methods. In the first approach, we have used a new program, MSAvolve, for the in silico evolution of MSAs, which records a detailed history of all covarying positions, and builds a global coevolution matrix as the accumulated sum of individual matrices for the positions forced to co-vary, the recombinant coevolution, and the stochastic coevolution. We have simulated over 1600 MSAs for 8 protein families, which reflect sequences of different sizes and proteins with widely different functions. The calculated coevolution matrices were compared with the coevolution matrices obtained for the same evolved MSAs with different coevolution detection methods. In a second approach we have evaluated the capacity of the different methods to predict close contacts in the representative X-ray structures of an additional 150 protein families using only experimental MSAs. Conclusions/significance: Methods based on the identification of global correlations between pairs were found to be generally superior to methods based only on local correlations in their capacity to identify coevolving residues using either simulated or experimental MSAs. However, the significant variability in the performance of different methods with different proteins suggests that the simulation of MSAs that replicate the statistical properties of the experimental MSA can be a valuable tool to identify the coevolution detection method that is most effective in each case.
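As a concrete picture of what a "local" pairwise correlation looks like, the sketch below computes the mutual information between two columns of an MSA. It is a generic textbook measure, not the MSAvolve program or any specific method evaluated in the study. The function name and the representation of the alignment as a list of equal-length strings are assumptions for illustration.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    msa -- list of aligned sequences, all of the same length (assumed layout)
    """
    n = len(msa)
    # Observed counts of single residues and residue pairs.
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        # Contribution of this residue pair to the total mutual information.
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Two toy alignments: perfectly covarying columns vs. independent columns.
covarying = ["AC", "AC", "GT", "GT"]
independent = ["AC", "AT", "GC", "GT"]
mi_covarying = column_mutual_information(covarying, 0, 1)
mi_independent = column_mutual_information(independent, 0, 1)
```

Because the score for a pair depends only on those two columns, it cannot separate direct coupling from correlations propagated through other positions, which is the limitation that global methods aim to address.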

Project description:Background: Long non-coding RNAs (lncRNAs) play an important role in the immune regulation of gastric cancer (GC). However, the clinical application value of immune-related lncRNAs has not been fully developed. It is of great significance to overcome the challenges of prognostic prediction and classification of gastric cancer patients based on the current study. Methods: In this study, the R package ImmLnc was used to obtain immune-related lncRNAs of The Cancer Genome Atlas Stomach Adenocarcinoma (TCGA-STAD) project, and univariate Cox regression analysis was performed to find prognostic immune-related lncRNAs. A total of 117 combinations based on 10 algorithms were integrated to determine the immune-related lncRNA prognostic model (ILPM). According to the ILPM, the least absolute shrinkage and selection operator (LASSO) regression was employed to find the major lncRNAs and develop the risk model. ssGSEA, CIBERSORT algorithm, the R package maftools, pRRophetic, and clusterProfiler were employed for measuring the proportion of immune cells among risk groups, genomic mutation difference, drug sensitivity analysis, and pathway enrichment score. Results: A total of 321 immune-related lncRNAs were found, and there were 26 prognostic immune-related lncRNAs. According to the ILPM, 18 of 26 lncRNAs were selected and the risk score (RS) developed by the 18-lncRNA signature had good strength in the TCGA training set and Gene Expression Omnibus (GEO) validation datasets. Patients were divided into high- and low-risk groups according to the median RS, and the low-risk group had a better prognosis, tumor immune microenvironment, and tumor signature enrichment score and a higher metabolism, frequency of genomic mutations, proportion of immune cell infiltration, and antitumor drug resistance. Furthermore, 86 differentially expressed genes (DEGs) between high- and low-risk groups were mainly enriched in immune-related pathways. 
Conclusion: The ILPM developed based on 26 prognostic immune-related lncRNAs can help in predicting the prognosis of patients suffering from gastric cancer. Precision medicine can be effectively carried out by dividing patients into high- and low-risk groups according to the RS.
