Dataset Information

Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions.

ABSTRACT: Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of protein structures based on the differences and similarities between different Ramachandran plots. Although statistical methods for modeling angular data of proteins exist, there is still a substantial need for more sophisticated and faster statistical tools to model large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data by using trigonometric splines, which makes it more efficient than existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of the adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel perspective on two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
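
The abstract's idea of summarizing each Ramachandran distribution by a short coefficient vector that respects the circular nature of the angles can be illustrated with a minimal sketch. This is not the authors' trigonometric-spline estimator: it swaps in plain empirical trigonometric (Fourier) moments as a simpler stand-in, and the names `trig_features`, `feature_distance`, and the `order` parameter are hypothetical.

```python
import math

def trig_features(angles, order=2):
    """Empirical trigonometric moments of (phi, psi) pairs given in radians.

    Assumption: a plain Fourier-moment summary, used here only as a simple
    stand-in for the paper's adaptive trigonometric-spline basis.
    """
    n = len(angles)
    feats = []
    for j in range(0, order + 1):
        for k in range(-order, order + 1):
            if j == 0 and k <= 0:
                continue  # skip the constant term and mirrored frequencies
            # Mean cosine and sine at frequency (j, k); both are 2*pi-periodic
            c = sum(math.cos(j * phi + k * psi) for phi, psi in angles) / n
            s = sum(math.sin(j * phi + k * psi) for phi, psi in angles) / n
            feats.extend([c, s])
    return feats

def feature_distance(f1, f2):
    """Euclidean distance between two feature vectors of equal length."""
    return math.dist(f1, f2)

# Hypothetical (phi, psi) samples for two populations, in radians
helix_like = [(-1.05, -0.79), (-1.10, -0.70), (-1.00, -0.85)]
sheet_like = [(-2.09, 2.36), (-2.20, 2.30), (-2.00, 2.45)]
print(feature_distance(trig_features(helix_like), trig_features(sheet_like)))
```

Because every feature is built from sines and cosines of integer multiples of the angles, shifting any angle by a full turn leaves the vector unchanged, which is the property a Euclidean treatment of raw angles would lack.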

SUBMITTER: Najibi SM

PROVIDER: S-EPMC5331158 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature


Publications

Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions.

Najibi SM, Maadooliat M, Zhou L, Huang JZ, Gao X

Computational and Structural Biotechnology Journal, 2017-02-08


PMID: 28280526

Similar Datasets

Project description:Accurately predicting loop structures is important for understanding the functions of many proteins. To obtain loop models with high accuracy, efficiently sampling the loop conformation space to discover reasonable structures is a critical step. In loop conformation sampling, coarse-grain energy (scoring) functions coupled with reduced protein representations are often used to reduce the number of degrees of freedom as well as the sampling time. However, because reduced representations account for many factors only implicitly, coarse-grain scoring functions can be insensitive and inaccurate, which can mislead the sampling process and cause important loop conformations to be missed. In this paper, we present a new computational sampling approach to obtain reasonable loop backbone models, the so-called Pareto optimal sampling (POS) method. The rationale of the POS method is to sample the function space of multiple, carefully selected scoring functions to discover an ensemble of diversified structures yielding Pareto optimality to all sampled conformations. The POS method can efficiently tolerate insensitivity and inaccuracy in individual scoring functions and thereby lead to significant accuracy improvement in loop structure prediction. We apply the POS method to a set of 4-12-residue loop targets using a function space composed of backbone-only Rosetta, distance-scaled finite ideal-gas reference (DFIRE), and a triplet backbone dihedral potential developed in our lab. Our computational results show that in 501 out of 502 targets, the model sets generated by POS contain structure models that are within subangstrom resolution. 
Moreover, the top-ranked models have a root mean square deviation (RMSD) of less than 1 Å in 96.8, 84.1, and 72.2% of the short (4-6 residues), medium (7-9 residues), and long (10-12 residues) targets, respectively, when the all-atom models are generated by local optimization from the backbone models and are ranked by our recently developed Pareto optimal consensus (POC) method. Similar sampling effectiveness can also be found in a set of 13-residue loop targets.
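
The notion of Pareto optimality across several scoring functions can be made concrete with a small sketch. It shows only the non-dominated filter, not the POS sampling procedure itself; the function names and the example scores are hypothetical, and lower scores are assumed to be better.

```python
def dominates(a, b):
    """True if score vector a is no worse than b in every score
    and strictly better in at least one (lower is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(scores):
    """Indices of conformations whose score vectors are not dominated."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]

# Hypothetical scores for four loop conformations under three scoring functions
scores = [
    (1.0, 5.0, 3.0),
    (2.0, 2.0, 2.0),
    (3.0, 6.0, 4.0),  # worse than the first conformation in every score
    (0.5, 7.0, 1.0),
]
print(pareto_front(scores))  # [0, 1, 3]
```

A conformation that one scoring function ranks poorly still survives the filter if another function favors it, which is how a set built this way can tolerate inaccuracy in any single scoring function.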

Project description:Revealing the tertiary structure of proteins holds huge significance as it unveils their vital properties and functions. These intricate three-dimensional configurations comprise diverse interactions including ionic, hydrophobic, and disulfide forces. In certain instances, these structures exhibit missing regions, necessitating the reconstruction of specific segments, thereby resulting in challenges in protein design, which encompasses loop modeling, circular permutation, and interface prediction. To address this problem, we present two pioneering models: pix2pix generative adversarial network (GAN) and PLM-GAN. The pix2pix GAN model is adept at generating and inpainting distance matrices of protein structures, whereas the PLM-GAN model incorporates residual blocks into the U-Net network of the GAN, building upon the foundation of the pix2pix GAN model. To bolster the models' performance, we introduce a novel loss function named the "missing to real regions loss" (LMTR) within the GAN framework. Additionally, we introduce a distinctive approach of pairing two different distance matrices: one representing the native protein structure and the other representing the same structure with a missing region that undergoes changes in each successive epoch. Moreover, we extend the reconstruction of missing regions, encompassing up to 30 amino acids and increase the protein length by 128 amino acids. The evaluation of our pix2pix GAN and PLM-GAN models on a random selection of natural proteins (4ZCB, 3FJB, and 2REZ) demonstrated promising experimental results. Our models constitute significant contributions to addressing intricate challenges in protein structure design. These contributions hold immense potential to propel advancements in protein-protein interactions, drug design, and further innovations in protein engineering. 
Data, code, trained models, examples, and measurements are available on https://github.com/mena01/PLM-GAN-A-Large-Scale-Protein-Loop-Modeling-Using-pix2pix-GAN_.
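
For readers unfamiliar with the representation, the sketch below builds a pairwise distance matrix from residue coordinates and blanks out a missing region, giving the kind of native/masked pair the description mentions. It is an illustration under stated assumptions, not code from the PLM-GAN repository: using one coordinate per residue, filling missing entries with `0.0`, and the names `distance_matrix` and `mask_region` are all hypothetical choices.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates
    (assumed here to be one 3D point per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def mask_region(dmat, start, length, fill=0.0):
    """Copy of dmat with the rows and columns of residues
    start .. start+length-1 replaced by fill (assumed convention)."""
    n = len(dmat)
    missing = set(range(start, min(start + length, n)))
    return [
        [fill if (i in missing or j in missing) else dmat[i][j]
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical coordinates for a three-residue fragment
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
native = distance_matrix(coords)
masked = mask_region(native, start=1, length=1)
for row in masked:
    print(row)
```

Changing `start` between training epochs would produce a different masked matrix for the same native structure, which is one way to read the pairing scheme described above.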

Project description:Gene function curation of the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and help from bioinformatics is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already designed a task of automatic GO concept assignment from a full text. At that time, results were judged far from reaching the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this issue, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning has this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. 
GOCat is available at http://eagl.unige.ch/GOCat/ and http://eagl.unige.ch/GOCat4FT/.
