Dataset Information

An information gain-based approach for evaluating protein structure models.

ABSTRACT: For three decades now, knowledge-based scoring functions that operate through the "potential of mean force" (PMF) approach have continuously proven useful for studying protein structures. Although these statistical potentials are not to be confused with their physics-based counterparts of the same name (i.e. PMFs obtained by molecular dynamics simulations), their particular success in assessing the native-like character of protein structure predictions has led authors to consider the computed scores as approximations of the free energy. However, this physical justification has been a matter of controversy since the beginning. Alternative interpretations based on Bayes' theorem have been proposed, but the misleading formalism that invokes the inverse Boltzmann law remains recurrent in the literature. In this article, we present a conceptually new method for ranking protein structure models by quality, which is (i) independent of any physics-based explanation and (ii) relevant to statistics and to a general definition of information gain. The theoretical development described in this study provides new insights into how statistical PMFs work, in comparison with our approach. To prove the concept, we have built interatomic distance-dependent scoring functions, based on the former and new equations, and compared their performance on an independent benchmark of 60,000 protein structures. The results demonstrate that our new formalism outperforms statistical PMFs in evaluating the quality of protein structural decoys. Therefore, this original type of score offers a possibility to improve the success of statistical PMFs in the various fields of structural biology where they are applied. The open-source code is available for download at https://gitlab.rpbs.univ-paris-diderot.fr/src/ig-score.
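To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation and not the equations of the article (which this record does not reproduce). It assumes a conventional distance-dependent statistical PMF of the form -ln(p_obs/p_ref), in units of kT, and uses the Kullback-Leibler divergence as one general definition of information gain. The function names, the 1 Å bin width, the pseudocount, and the toy distances are all hypothetical choices made for illustration.

```python
import math


def histogram(distances, bin_width=1.0, n_bins=15, pseudocount=1.0):
    """Normalized distance histogram with additive smoothing (assumed binning)."""
    counts = [pseudocount] * n_bins
    for d in distances:
        idx = int(d // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def pmf_score(model_distances, p_obs, p_ref, bin_width=1.0):
    """Conventional statistical PMF: sum of -ln(p_obs / p_ref) over the
    distances found in a model. Lower values mean more native-like."""
    score = 0.0
    for d in model_distances:
        idx = int(d // bin_width)
        if 0 <= idx < len(p_obs):
            score -= math.log(p_obs[idx] / p_ref[idx])
    return score


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats: the expected
    log-ratio under p, one general notion of information gain."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data (hypothetical): distances seen in native structures
# versus distances drawn from a reference state.
NATIVE = [3.8, 4.1, 4.5, 5.0, 5.2, 6.1]
REFERENCE = [2.5, 4.0, 6.5, 8.0, 10.2, 12.9]
P_OBS = histogram(NATIVE)
P_REF = histogram(REFERENCE)
```

With these toy distributions, `pmf_score([4.2, 5.1], P_OBS, P_REF)` is negative (the distances are enriched in the native set), `pmf_score([8.3, 10.5], P_OBS, P_REF)` is positive, and `kl_divergence(P_OBS, P_REF)` is strictly positive because the two distributions differ. The per-distance PMF term and the KL divergence are built from the same log-ratio; the latter averages it over the observed distribution.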

SUBMITTER: Postic G

PROVIDER: S-EPMC7431362 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

An information gain-based approach for evaluating protein structure models.

Guillaume Postic, Nathalie Janel, Pierre Tufféry, Gautier Moroy

Computational and Structural Biotechnology Journal, 2020-08-18

PMID: 32837711

Similar Datasets

Project description: The increasing use of species distribution modeling (SDM) has raised new concerns regarding the inaccuracies, misunderstanding, and misuses of this important tool. One of those possible pitfalls, collinearity among environmental predictors, is assumed to be an important source of model uncertainty, although it has not been subjected to a detailed evaluation in recent SDM studies. It is expected that collinearity will increase uncertainty in model parameters and decrease statistical power. Here we use a virtual species approach to compare models built using subsets of PCA-derived variables with models based on the original highly correlated climate variables. Moreover, we evaluated whether modelling algorithms and species data characteristics generate models with varying sensitivity to collinearity. As expected, collinearity among predictors decreases the efficiency and increases the uncertainty of species distribution models. Nevertheless, the intensity of the effect varied according to the algorithm properties: more complex procedures behaved better than simple envelope models. This may support the claim that complex models such as Maxent take advantage of existing collinearity in finding the best set of parameters. The interaction of the different factors with species characteristics (centroid and tolerance in environmental space) highlighted the importance of the so-called "idiosyncrasy in species responses" to model efficiency, but differences in prevalence may represent a better explanation. However, even models with low accuracy in predicting the suitability of individual cells may provide meaningful information on the estimation of range size, a key species trait for macroecological studies. We concluded that the use of PCA-derived variables is advised both to control the negative effects of collinearity and as a more objective solution for the problem of variable selection in studies dealing with a large number of species with heterogeneous responses to environmental variables.

Project description: Much structural information is encoded in the internal distances; a distance matrix-based approach can be used to predict protein structure and dynamics, and for structural refinement. Our approach is based on the square distance matrix $D = [r_{ij}^2]$ containing all square distances between residues in proteins. This distance matrix contains more information than the contact matrix $C$, which has elements of either 0 or 1 depending on whether the distance $r_{ij}$ is greater or less than a cutoff value $r_{\text{cutoff}}$. We have performed spectral decomposition of the distance matrices, $D = \sum_k \lambda_k v_k v_k^T$, in terms of eigenvalues $\lambda_k$ and the corresponding eigenvectors $v_k$, and found that it contains at most five nonzero terms. A dominant eigenvector is proportional to $r^2$, the square distance of points from the center of mass, with the next three being the principal components of the system of points. By predicting $r^2$ from the sequence we can approximate a distance matrix of a protein with an expected RMSD value of about 7.3 Å, and by combining it with the prediction of the first principal component we can improve this approximation to 4.0 Å. We can also explain the role of hydrophobic interactions for the protein structure, because $r$ is highly correlated with the hydrophobic profile of the sequence. Moreover, $r$ is highly correlated with several sequence profiles which are useful in protein structure prediction, such as contact number, the residue-wise contact order (RWCO) or mean square fluctuations (i.e. crystallographic temperature factors). We have also shown that the next three components are related to spatial directionality of the secondary structure elements, and they may also be predicted from the sequence, improving overall structure prediction.
We have also shown that the large number of available HIV-1 protease structures provides a remarkable sampling of conformations, which can be viewed as direct structural information about the dynamics. After structure matching, we apply principal component analysis (PCA) to obtain the important apparent motions for both bound and unbound structures. There are significant similarities between the first few key motions and the first few low-frequency normal modes calculated from a static representative structure with an elastic network model (ENM) that is based on the contact matrix C (related to D), strongly suggesting that the variations among the observed structures and the corresponding conformational changes are facilitated by the low-frequency, global motions intrinsic to the structure. Similarities are also found when the approach is applied to an NMR ensemble, as well as to atomic molecular dynamics (MD) trajectories. Thus, a sufficiently large number of experimental structures can directly provide important information about protein dynamics, but ENM can also provide a similar sampling of conformations. Finally, we use distance constraints from databases of known protein structures for structure refinement. We use the distributions of distances of various types in known protein structures to obtain the most probable ranges or the mean-force potentials for the distances. We then impose these constraints on structures to be refined or include the mean-force potentials directly in the energy minimization so that more plausible structural models can be built. We successfully used this approach in 2006 in the CASPR structure refinement (http://predictioncenter.org/caspR).
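The statement in this description that the spectral decomposition of the square distance matrix has at most five nonzero terms can be checked numerically. The sketch below is an illustration only, not code from that project: it builds the matrix of squared distances for random 3D points (hypothetical integer coordinates, chosen so the arithmetic stays exact) and computes its rank with the standard library. For a real symmetric matrix the rank equals the number of nonzero eigenvalues, so a rank of at most 5 corresponds to at most five nonzero terms in the decomposition.

```python
import random
from fractions import Fraction


def squared_distance_matrix(points):
    """D[i][j] = squared Euclidean distance between points i and j."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def matrix_rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate the column from all rows below the pivot row.
        for r in range(rank + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def random_points(n, seed=0):
    """n pseudo-random 3D points with integer coordinates (toy data)."""
    rng = random.Random(seed)
    return [tuple(rng.randint(-20, 20) for _ in range(3)) for _ in range(n)]
```

Calling `matrix_rank(squared_distance_matrix(random_points(30)))` returns a value no greater than 5 even though the matrix is 30 by 30. The bound follows from writing $D = a\,\mathbf{1}^T + \mathbf{1}\,a^T - 2XX^T$, where $a$ holds the squared norms of the points and $X$ their coordinates: the first two terms contribute rank at most 2 and the last at most 3.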

Project description: Background: There is a flora of health care information models but no consensus on which to use. This leads to poor information sharing and duplicate modelling work. The amount and type of differences between models has, to our knowledge, not been evaluated. Objective: This work aims to explore how information structured with various information models differ in practice. Our hypothesis is that differences between information models are overestimated. This work will also assess the usability of competency questions as a method for evaluation of information models within health care. Methods: In this study, 4 information standards, 2 standards for secondary use, and 2 electronic health record systems were included as material. Competency questions were developed for a random selection of recommendations from a clinical guideline. The information needed to answer the competency questions was modelled according to each included information model, and the results were analyzed. Differences in structure and terminology were quantified for each combination of standards. Results: In this study, 36 competency questions were developed and answered. In general, similarities between the included information models were larger than the differences. The demarcation between information model and terminology was overall similar; on average, 45% of the included structures were identical between models. Choices of terminology differed within and between models; on average, 11% was usable in interaction with each other. The information models included in this study were able to represent most information required for answering the competency questions. Conclusions: Different but same same; in practice, different information models structure much information in a similar fashion. To increase interoperability within and between systems, it is more important to move toward structuring information with any information model rather than finding or developing a perfect information model. Competency questions are a feasible way of evaluating how information models perform in practice.
