Causal interactions from proteomic profiles: Molecular data meet pathway knowledge.
ABSTRACT: We present a computational method to infer causal mechanisms in cell biology by analyzing changes in high-throughput proteomic profiles on the background of prior knowledge captured in biochemical reaction knowledge bases. The method mimics a biologist's traditional approach of explaining changes in data using prior knowledge but does this at the scale of hundreds of thousands of reactions. This is a specific example of how to automate scientific reasoning processes and illustrates the power of mapping from experimental data to prior knowledge via logic programming. The identified mechanisms can explain how experimental and physiological perturbations, propagating in a network of reactions, affect cellular responses and their phenotypic consequences. Causal pathway analysis is a powerful and flexible discovery tool for a wide range of cellular profiling data types and biological questions. The automated causation inference tool, as well as the source code, are freely available at http://causalpath.org.
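The core idea of explaining data changes with prior knowledge can be illustrated with a toy sketch. It is not the CausalPath implementation: the protein names, the signed-relation format, and the function `explainable_relations` are all hypothetical, and the only rule applied is sign consistency between a prior relation and the observed changes at its two ends.

```python
# Hypothetical toy data (not from the paper): signed prior-knowledge relations
# and observed changes. +1 = activates / increased, -1 = inhibits / decreased.
PRIOR_RELATIONS = [
    ("KinaseA", "ProteinB", +1),
    ("KinaseA", "ProteinC", -1),
    ("ProteinB", "ProteinD", +1),
]
OBSERVED_CHANGES = {"KinaseA": +1, "ProteinB": +1, "ProteinC": +1, "ProteinD": -1}


def explainable_relations(relations, changes):
    """Keep relations whose sign agrees with the observed changes at both ends."""
    explained = []
    for source, target, effect in relations:
        if source not in changes or target not in changes:
            continue  # one end was not measured, so the relation cannot be evaluated
        # An activating relation predicts the same direction of change downstream,
        # an inhibiting relation predicts the opposite direction.
        if changes[source] * effect == changes[target]:
            explained.append((source, target, effect))
    return explained


causal_candidates = explainable_relations(PRIOR_RELATIONS, OBSERVED_CHANGES)
```

In this toy case only the KinaseA to ProteinB relation is kept, because the other two relations predict a direction of change that contradicts the data.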
Project description: Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the common practice of using a competitive null, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, the sample covariance may not be a precise estimate if the sample size is very limited, which is usually the case for data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed from the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways with sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication.
We implemented the T2-statistic in an R package, T2GA, which is available at https://github.com/roqe/T2GA.
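A multivariate statistic built on a knowledge-derived covariance matrix can be sketched as a quadratic form. The sketch below is not the T2GA code, and the exact construction used by the package is not given in the abstract. It assumes, for illustration only, that each protein in a pathway contributes a standardized score, that the covariance matrix has ones on the diagonal and interaction confidence scores off the diagonal, and that the statistic is z' Sigma^-1 z. The function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def t2_statistic(z_scores, covariance):
    """Quadratic form z' * inverse(covariance) * z (assumed form of the statistic)."""
    weighted = solve_linear_system(covariance, z_scores)
    return sum(z * w for z, w in zip(z_scores, weighted))


# Two proteins that both changed by one standard unit.
independent = t2_statistic([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
interacting = t2_statistic([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
```

With these toy inputs the statistic is smaller when the two proteins are assumed to interact, since a joint change in associated proteins counts as less than two independent pieces of evidence. This is one way to see why ignoring associations can inflate significance.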
Project description: The integration of multi-omic data sets can provide unique information about molecular processes in a cell. Despite the development of many tools to extract information from such data sets, there are limited strategies to systematically extract mechanistic hypotheses from them. Here we present COSMOS (Causal Oriented Search of Multi-Omic Space), a method that integrates cell signaling pathway, transcriptomic, and metabolomic data sets. COSMOS combines extensive prior knowledge of interactions between biomolecules with computational methods to estimate activities of transcription factors and kinases, as well as network-level causal reasoning. COSMOS can provide mechanistic explanations for experimental observations across multiple omic data sets. We applied COSMOS to a dataset comprising transcriptomic, phosphoproteomic, and metabolomic data from nine renal cell carcinoma patients, comparing healthy, non-affected kidney tissue with kidney cancer. We used COSMOS to generate novel hypotheses, such as the impact of CDK7 on nucleoside metabolism and its influence on citrulline production, that we validated experimentally. We expect that our freely available method will be broadly useful to extract mechanistic insights from multi-omic studies.
Project description: Background: Gene expression profiling and other genome-scale measurement technologies provide comprehensive information about molecular changes resulting from a chemical or genetic perturbation, or disease state. A critical challenge is the development of methods to interpret these large-scale data sets to identify specific biological mechanisms that can provide experimentally verifiable hypotheses and lead to the understanding of disease and drug action. Results: We present a detailed description of Reverse Causal Reasoning (RCR), a reverse engineering methodology to infer mechanistic hypotheses from molecular profiling data. This methodology requires prior knowledge in the form of small networks that causally link a key upstream controller node representing a biological mechanism to downstream measurable quantities. These small directed networks are generated from a knowledge base of literature-curated qualitative biological cause-and-effect relationships expressed as a network. The small mechanism networks are evaluated as hypotheses to explain observed differential measurements. We provide a simple implementation of this methodology, Whistle, specifically geared towards the analysis of gene expression data and using prior knowledge expressed in Biological Expression Language (BEL). We present the Whistle analyses for three transcriptomic data sets using a publicly available knowledge base. The mechanisms inferred by Whistle are consistent with the expected biology for each data set. Conclusions: Reverse Causal Reasoning yields mechanistic insights into the interpretation of gene expression profiling data that are distinct from and complementary to the results of analyses using ontology or pathway gene sets. This reverse engineering algorithm provides an evidence-driven approach to the development of models of disease, drug action, and drug toxicity.
Project description: Background: β-Crystallins are structural proteins maintaining eye lens transparency and opacification. Previous work demonstrated that dimerization of both βA3 and βB2 crystallins (βA3 and βB2) involves endothermic enthalpy of association (∼8 kcal/mol) mediated by hydrophobic interactions. Methodology/principal findings: Thermodynamic profiles of the associations of dimeric βA3 and βB1 and tetrameric βB1/βA3 were measured using sedimentation equilibrium. The homo- and heteromolecular associations of βB1 crystallin are dominated by exothermic enthalpy (-13.3 and -24.5 kcal/mol, respectively). Conclusions/significance: Global thermodynamics of βB1 interactions suggest a role in the formation of stable protein complexes in the lens via specific van der Waals contacts, hydrogen bonds and salt bridges, whereas those β-crystallins which associate by predominately hydrophobic forces participate in weaker protein associations.
Project description: Previously, we have shown that the aggregation of RNA-level gene expression profiles into quantitative molecular pathway activation metrics results in smaller batch effects and better agreement between different experimental platforms. Here, we investigate whether pathway-level data analysis provides any advantage when comparing transcriptomic and proteomic data. We compare the paired proteomic and transcriptomic gene expression and pathway activation profiles obtained for the same human cancer biosamples in The Cancer Genome Atlas (TCGA) and the NCI Clinical Proteomic Tumor Analysis Consortium (CPTAC) projects, for a total of 755 samples of glioblastoma, breast, liver, lung, ovarian, pancreatic, and uterine cancers. In a CPTAC assay, expression levels of 15,112 protein-coding genes were profiled using the Thermo QE series of mass spectrometers. In TCGA, RNA expression levels of the same genes were obtained using the Illumina HiSeq 4000 engine for the same biosamples. At the gene level, absolute gene expression values are compared, whereas pathway-level comparisons are made between the pathway activation levels (PALs) calculated using average sample-normalized transcriptomic and proteomic profiles. We observed remarkably different average correlations between the primary RNA and protein expression data for different cancer types: Spearman Rho between 0.017 (p = 1.7 × 10^−13) and 0.27 (p < 2.2 × 10^−16). However, at the pathway level we detected overall statistically significantly higher correlations: averaged Rho between 0.022 (p < 2.2 × 10^−16) and 0.56 (p < 2.2 × 10^−16). Thus, we conclude that data analysis at the PAL level yields results of greater similarity when comparing high-throughput RNA and protein expression profiles.
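The Spearman correlation reported above is the Pearson correlation computed on ranks. The following is a minimal standard-library sketch of that calculation, not the authors' pipeline. The function names are illustrative, tied values receive their average rank, and no p-value is computed.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Applied to paired RNA and protein values for one sample (or to paired PAL values), `spearman_rho` returns a value between -1 and 1 that depends only on the ordering of the measurements, which is why it suits platforms with different measurement scales.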
Project description: Multi-omics datasets can provide molecular insights beyond the sum of individual omics. Various tools have been recently developed to integrate such datasets, but there are limited strategies to systematically extract mechanistic hypotheses from them. Here, we present COSMOS (Causal Oriented Search of Multi-Omics Space), a method that integrates phosphoproteomics, transcriptomics, and metabolomics datasets. COSMOS combines extensive prior knowledge of signaling, metabolic, and gene regulatory networks with computational methods to estimate activities of transcription factors and kinases as well as network-level causal reasoning. COSMOS provides mechanistic hypotheses for experimental observations across multi-omics datasets. We applied COSMOS to a dataset comprising transcriptomics, phosphoproteomics, and metabolomics data from healthy and cancerous tissue from eleven clear cell renal cell carcinoma (ccRCC) patients. COSMOS was able to capture relevant crosstalk within and between multiple omics layers, such as known ccRCC drug targets. We expect that our freely available method will be broadly useful to extract mechanistic insights from multi-omics studies.
Project description: Depression is a highly heterogeneous disorder. Accumulating evidence suggests biological and genetic differences between subtypes of depression that are homogeneous in symptom presentation. We aimed to evaluate differences in serum protein profiles between persons with atypical and melancholic depressive subtypes, and compare these profiles with serum protein levels of healthy controls. We used the baseline data from the Netherlands Study of Depression and Anxiety on 414 controls, 231 persons with a melancholic depressive subtype and 128 persons with an atypical depressive subtype for whom the proteomic data were available. Depressive subtypes were previously established using a data-driven analysis, and 171 serum proteins were measured on a multi-analyte profiling platform. Linear regression models were adjusted for several covariates and corrected for multiple testing using false discovery rate q-values. We observed differences in analytes between the atypical and melancholic subtypes (9 analytes, q<0.05) and between atypical depression and controls (23 analytes, q<0.05). Eight of the nine markers differing between the atypical and melancholic subtypes overlapped with markers from the comparison between the atypical subtype and controls (mesothelin, leptin, IGFBP1, IGFBP2, FABPa, insulin, C3 and B2M), and were mainly involved in cellular communication and signal transduction, and immune response. No markers differed significantly between the melancholic subtype and controls.
To conclude, although some uncertainties exist in our results because of missing data imputation and the lack of proteomic replication samples, many of the identified analytes are inflammatory or metabolic markers, which supports the notion of atypical depression as a syndrome characterized by metabolic disturbances and inflammation, and underlines the importance and relevance of subtypes of depression in biological and genetic research, and potentially in the treatment of depression.
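The abstract does not say which procedure produced the false discovery rate q-values. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up adjustment. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over this rank and all larger ranks (enforces monotonicity).
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical p-values for four analytes (not taken from the study).
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in example_q]
```

An analyte would then be reported as differing between groups when its q-value falls below the chosen threshold, 0.05 in the study above.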
Project description: Formation of mature miRNAs and their expression is a highly controlled process that depends heavily on post-transcriptional regulatory events. Recent findings suggest that several RNA binding proteins (RBPs) beyond Drosha/Dicer are involved in the processing of miRNAs. Deciphering conditional networks for these RBP-miRNA interactions may help to reason about the spatio-temporal nature of miRNAs, and can also be used to predict miRNA profiles. To this end, >25TB of data from different platforms (CLIP-seq/RNA-seq/miRNA-seq) were studied to develop Bayesian causal networks capable of reasoning about miRNA biogenesis. The networks ably explained miRNA formation when tested across a large number of conditions and experimentally validated data. The networks were modeled into an XGBoost machine learning system in which expression information of the network components was found capable of quantitatively explaining miRNA formation levels and their profiles. The models were developed for 1,204 human miRNAs whose accurate expression levels could be detected directly from RNA-seq data alone, without any need for separate miRNA profiling experiments such as miRNA-seq or arrays. A first of its kind, miRbiom performed consistently well with high average accuracy (91%) when tested across a large number of experimentally established data from several conditions. It has been implemented as an interactive open access web-server where, besides finding the profiles of miRNAs, their downstream functional analysis can also be done. miRbiom will help to obtain accurate predictions of human miRNA profiles in the absence of profiling experiments and will be an asset for regulatory research areas. The study also shows the importance of RBP interaction information for better understanding miRNAs and their functional projectiles, and lays the foundation for such studies and software in the future.
Project description: The regression discontinuity (RD) design is a quasi-experimental design that estimates the causal effects of a treatment by exploiting naturally occurring treatment rules. It can be applied in any context where a particular treatment or intervention is administered according to a pre-specified rule linked to a continuous variable. Such thresholds are common in primary care drug prescription, where the RD design can be used to estimate the causal effect of medication in the general population. Such results can then be contrasted with those obtained from randomised controlled trials (RCTs) and inform prescription policy and guidelines based on a more realistic and less expensive context. In this paper, we focus on statins, a class of cholesterol-lowering drugs; however, the methodology can be applied to many other drugs provided these are prescribed in accordance with pre-determined guidelines. Current guidelines in the UK state that statins should be prescribed to patients with 10-year cardiovascular disease risk scores in excess of 20%. If we consider patients whose risk scores are close to the 20% risk score threshold, we find that there is an element of random variation in both the risk score itself and its measurement. We can therefore consider the threshold as a randomising device that assigns statin prescription to individuals just above the threshold and withholds it from those just below. Thus, we are effectively replicating the conditions of an RCT in the area around the threshold, removing or at least mitigating confounding. We frame the RD design in the language of conditional independence, which clarifies the assumptions necessary to apply an RD design to data, and which makes the links with instrumental variables clear. We also have context-specific knowledge about the expected sizes of the effects of statin prescription and are thus able to incorporate this into Bayesian models by formulating informative priors on our causal parameters.
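The threshold logic described above can be illustrated with a simple frequentist sketch. It is not the Bayesian model of the paper. It assumes a sharp design in which everyone above the threshold is treated, fits an ordinary least-squares line on each side within a bandwidth, and takes the jump between the two fitted lines at the threshold as the effect estimate. The function names, the bandwidth, and any data passed in are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_estimate(scores, outcomes, threshold, bandwidth):
    """Jump in the fitted outcome at the threshold (treated minus untreated)."""
    # Centre the scores on the threshold so each intercept is the fitted
    # outcome exactly at the threshold.
    below = [(s - threshold, y) for s, y in zip(scores, outcomes)
             if threshold - bandwidth <= s <= threshold]
    above = [(s - threshold, y) for s, y in zip(scores, outcomes)
             if threshold < s <= threshold + bandwidth]
    if len(below) < 2 or len(above) < 2:
        raise ValueError("need at least two observations on each side")
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below
```

For example, with risk scores as `scores`, a cholesterol measurement as `outcomes`, and `threshold=0.20`, a negative return value would suggest that prescription just above the threshold lowers the outcome. In practice the estimate is sensitive to the bandwidth and to how strictly the prescribing rule is followed.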
Project description: Reproductive health program managers seek information about existing and potential clients' motivations, behaviors, and barriers to services. Using sequence and cluster analysis of contraceptive calendar data from the 2016-17 Burundi Demographic and Health Survey, we identified discrete clusters characterizing patterns in women's contraceptive and pregnancy behaviors over the previous 5 years. This study pairs these clusters with data on factors typically targeted in social behavior change interventions: knowledge, attitudes, and women's interactions with media and health services, to create composite profiles of women in these clusters. Of six clusters, three are characterized by contraceptive use and three are characterized by its absence. Media exposure and attitudes regarding sex preference, wife beating, and self-efficacy largely do not explain cluster membership. Contraceptive knowledge is positively associated with two clusters (Family Builder 1 and Traditional Mother) and negatively associated with a third (Quiet Calendar). Clusters also differ in their members' fertility desires, contraceptive intentions, and interactions with health services. Two "Family Builder" clusters are both characterized by the presence (but not timing) of multiple pregnancies in their calendar histories, but differ in that women with high contraceptive knowledge, intentions to use contraception, and well-articulated family size ideals are characteristic of one cluster (Family Builder 1), and low contraceptive knowledge, no use of contraception, and vague family size preferences are characteristic of the other (Family Builder 2). These results can guide reproductive health programs as they target social and behavioral change and other interventions to the unique subpopulations they seek to serve.