Dataset Information

Integrative analysis and machine learning on cancer genomics data using the Cancer Systems Biology Database (CancerSysDB).

ABSTRACT:

Background

Recent cancer genome studies on many human cancer types have relied on multiple molecular high-throughput technologies. Given the vast amount of data that has been generated, there are surprisingly few databases which facilitate access to these data and make them available for flexible analysis queries in the broad research community. If used in their entirety and provided at a high structural level, these data can be directed into constantly increasing databases which bear an enormous potential to serve as a basis for machine learning technologies with the goal to support research and healthcare with predictions of clinically relevant traits.

Results

We have developed the Cancer Systems Biology Database (CancerSysDB), a resource for highly flexible queries and analysis of cancer-related data across multiple data types and multiple studies. The CancerSysDB can be adopted by any center for the organization of their locally acquired data and its integration with publicly available data from multiple studies. A publicly available main instance of the CancerSysDB can be used to obtain highly flexible queries across multiple data types as shown by highly relevant use cases. In addition, we demonstrate how the CancerSysDB can be used for predictive cancer classification based on whole-exome data from 9091 patients in The Cancer Genome Atlas (TCGA) research network.

Conclusions

Our database bears the potential to be used for large-scale integrative queries and predictive analytics of clinically relevant traits.

SUBMITTER: Krempel R

PROVIDER: S-EPMC5921751 | biostudies-literature | 2018 Apr

REPOSITORIES: biostudies-literature

Publications

Integrative analysis and machine learning on cancer genomics data using the Cancer Systems Biology Database (CancerSysDB).

Rasmus Krempel, Pranav Kulkarni, Annie Yim, Ulrich Lang, Bianca Habermann, Peter Frommolt

BMC Bioinformatics, 2018 Apr 24, issue 1

PMID: 29699486

Similar Datasets

Project description: Background: Diabetes mellitus is a chronic disease that impacts an increasing percentage of people each year. Among its comorbidities, diabetics are two to four times more likely to develop cardiovascular diseases. While HbA1c remains the primary diagnostic for diabetics, its ability to predict long-term health outcomes across diverse demographics, ethnic groups, and at a personalized level is limited. The purpose of this study was to provide a model for precision medicine through the implementation of machine-learning algorithms using multiple cardiac biomarkers as a means for predicting diabetes mellitus development. Methods: Right atrial appendages from 50 patients, 30 non-diabetic and 20 type 2 diabetic, were procured from the WVU Ruby Memorial Hospital. Machine learning was applied to physiological, biochemical, and sequencing data for each patient. Supervised learning implementing SHapley Additive exPlanations (SHAP) allowed binary (no diabetes or type 2 diabetes) and multiple classification (no diabetes, prediabetes, and type 2 diabetes) of the patient cohort with and without the inclusion of HbA1c levels. Findings were validated through Logistic Regression (LR), Linear Discriminant Analysis (LDA), Gaussian Naïve Bayes (NB), Support Vector Machine (SVM), and Classification and Regression Tree (CART) models with tenfold cross validation. Results: Total nuclear methylation and hydroxymethylation were highly correlated to diabetic status, with nuclear methylation and mitochondrial electron transport chain (ETC) activities achieving superior testing accuracies in the predictive model (~84% testing, binary). Mitochondrial DNA SNPs found in the D-Loop region (SNP-73G, -16126C, and -16362C) were highly associated with diabetes mellitus. The CpG island of transcription factor A, mitochondrial (TFAM) revealed CpG24 (chr10:58385262, P = 0.003) and CpG29 (chr10:58385324, P = 0.001) as markers correlating with diabetic progression. When combining the most predictive factors from each set, total nuclear methylation and CpG24 methylation were the best diagnostic measures in both binary and multiple classification sets. Conclusions: Using machine learning, we were able to identify novel as well as the most relevant biomarkers associated with type 2 diabetes mellitus by integrating physiological, biochemical, and sequencing datasets. Ultimately, this approach may be used as a guideline for future investigations into disease pathogenesis and novel biomarker discovery.

Project description: Cancerogenesis is driven by mutations leading to aberrant functioning of a complex network of molecular interactions and simultaneously affecting multiple cellular functions. Therefore, the successful application of bioinformatics and systems biology methods for analysis of high-throughput data in cancer research heavily depends on the availability of global and detailed reconstructions of signalling networks amenable to computational analysis. We present here the Atlas of Cancer Signalling Network (ACSN), an interactive and comprehensive map of molecular mechanisms implicated in cancer. The resource includes tools for map navigation, visualization and analysis of molecular data in the context of signalling network maps. Constructing and updating ACSN involves careful manual curation of molecular biology literature and the participation of experts in the corresponding fields. The cancer-oriented content of ACSN is completely original and covers major mechanisms involved in cancer progression, including DNA repair, cell survival, apoptosis, cell cycle, EMT and cell motility. Cell signalling mechanisms are depicted in detail, together creating a seamless 'geographic-like' map of molecular interactions frequently deregulated in cancer. The map is browsable through the NaviCell web interface, which uses the Google Maps engine and the semantic zooming principle. The associated web-blog provides a forum for commenting on and curating the ACSN content. ACSN allows users to upload heterogeneous omics data on top of the maps for visualization and functional analyses. We suggest several scenarios for ACSN application in cancer research, particularly for visualizing high-throughput data, starting from small interfering RNA-based screening results or mutation frequencies to innovative ways of exploring transcriptomes and phosphoproteomes. Integration and analysis of these data in the context of ACSN may help interpret their biological significance and formulate mechanistic hypotheses. ACSN may also support patient stratification, prediction of treatment response and resistance to cancer drugs, as well as design of novel treatment strategies.
