Dataset Information

Integrative computational epigenomics to build data-driven gene regulation hypotheses.

ABSTRACT: Diseases are complex phenotypes often arising as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Often, each of these resulting datasets is large, heterogeneous, and noisy. Thus, it is non-trivial to unify these complex datasets into an interpretable phenotype. Although recent methods address this problem with varying degrees of success, they are constrained by their scopes or limitations. Therefore, an important gap in the field is the lack of a universal data harmonizer with the capability to arbitrarily integrate multi-modal datasets. In this review, we perform a critical analysis of methods with the explicit aim of harmonizing data, as opposed to case-specific integration. This analysis revealed that matrix factorization, latent variable analysis, and deep learning are potent strategies. Finally, we describe the properties of an ideal universal data harmonization framework. A sufficiently advanced universal harmonizer has major medical implications: (1) identifying dysregulated biological pathways responsible for a disease provides a powerful diagnostic tool; (2) investigating these pathways further allows the biological community to better understand a disease's mechanisms; and (3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.

SUBMITTER: Chen T

PROVIDER: S-EPMC7297091 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature

Publications

Integrative computational epigenomics to build data-driven gene regulation hypotheses.

Chen Tyrone, Tyagi Sonika

GigaScience, 2020 Jun 1, issue 6

PMID: 32543653

Similar Datasets

Project description: Background: Epigenetic alterations are known to correlate with changes in gene expression among various diseases including cancers. However, quantitative models that accurately predict the up or down regulation of gene expression are currently lacking. Methods: A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. This method uses the Illumina Infinium HumanMethylation450K Beadchip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA) and histone modification marker ChIP-Seq data from the ENCODE project to predict the differential expression of RNA-Seq data in TCGA lung cancers. It considers a comprehensive list of 1424 features spanning the four categories of CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared to select the best model over 10-fold cross-validation in the training data set. Results: A best model comprising 67 features is chosen by ReliefF-based feature selection and random forest classification, with AUC = 0.864 from the 10-fold cross-validation of the training set and AUC = 0.836 from the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being most abundant. Among the dropping-off tests of individual data-type based features, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest among all locations relative to transcripts. 
Sequential dropping-off of CpG methylation features relative to different regions on the protein coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression. Conclusions: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.

Project description: The NCBI Gene Expression Omnibus (GEO) provides tools to query and download transcriptomic data. However, less than 4% of microbial experiments include the sample group annotations required to assess differential gene expression for high-throughput reanalysis, and data deposited after 2014 universally lack these annotations. Our algorithm GAUGE (general annotation using text/data group ensembles) automatically annotates GEO microbial data sets, including microarray and RNA sequencing studies, increasing the percentage of data sets amenable to analysis from 4% to 33%. Eighty-nine percent of GAUGE-annotated studies matched group assignments generated by human curators. To demonstrate how GAUGE annotation can lead to scientific insight, we created GAPE (GAUGE-annotated Pseudomonas aeruginosa and Escherichia coli transcriptomic compendia for reanalysis), a Shiny Web interface to analyze 73 GAUGE-annotated P. aeruginosa studies, three times more than previously available. GAPE analysis revealed that PA3923, a gene of unknown function, was frequently differentially expressed in more than 50% of studies and significantly coregulated with genes involved in biofilm formation. Follow-up wet-bench experiments demonstrate that PA3923 mutants are indeed defective in biofilm formation, consistent with predictions facilitated by GAUGE and GAPE. We anticipate that GAUGE and GAPE, which we have made freely available, will make publicly available microbial transcriptomic data easier to reuse and lead to new data-driven hypotheses. IMPORTANCE: GEO archives transcriptomic data from over 5,800 microbial experiments and allows researchers to answer questions not directly addressed in published papers. However, less than 4% of the microbial data sets include the sample group annotations required for high-throughput reanalysis. This limitation blocks a considerable amount of microbial transcriptomic data from being reused easily. 
Here, we demonstrate that the GAUGE algorithm could make 33% of microbial data accessible to parallel mining and reanalysis. GAUGE annotations increase statistical power and, thereby, make consistent patterns of differential gene expression easier to identify. In addition, we developed GAPE (GAUGE-annotated Pseudomonas aeruginosa and Escherichia coli transcriptomic compendia for reanalysis), a Shiny Web interface that performs parallel analyses on P. aeruginosa and E. coli compendia. Source code for GAUGE and GAPE is freely available and can be repurposed to create compendia for other bacterial species.

Project description: BACKGROUND: With the resurgence of tick-borne diseases such as Lyme disease and the emergence of new tick-borne pathogens such as Powassan virus, understanding what distinguishes vectors from non-vectors, and predicting undiscovered tick vectors, is a crucial step towards mitigating disease risk in humans. We aimed to identify intrinsic traits that predict which Ixodes tick species are confirmed or strongly suspected to be vectors of zoonotic pathogens. METHODS: We focused on the well-studied tick genus Ixodes, from which many species are known to transmit zoonotic diseases to humans. We apply generalized boosted regression to interrogate over 90 features for over 240 species of Ixodes ticks to learn what intrinsic features distinguish zoonotic vectors from non-vector species. In addition to better understanding the biological underpinnings of tick vectorial capacity, the model generates a per-species probability of being a zoonotic vector on the basis of intrinsic biological similarity with known Ixodes vector species. RESULTS: Our model predicted vector status with over 91% accuracy, and identified 14 Ixodes species with high probabilities (80%) of transmitting infections from animal hosts to humans on the basis of their traits. Distinguishing characteristics of zoonotic tick vectors of Ixodes tick species include several anatomical structures that influence host-seeking behavior and blood-feeding efficiency from a greater diversity of host species compared to non-vectors. CONCLUSIONS: Overall, these results suggest that zoonotic tick vectors are most likely to be those species where adult females hold a fecundity advantage by producing more eggs per clutch, which develop into larvae that feed on a greater diversity of host species compared to non-vector species. 
These larvae develop into nymphs whose anatomy is well suited for more efficient and longer feeding times on soft-bodied hosts compared to non-vectors, leading to larger adult females with greater fecundity. In addition to identifying novel, testable hypotheses about intrinsic features driving vectorial capacity across Ixodes tick species, our model identifies particular Ixodes species with the highest probability of carrying zoonotic diseases, offering specific targets for increased zoonotic investigation and surveillance.
