Dataset Information

Discovery of cell-type specific DNA motif grammar in cis-regulatory elements using random forest.

ABSTRACT: Many transcription factors (TFs) bind to different genomic loci depending on the cell type in which they are expressed, even though an individual TF usually binds to the same core motif in different cell types. How a TF can bind to the genome in such a highly cell-type specific manner is a critical research question. One hypothesis is that a TF requires co-binding of different TFs in different cell types. If this is the case, it may be possible to observe different combinations of TF motifs (a motif grammar) located at the TF binding sites in different cell types. In this study, we develop a bioinformatics method to systematically identify DNA motifs in TF binding sites across multiple cell types based on published ChIP-seq data, and address two questions: (1) can we build a machine learning classifier to predict cell-type specificity based on motif combinations alone, and (2) can we extract meaningful cell-type specific motif grammars from this classifier model? We present a Random Forest (RF) based approach to build a multi-class classifier that predicts the cell-type specificity of a TF binding site given its motif content. We applied this RF classifier to two published ChIP-seq datasets, for the TFs TCF7L2 and MAX, each spanning multiple cell types. Using cross-validation, we show that motif combinations alone are indeed predictive of cell types. Furthermore, we present a rule mining approach to extract the most discriminatory rules in the RF classifier, thus allowing us to discover the underlying cell-type specific motif grammar. Our bioinformatics analysis supports the hypothesis that combinatorial TF motif patterns are cell-type specific.
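The idea of a motif grammar can be illustrated with a toy sketch. It does not reproduce the Random Forest classifier or the rule mining procedure described in the abstract; it swaps in a much simpler frequency-based scoring, and every name in it (`SITES`, `score_rules`, the placeholder motif and cell-type labels) is hypothetical and the data are invented. Each binding site is represented by the set of motifs found at it, and each motif combination is scored by how often it points to a single cell type.

```python
from collections import Counter
from itertools import combinations

# Invented toy data: (cell type, motifs found at one binding site).
# "core" stands in for the TF's own motif, present at every site.
SITES = [
    ("cell_1", {"core", "A", "B"}),
    ("cell_1", {"core", "A", "B"}),
    ("cell_1", {"core", "A"}),
    ("cell_2", {"core", "C"}),
    ("cell_2", {"core", "C", "B"}),
    ("cell_3", {"core", "D"}),
    ("cell_3", {"core", "D", "C"}),
]


def score_rules(sites, size=2, min_support=2):
    """Score rules of the form 'motif combination -> cell type'.

    Returns tuples (combo, cell_type, support, confidence), where support is
    the number of sites of that cell type containing the combination and
    confidence is support divided by the number of sites, of any cell type,
    containing the combination. Sorted by confidence, then support.
    """
    combo_total = Counter()
    combo_by_cell = Counter()
    for cell, motifs in sites:
        for combo in combinations(sorted(motifs), size):
            combo_total[combo] += 1
            combo_by_cell[(combo, cell)] += 1

    rules = [
        (combo, cell, n, n / combo_total[combo])
        for (combo, cell), n in combo_by_cell.items()
        if n >= min_support
    ]
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules


if __name__ == "__main__":
    for combo, cell, support, confidence in score_rules(SITES):
        print(f"{' + '.join(combo)} -> {cell} "
              f"(support={support}, confidence={confidence:.2f})")
```

In this toy data the shared core motif alone says little about cell type, whereas pairing it with a co-occurring motif does, which is the intuition behind the co-binding hypothesis. A tree-based classifier such as a Random Forest can capture combinations of this kind across many more motifs.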

SUBMITTER: Wang X

PROVIDER: S-EPMC5780765 | biostudies-literature | 2018 Jan

REPOSITORIES: biostudies-literature


Publications

Discovery of cell-type specific DNA motif grammar in cis-regulatory elements using random forest.

Wang Xin, Lin Peijie, Ho Joshua W K

BMC Genomics, 2018 Jan 19, Suppl 1


PMID: 29363433

Similar Datasets

Project description: Background: Post-transcriptional gene regulation controls the amount of protein produced from an individual mRNA by altering rates of decay and translation. Many sequence elements that direct post-transcriptional regulation have been found; in mammals, most such elements are located within the 3' untranslated regions (3'UTRs). Comparative genomic studies demonstrate that mammalian 3'UTRs contain extensive conserved sequence tracts, yet only a small fraction corresponds to recognized elements, implying that many additional novel elements exist. Despite a variety of computational, molecular, and biochemical approaches, identifying functional 3'UTR elements remains difficult. Results: We created a high-throughput cell-based screen that enables identification of functional post-transcriptional 3'UTR regulatory elements. Our system exploits integrated single-copy reporters, which are expressed and processed as endogenous genes. We screened many thousands of short random sequences for their regulatory potential. Control sequences with known effects were captured effectively using our approach, establishing that our methodology was robust. We found hundreds of functional sequences, which we validated in traditional reporter assays, including verifying their regulatory impact in native sequence contexts. Although 3'UTRs are typically considered repressive, most of the functional elements were activating, including ones that were preferentially conserved. Additionally, we adapted our screening approach to examine the effect of elements on RNA abundance, revealing that most elements act by altering mRNA stability. Conclusions: We developed and used a high-throughput approach to discover hundreds of post-transcriptional cis-regulatory elements. These results imply that most human 3'UTRs contain many previously unrecognized cis-regulatory elements, many of which are activating, and that the post-transcriptional fate of an mRNA is largely due to the actions of many individual cis-regulatory elements within its 3'UTR.

Project description: It is now established that, as compared to normal cells, the cancer cell genome has an overall inverse distribution of DNA methylation ("methylome"), i.e., predominant hypomethylation and localized hypermethylation, within "CpG islands" (CGIs). Moreover, although cancer cells have reduced methylation "fidelity" and genomic instability, accurate maintenance of aberrant methylomes that underlie malignant phenotypes remains necessary. However, the mechanism(s) of cancer methylome maintenance remains largely unknown. Here, we assessed CGI methylation patterns propagated over 1, 3, and 5 divisions of A2780 ovarian cancer cells, concurrent with exposure to the DNA cross-linking chemotherapeutic cisplatin, and observed cell generation-successive increases in total hyper- and hypo-methylated CGIs. Empirical Bayesian modeling revealed five distinct modes of methylation propagation: (1) heritable (i.e., unchanged) high-methylation (1186 probe loci in CGI microarray); (2) heritable (i.e., unchanged) low-methylation (286 loci); (3) stochastic hypermethylation (i.e., progressively increased, 243 loci); (4) stochastic hypomethylation (i.e., progressively decreased, 247 loci); and (5) considerable "random" methylation (582 loci). These results support a "stochastic model" of DNA methylation equilibrium deriving from the efficiency of two distinct processes, methylation maintenance and de novo methylation. A role for cis-regulatory elements in methylation fidelity was also demonstrated by highly significant (p < 2.2×10^-5) enrichment of transcription factor binding sites in CGI probe loci showing heritably high (118 elements) and low (47 elements) methylation, and also in loci demonstrating stochastic hyper- (30 elements) and hypo- (31 elements) methylation. Notably, loci having "random" methylation heritability displayed nearly no enrichment. These results demonstrate an influence of cis-regulatory elements on the nonrandom propagation of both strictly heritable and stochastically heritable CGIs.
