Project description: Detecting in vivo transcription factor (TF) binding is important for understanding gene regulatory circuitries. ChIP-seq is a powerful technique to empirically define TF binding in vivo. However, the multitude of distinct TFs makes genome-wide profiling for them all labor-intensive and costly. Algorithms for in silico prediction of TF binding have been developed, based mostly on histone modification or DNase I hypersensitivity data in conjunction with DNA motif and other genomic features. However, technical limitations of these methods prevent them from being applied broadly, especially in clinical settings. We conducted a comprehensive survey involving multiple cell lines, TFs, and methylation types and found that there are intimate relationships between TF binding and methylation level changes around the binding sites. Exploiting the connection between DNA methylation and TF binding, we proposed a novel supervised learning approach to predict TF-DNA interaction using data from base-resolution whole-genome methylation sequencing experiments. We devised beta-binomial models to characterize methylation data around TF binding sites and the background. Along with other static genomic features, we adopted a random forest framework to predict TF-DNA interaction. After conducting comprehensive tests, we saw that the proposed method accurately predicts TF binding and performs favorably versus competing methods. Examine Oct4 genome-wide binding in mouse embryonic stem cells (E14).
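As a rough illustration of how beta-binomial models of methylation counts around bound sites and in the background could be turned into a per-site score, the sketch below computes a log-likelihood ratio between two beta-binomial distributions. The parameter values, the function names, and the use of a log-likelihood ratio as the score are all assumptions made for illustration; the description above does not specify them. Such a score could, in principle, be one input feature to a classifier such as a random forest.

```python
from math import lgamma, exp


def log_beta(x, y):
    """Log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def betabinom_logpmf(k, n, a, b):
    """Log-probability of k methylated reads out of n under BetaBinomial(n, a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def log_likelihood_ratio(sites, bound_params, background_params):
    """Sum over cytosines of log P(counts | bound) - log P(counts | background).

    sites: list of (methylated_reads, total_reads) pairs near a candidate site.
    """
    return sum(
        betabinom_logpmf(k, n, *bound_params)
        - betabinom_logpmf(k, n, *background_params)
        for k, n in sites
    )


# Hypothetical parameters (not from the study): low methylation near bound
# sites (mean 0.2) and high methylation in the background (mean 0.8).
BOUND = (1.0, 4.0)
BACKGROUND = (4.0, 1.0)

low_meth = [(1, 20), (0, 15), (2, 18)]      # mostly unmethylated reads
high_meth = [(18, 20), (14, 15), (17, 18)]  # mostly methylated reads

score_low = log_likelihood_ratio(low_meth, BOUND, BACKGROUND)
score_high = log_likelihood_ratio(high_meth, BOUND, BACKGROUND)
print(score_low, score_high)
```

Under these assumed parameters, a positive score favors the "bound" model and a negative score favors the background model.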
Project description: The study of 5-hydroxymethylcytosine (5hmC), the sixth base of the mammalian genome, as an epigenetic mark has been hampered by the lack of a method to map it at single-base resolution. Previous affinity purification-based methods could neither precisely locate 5hmC nor accurately determine its relative abundance at each modified site. Here we present a genome-wide approach for mapping 5hmC at base resolution. Application of this new method to embryonic stem cells not only confirms the widespread distribution of 5hmC in the mammalian genome, but also reveals a strong sequence bias and strand asymmetry at sites of 5hmC. Additionally, the relative abundance of 5hmC varies significantly depending on the type of functional sequence, suggesting different mechanisms for 5hmC deposition and maintenance. Furthermore, we observe high levels of 5hmC and reciprocally low levels of 5mC at transcription factor binding sites, revealing a dynamic DNA methylation process at cis-regulatory elements. Base-resolution sequencing of 5-hydroxymethylcytosine in human and mouse embryonic stem cells.
Project description: We performed Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) to profile genome-wide chromatin accessibility in the human H1 embryonic stem cell (ESC) line. We used these data to train a deep learning model called ChromBPNet, which can accurately predict base-resolution accessibility profiles as a function of DNA sequence while accounting for and correcting biases due to the sequence preferences of the Tn5 transposase used in ATAC-seq. We interpreted the models to identify globally predictive transcription factor (TF) motifs, individual predictive motif instances in all accessible regions, and Tn5-bias-corrected canonical footprints of TFs at these predictive motifs.
Project description: We introduce Affinity Distillation (AD), a method for extracting thermodynamic affinities de novo from in vivo immunoprecipitation experiments using deep learning. We show that neural networks modeling base-resolution in vivo binding profiles of yeast and mammalian TFs can accurately predict the energetic impacts of varying the underlying DNA sequence on TF binding. Systematic comparisons between Affinity Distillation predictions and other predictive algorithms consistently show that Affinity Distillation more accurately predicts affinities across a wide range of TF structural classes and DNA sequences. Affinity Distillation relies on in silico marginalization against many sequence backgrounds, resulting in a higher dynamic range and more accurate predictions than motif discovery algorithms. Moreover, we show that Affinity Distillation can learn differential paralog-specific affinities, thereby making it possible to more accurately reconstruct regulatory networks in cells.
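The idea of in silico marginalization against many sequence backgrounds can be sketched as follows: a candidate sequence is inserted into many random backgrounds, and the change in a model's prediction is averaged. This is a minimal sketch only. `toy_model` is a hypothetical stand-in for a trained neural network, and the target sequence, background length, and number of backgrounds are arbitrary assumptions, not values from the description above.

```python
import random


def marginalize(model, motif, n_backgrounds=200, length=60, seed=0):
    """Average change in model output when `motif` is inserted at the center
    of random background sequences (uniform A/C/G/T, an assumption)."""
    rng = random.Random(seed)
    start = (length - len(motif)) // 2
    deltas = []
    for _ in range(n_backgrounds):
        background = "".join(rng.choice("ACGT") for _ in range(length))
        with_motif = (
            background[:start] + motif + background[start + len(motif):]
        )
        deltas.append(model(with_motif) - model(background))
    return sum(deltas) / len(deltas)


def toy_model(seq, target="TGACTCA"):
    """Hypothetical stand-in for a trained model: the best number of
    matching positions to `target` over all windows of the sequence."""
    best = 0
    for i in range(len(seq) - len(target) + 1):
        window = seq[i:i + len(target)]
        best = max(best, sum(a == b for a, b in zip(window, target)))
    return float(best)


# Compare a perfect match to the toy target with a one-mismatch variant.
strong = marginalize(toy_model, "TGACTCA")
weak = marginalize(toy_model, "TGACTAA")
print(strong, weak)
```

Averaging over backgrounds is intended to reduce the influence of any single flanking context on the score, so that differences between candidate sequences reflect the sequences themselves.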
Project description: Here we characterize an association between disease progression and DNA methylation in diffuse large B-cell lymphoma (DLBCL). By profiling genome-wide DNA methylation at single-base-pair resolution in thirteen DLBCL diagnosis-relapse sample pairs, we show that DLBCL patients exhibit heterogeneous evolution of tumor methylomes during relapse. We identify differentially methylated regulatory elements and determine a relapse-associated methylation signature converging on key pathways such as transforming growth factor beta (TGF-beta) receptor activity. We also observe decreased intra-tumor methylation heterogeneity from diagnosis to relapsed tumor samples. Relapse-free patients display lower intra-tumor methylation heterogeneity at diagnosis compared to relapsed patients in an independent validation cohort. Furthermore, intra-tumor methylation heterogeneity is predictive of time to relapse. Therefore, we propose that epigenomic heterogeneity may support or drive the relapse phenotype and can be used to predict DLBCL relapse. Using ERRBS, we profiled genome-wide DNA methylation patterns of non-relapse DLBCL tumor samples at diagnosis, and of relapse DLBCL patient samples at diagnosis and at relapse.
Project description: The goal of this study was to discover the transcription factor binding syntax for the key differentiation TFs in mouse embryonic stem cells. Genes are regulated through enhancer sequences, in which transcription factor binding motifs and their specific arrangements (syntax) form a cis-regulatory code. To understand the relationship between motif syntax and transcription factor binding, we train a deep learning model that uses DNA sequence to predict base-resolution binding profiles of four pluripotency transcription factors: Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn novel motif representations, and identify rules by which motifs and syntax influence transcription factor binding. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences motif interactions at protein and nucleosome range. Most strikingly, Nanog binding is driven by motifs with a strong preference for ~10.5 bp spacings, corresponding to helical periodicity. Interpreting deep learning models applied to high-resolution binding data is a powerful and versatile approach to uncover the motifs and syntax of cis-regulatory sequences.