Project description: Understanding 3D genome structure requires high-throughput, genome-wide approaches. However, assays for all-vs-all chromatin interaction mapping are expensive and time-consuming, which severely restricts their use for large-scale mutagenesis screens or for mapping the impact of sequence variants. Computational models sophisticated enough to grasp the determinants of chromatin folding provide a unique window into the functional determinants of 3D genome structure as well as the effects of genome variation. A chromatin interaction predictor should work at the base pair level but also incorporate large-scale genomic context, so that it simultaneously captures both the large-scale and the intricate structures of chromatin architecture. Similarly, to be a flexible and generalizable approach, it should also be applicable to data it has not been explicitly trained on. To develop a model with these properties, we designed a deep neural network (deepC) that utilizes transfer learning to accurately predict chromatin interactions from DNA sequence at megabase scale. The model generalizes well to unseen chromosomes and works across cell types, Hi-C data resolutions and a range of sequencing depths. DeepC integrates DNA sequence context on an unprecedented scale, bridging the different levels of resolution from base pairs to TADs. We demonstrate how this model allows us to investigate sequence determinants of chromatin folding at genome-wide scale and to predict the importance of regulatory elements and the impact of sequence variations.
Project description: Understanding how regulatory sequences interact in the context of chromosomal architecture is a central challenge in biology. Chromosome conformation capture revealed that mammalian chromosomes possess a rich hierarchy of structural layers, from multi-megabase compartments to sub-megabase topologically associating domains (TADs), and further down to sub-TAD loop domains. TADs appear to act as regulatory microenvironments by constraining and segregating regulatory interactions across discrete chromosomal regions. However, it is unclear whether other (or all) folding layers share similar properties, or whether TADs instead constitute a privileged folding scale with maximal impact on the organization of regulatory interactions. Here we present a novel parameter-free algorithm (CaTCH) that identifies hierarchical trees of chromosomal domains in Hi-C maps, stratified through their reciprocal physical insulation, which is a simple and biologically relevant property. By applying CaTCH to published Hi-C datasets, we show that previously reported folding layers appear at different insulation levels. We demonstrate that although no structurally privileged folding level exists, TADs emerge as a functionally privileged scale defined by maximal enrichment of CTCF at boundaries and maximal cell-type conservation. By measuring transcriptional output in embryonic stem cells and neural precursor cells, we show that TADs also maximize the likelihood that genes in a domain are co-regulated during differentiation. Finally, we observe that regulatory sequences occur at genomic locations corresponding to optimized mutual interactions at the scale of TADs. Our analysis thus suggests that the architectural functionality of TADs arises from the interplay between their ability to partition interactions and the genomic position of regulatory sequences.