Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity
ABSTRACT: The DNA sequence preferences of the vast majority of eukaryotic transcription factors (TFs) are unknown. Using an approach designed to broadly sample both DNA-binding domain types and eukaryotic clades, we have determined DNA-binding motifs for 1,033 TFs from 131 diverse eukaryotes, encompassing 54 domain types. Closely related orthologs and paralogs typically have very similar sequence preferences; this property allows inference of motifs for roughly one third of the 166,851 known or predicted eukaryotic TFs. While the origins of most motifs can be dated to hundreds of millions of years ago, we also characterize more recent TF expansions. Sequences matching the motifs are enriched upstream of TSS in most eukaryotic lineages, and at informative eQTL SNPs in Arabidopsis promoters, demonstrating their utility in mapping transcriptional networks. The motifs are housed at http://cisbp.ccbr.utoronto.ca
Project description: Protein binding microarray (PBM) experiments were performed for a set of 1,048 diverse eukaryotic transcription factors. Briefly, the PBMs involved binding GST-tagged DNA-binding proteins to two double-stranded 44K Agilent microarrays, each containing a different de Bruijn sequence design, in order to determine their sequence preferences. Details of the PBM protocol are described in Berger et al., Nature Biotechnology 2006.
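The de Bruijn design mentioned above can be illustrated with a small sketch. A de Bruijn sequence of order k contains every possible k-mer exactly once, which is what lets a compact array sample all k-mers. The code below is an illustration only, not the array design procedure from the PBM protocol: the function names `de_bruijn` and `linear_de_bruijn` are hypothetical, and k = 3 is chosen only to keep the output small.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic de Bruijn sequence of order k over the alphabet.

    Uses the standard Lyndon-word construction; the result has length
    len(alphabet) ** k when read as a circle.
    """
    n = len(alphabet)
    a = [0] * (n * k)
    seq = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


def linear_de_bruijn(alphabet: str, k: int) -> str:
    """Unroll the cyclic sequence so every k-mer appears in a linear string."""
    cyclic = de_bruijn(alphabet, k)
    return cyclic + cyclic[:k - 1]


# Toy example: all 64 DNA 3-mers packed into 66 bases.
toy = linear_de_bruijn("ACGT", 3)
toy_kmers = [toy[i:i + 3] for i in range(len(toy) - 2)]
```

In this toy case each 3-mer occurs once in a 66-base string, whereas listing the 3-mers separately would take 192 bases.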
Project description: Genomic analyses often involve scanning for potential transcription-factor (TF) binding sites using models of the sequence specificity of DNA-binding proteins. Many approaches have been developed to model and learn a protein’s binding specificity by representing sequence motifs, including the gaps and dependencies between binding-site residues, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For 9 TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro–derived motifs performed similarly to motifs derived from in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices learned by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases. In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences. Protein binding microarray (PBM) experiments were performed for a set of 86 mouse transcription factors. Briefly, the PBMs involved binding GST-tagged DNA-binding proteins to two double-stranded 44K Agilent microarrays, each containing a different de Bruijn sequence design, in order to determine their sequence preferences. Details of the PBM protocol are described in Berger et al., Nature Biotechnology 2006.
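The mononucleotide position weight matrix (PWM) and information content discussed above can be sketched as follows. This is a minimal illustration, not any of the 26 benchmarked methods: the count matrix `toy_counts` is invented, the pseudocount and uniform background of 0.25 are assumptions, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def counts_to_probs(counts, pseudocount=1.0):
    """Convert per-position base counts into probabilities (with a pseudocount)."""
    probs = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        probs.append({b: (col[b] + pseudocount) / total for b in BASES})
    return probs


def information_content(probs):
    """Total information content in bits, assuming a uniform background."""
    ic = 0.0
    for col in probs:
        ic += 2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
    return ic


def score(probs, site, background=0.25):
    """Log-odds score of one candidate site; positions are treated as independent."""
    return sum(math.log2(probs[i][b] / background) for i, b in enumerate(site))


def best_hit(probs, seq):
    """Return (score, start) of the best-scoring forward-strand window."""
    w = len(probs)
    return max((score(probs, seq[i:i + w]), i) for i in range(len(seq) - w + 1))


# Hypothetical count matrix for a 3-bp motif that prefers "TGA".
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_pwm = counts_to_probs(toy_counts)
toy_score, toy_start = best_hit(toy_pwm, "CCTGACC")
```

Because each position contributes independently to the score, a model of this kind cannot represent the gaps or dependencies between binding-site residues that the more complex approaches capture.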
Project description: The DNA-binding activities of transcription factors (TFs) are influenced by both intrinsic sequence preferences and extrinsic interactions with cell-specific chromatin landscapes and other regulatory proteins. Disentangling the roles of these determinants in TF-DNA binding remains challenging. For instance, the FoxA subfamily of Forkhead domain TFs are known pioneer factors, yet their binding varies across cell types, pointing to a combination of intrinsic and extrinsic forces guiding their binding. How such sequence and chromatin influences vary across related Forkhead domain TFs remains mostly uncharacterized. Here, we present a principled approach to compare the relative contributions of intrinsic DNA sequence preference and cell-specific chromatin environments to a TF’s DNA-binding activities. We over-express a selection of Fox TFs in mouse embryonic stem (mES) cells, which offer a platform to contrast each TF's binding activity within the same preexisting chromatin background. By developing and applying a neural network that jointly models sequence and chromatin data, we can evaluate how sequence and preexisting chromatin features contribute to induced TF binding, both at individual sites and genome-wide. We demonstrate that Fox TFs bind different DNA targets and drive differential gene expression patterns, even when induced in identical chromatin settings. Differential Fox binding activities can be attributed to distinct DNA-binding preferences coupled with differential abilities to engage relatively inaccessible chromatin. We propose that varying preferences for preexisting chromatin states enable the functional diversification of paralogous TFs.
Project description: The nematode Caenorhabditis elegans is a powerful model for studying gene regulation, as it has a compact genome and a wealth of genomic tools. However, identification of regulatory elements has been hampered by the fact that DNA binding motifs are known for only 71 (9%) of the estimated 763 high-confidence sequence-specific transcription factors (TFs). To address this problem, we performed protein binding microarray (PBM) experiments on representatives of canonical TF families in the C. elegans TF repertoire, obtaining motifs for 129 distinct TFs. Moreover, we can infer motifs for 97 additional TFs that have DNA binding domains that are very similar to those already characterized, resulting in a total coverage of binding specificities for almost 40% of the C. elegans TF repertoire. These data highlight the diversification of binding motifs for the nuclear hormone receptor (NHR) and C2H2 zinc finger families, and reveal unexpected diversity of motifs for others, including the T-box and DM families. Enrichment of motifs in the promoters of functionally related genes is consistent with known biology in many cases, and also identifies putative new regulatory roles for poorly characterized TFs. The motifs are available at http://cisbp.ccbr.utoronto.ca. Protein binding microarray (PBM) experiments were performed for a set of 129 diverse C. elegans transcription factors. Briefly, the PBMs involved binding GST-tagged DNA-binding proteins to two double-stranded 44K Agilent microarrays, each containing a different de Bruijn sequence design, in order to determine their sequence preferences. Details of the PBM protocol are described in Berger et al., Nature Biotechnology 2006.
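The idea of inferring a motif for an uncharacterized TF from a characterized TF with a very similar DNA-binding domain (DBD) can be sketched as a similarity lookup. This is a simplified illustration under stated assumptions, not the inference procedure used in these studies: the DBD sequences are assumed to be pre-aligned to equal length, the names `percent_identity` and `infer_motif_source` are hypothetical, the toy sequences are invented, and the 0.7 identity threshold is a placeholder rather than a published cutoff.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues between two pre-aligned sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return matches / len(pairs)


def infer_motif_source(query_dbd, characterized, threshold=0.7):
    """Return (name, identity) of the closest characterized TF.

    The name is None when no characterized DBD reaches the
    (hypothetical) identity threshold.
    """
    best_name, best_id = None, 0.0
    for name, dbd in characterized.items():
        pid = percent_identity(query_dbd, dbd)
        if pid > best_id:
            best_name, best_id = name, pid
    if best_id >= threshold:
        return best_name, best_id
    return None, best_id


# Invented aligned DBD fragments for two characterized TFs.
toy_characterized = {"TF_A": "RKRGRQTYTR", "TF_B": "MKKAAQLSWE"}
toy_result = infer_motif_source("RKRGRQTYSR", toy_characterized)
```

In practice an appropriate identity cutoff is likely to differ between DBD families, so a single global threshold like the one above should be read as a placeholder.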
Project description: The human transcription factor (TF) CGGBP1 (“CGG Binding Protein”) is conserved only in amniotes, and is believed to derive from the zf-BED and Hermes transposase DNA-binding domains (DBDs) of a hAT DNA transposon. Here, we examine the DNA binding preferences of a wide variety of metazoan CGGBP1-like TFs with this bipartite domain using PBM experiments. The derived motifs are available at ...
Project description: We describe an effort (“Codebook”) to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments, including in vitro and in vivo assays, produced motifs for most of the uncharacterized TFs analyzed (180, or 53%), the vast majority of which are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple platforms provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. Over 1,421 human TFs are now associated with a DNA binding motif. Extrapolation from the Codebook benchmarking suggests that many of the binding motifs for well-studied TFs may inaccurately describe the TF’s true sequence preferences.
Project description: We performed Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) to profile genome-wide chromatin accessibility in the human H1 embryonic stem cell (ESC) line. We used these data to train a deep learning model called ChromBPNet, which can accurately predict base-resolution accessibility profiles as a function of DNA sequence while accounting for and correcting biases due to the sequence preferences of the Tn5 transposase used in ATAC-seq. We interpreted the models to identify globally predictive transcription factor (TF) motifs, individual predictive motif instances in all accessible regions, and Tn5-bias-corrected canonical footprints of TFs at these predictive motifs.