Dataset Information

Quantitative modeling of transcription factor binding specificities using DNA shape

ABSTRACT: Accurate predictions of the DNA binding specificities of transcription factors (TFs) are necessary for understanding gene regulatory mechanisms. Traditionally, predictive models are built based on nucleotide sequence features. Here, we employed three-dimensional DNA shape information obtained on a high-throughput basis to integrate intuitive DNA structural features into the modeling of TF binding specificities using support vector regression. We performed quantitative predictions of DNA binding specificities, using the DREAM5 dataset for 65 mouse TFs and genomic-context protein binding microarray data for three human basic helix-loop-helix TFs. DNA shape-augmented models compared favorably with sequence-based models for these predictions. Although both k-mer and DNA shape features encoded the interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space compared to k-mer use. Finally, analyzing the weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not obvious from the nucleotide sequence. Three genomic-context protein binding microarray (gcPBM) experiments of human transcription factors were performed. Briefly, the gcPBMs involved binding His-tagged transcription factors c-Myc, Max, and Mad1 (Mxd1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array. “Unbound” probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the gcPBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
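
To make the dimensionality remark in the abstract concrete, the sketch below counts the binary features produced by a position-specific k-mer (one-hot) encoding of a binding site. The encoding scheme, the function names, and the 10-bp example site are illustrative assumptions, not the authors' exact feature definitions.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_one_hot(seq: str, k: int) -> list[int]:
    """Position-specific one-hot encoding (assumed scheme, for illustration).

    Each k-mer start position contributes a block of 4**k indicators,
    exactly one of which is set to 1.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    features: list[int] = []
    for start in range(len(seq) - k + 1):
        block = [0] * len(kmers)
        block[index[seq[start:start + k]]] = 1
        features.extend(block)
    return features


def n_features(length: int, k: int) -> int:
    """Closed-form feature count for the encoding above."""
    return (4 ** k) * (length - k + 1)


site = "GTCACGTGAC"  # example 10-bp site, chosen only for illustration
for k in (1, 2, 3):
    assert len(kmer_one_hot(site, k)) == n_features(len(site), k)
    print(k, n_features(len(site), k))
```

Under this encoding a 10-bp site yields 40, 144, and 512 features for k = 1, 2, and 3. Any encoding that stores a fixed number of values per position grows only linearly with site length, which is the kind of saving the abstract attributes to shape features.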

ORGANISM(S): Homo sapiens

SUBMITTER: Raluca Gordan

PROVIDER: E-GEOD-59845 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress

Publications

Protein-DNA binding in the absence of specific base-pair recognition.

Afek A, Schipper JL, Horton J, Gordân R, Lukatsky DB

Proceedings of the National Academy of Sciences of the United States of America, 2014-10-13, issue 48

Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)-DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF-DNA binding preferences. We used high-throughput protein-DNA binding assays to measure the binding levels and free energies of binding ...

PMID: 25313048

Similar Datasets

Project description: Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)-DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF-DNA binding preferences. We used high-throughput protein-DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein-DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein-DNA binding preferences. Four custom protein binding microarray (PBM) experiments of human transcription factors were performed. Briefly, the PBMs involved binding His-tagged transcription factors c-Myc, Max, and Mad1 (Mxd1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for GTCACGTGAC DNA binding sites flanked by repetitive DNA elements with different symmetries and correlation length scales. The arrays represent three categories of 36-bp sequences: 1) 28,800 probes centered at a GTCACGTGAC site and flanked by repetitive elements (probe names starting with Ariel_); 2) unbound probes (or negative controls); and 3) bound probes, which correspond to randomly selected genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) and which contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). Each DNA sequence represented on the array is present in 6 replicate spots. We report the gcPBM signal intensity for each spot (raw files) as well as the median intensity over the 6 replicate spots (normalized data). The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
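
The step from per-spot intensities to one value per sequence can be sketched as follows. This is a minimal illustration of taking the median over replicate spots only; the input layout, the function name, and the intensity values are hypothetical, and any normalization applied before the median is not reproduced here.

```python
from collections import defaultdict
from statistics import median


def median_by_sequence(spots):
    """Collapse replicate spots to one median intensity per probe sequence.

    `spots` is an iterable of (sequence, intensity) pairs, one per array spot
    (an assumed layout, not the format of the deposited files).
    """
    grouped = defaultdict(list)
    for sequence, intensity in spots:
        grouped[sequence].append(intensity)
    return {sequence: median(values) for sequence, values in grouped.items()}


# Hypothetical intensities for six replicate spots of a single probe.
replicates = (1200.0, 1350.0, 1280.0, 9000.0, 1310.0, 1260.0)
spots = [("GTCACGTGAC", value) for value in replicates]
medians = median_by_sequence(spots)
print(medians)
```

Using the median rather than the mean keeps a single aberrant spot (9000.0 in this made-up example) from dominating the per-sequence value.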

Project description: The Myc-Max heterodimer is a DNA binding protein that regulates expression of a large number of genes. Genome occupancy of Myc-Max is thought to be driven by E-boxes (CACGTG or variants) to which the heterodimer binds in vitro. By analyzing ChIP-Seq datasets, we demonstrated that the positions occupied by Myc-Max across the human genome correlate with the RNA polymerase II (Pol II) transcription machinery better than with E-boxes. Metagene analyses showed that in promoter regions, Myc was uniformly positioned about 100 bp upstream of essentially all promoter proximal paused polymerases with Max about 10 bp upstream of Myc. We re-evaluated the DNA binding properties of full-length Myc-Max proteins using electrophoretic mobility shift assays (EMSA) and protein-binding microarrays (PBM). EMSA results demonstrated Myc-Max heterodimers have high affinity for both E-box containing and non-specific DNA. Quantification of the relative affinities of Myc-Max for all possible 8-mers using PBM assays showed that sequences surrounding core 6-mers significantly affect binding. Compared to the in vitro sequence preferences, Myc-Max genomic occupancy measured by ChIP-Seq was largely, although not completely, independent of sequence specificity. Our results suggest that the transcription machinery and associated promoter accessibility play an important role in genomic occupancy of Myc. Two protein binding microarray (PBM) experiments were performed: one for the heterodimer of the human transcription factors c-Myc and Max, and one for the Max-Max homodimer. Briefly, 4x44K arrays (Agilent Technologies; AmadID 015681) containing the ‘all 10-mer’ universal PBM design were used. Arrays were incubated with a PBS-based protein mixture of either 10 nM His-tagged Myc-Max heterodimer or 10 nM His-tagged Max-Max homodimer, 2% milk, 200 ng/µL BSA, 50 ng/µL salmon testes DNA, and 0.02% TX-100. Bound protein was tagged with 10 ng/µL anti-His antibody conjugated to Alexa 488 (Qiagen; 35310) in PBS with 2% milk. Data were analyzed to obtain fluorescence intensities for all 8-mers. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).

Project description: DNA sequence is a major determinant of the binding specificity of transcription factors (TFs) for their genomic targets. However, eukaryotic cells often express, at the same time, TFs with highly similar DNA binding motifs but distinct in vivo targets. Currently, it is not well understood how TFs with seemingly identical DNA motifs achieve unique specificities in vivo. Here, we used custom protein binding microarrays to analyze TF specificity for putative binding sites in their genomic sequence context. Using yeast TFs Cbf1 and Tye7 as our case study, we found that binding sites of these bHLH TFs (i.e., E-boxes) are bound differently in vitro and in vivo, depending on their genomic context. Computational analyses suggest that nucleotides outside E-box binding sites contribute to specificity by influencing the 3D structure of DNA binding sites. Thus, local shape of target sites might play a widespread role in achieving regulatory specificity within TF families. Three protein binding microarray (PBM) experiments of Saccharomyces cerevisiae transcription factors were performed. Briefly, the PBMs involved binding GST-tagged yeast transcription factors Cbf1 and Tye7 to double-stranded 44K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 30-bp genomic sequences: 1) ChIP-chip bound probes, 2) ChIP-chip unbound probes, and 3) negative control probes. ChIP-chip bound probes corresponded to genomic regions bound in vivo by Cbf1 or Tye7 (ChIP-chip P < 0.005 in rich medium (YPD) (Harbison et al., Nature 2004, PMID 15343339)) and contained at least two consecutive 8-mers with universal PBM E-score > 0.35 (Zhu et al., Genome Research 2009, PMID 19158363). All putative binding sites occurred at the same position within the probes on the array. “ChIP-chip unbound” probes corresponded to genomic regions with ChIP-chip P > 0.5 and at least two consecutive 8-mers at a more stringent universal PBM E-score threshold of 0.4. Negative control probes corresponded to S. cerevisiae intergenic regions with a maximum 8-mer E-score < 0.3. We also designed probes that contain, within constant flanking regions, all 10-bp sequences that occur within the “ChIP-chip bound” probes and contain the E-box CACGTG, but are flanked by synthetic rather than native genomic sequence. Each DNA sequence represented on the array is present in 4 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
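
The "at least two consecutive 8-mers above an E-score threshold" criterion can be sketched as a simple scan. Several things here are assumptions made for illustration: "consecutive" is read as 8-mers starting at adjacent positions, the E-score table is a plain dictionary with made-up values, 8-mers missing from the table are treated as scoring 0.0, and reverse complements are not merged.

```python
def has_consecutive_high_8mers(seq, escore, threshold, run=2):
    """Return True if `run` 8-mers at adjacent start positions all exceed `threshold`.

    `escore` maps 8-mer strings to E-scores (hypothetical lookup table).
    """
    streak = 0
    for start in range(len(seq) - 8 + 1):
        kmer = seq[start:start + 8]
        if escore.get(kmer, 0.0) > threshold:
            streak += 1
            if streak >= run:
                return True
        else:
            streak = 0
    return False


# Made-up E-scores for three overlapping 8-mers around an E-box.
toy_escores = {"GTCACGTG": 0.45, "TCACGTGA": 0.41, "CACGTGAC": 0.20}
probe = "AAGTCACGTGACAA"

print(has_consecutive_high_8mers(probe, toy_escores, threshold=0.35))
print(has_consecutive_high_8mers(probe, toy_escores, threshold=0.42))
```

With these toy values the probe passes at a threshold of 0.35 but fails at 0.42, because only one of the two adjacent 8-mers clears the stricter cutoff.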

Project description: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed. Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2 (Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array. “Unbound” probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
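
The set of "all nnCACGTGnn 10-mers" can be enumerated directly, as in the sketch below. The function name and the output ordering are illustrative choices. The 18 nnnCACGTGnnn 12-mers are not enumerated because the description does not say which 18 were selected.

```python
from itertools import product


def flanked_sites(core: str, flank: int, alphabet: str = "ACGT") -> list[str]:
    """All sequences consisting of `core` with `flank` free bases on each side."""
    sites = []
    for left in product(alphabet, repeat=flank):
        for right in product(alphabet, repeat=flank):
            sites.append("".join(left) + core + "".join(right))
    return sites


ten_mers = flanked_sites("CACGTG", 2)
print(len(ten_mers))   # 4**4 flank combinations
print(ten_mers[:3])
```

Four free positions with four possible bases each give 4^4 = 256 distinct 10-mers around the E-box core.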

Project description: Transcription factors (TFs) play a central role in regulating gene expression by interacting with cis regulatory DNA elements associated with their target genes. Recent surveys have examined the DNA binding specificities of most Saccharomyces cerevisiae transcription factors but a comprehensive evaluation of their data has been lacking. Results: We analyzed in vitro and in vivo TF-DNA binding data reported in previous large-scale studies to generate a comprehensive, curated resource of DNA binding specificity data for all characterized S. cerevisiae transcription factors. Our collection comprises DNA binding site motifs and comprehensive in vitro DNA binding specificity data for all possible 8 bp sequences. Included in this database is DNA binding specificity data for 27 TFs independently generated by PBM analysis in this current study. Investigation of the DNA binding specificities within the basic leucine zipper (bZIP) and VHR transcription factor families revealed unexpected plasticity in TF-DNA recognition: intriguingly, the VHR transcription factors, newly characterized by protein binding microarrays in this study, recognize bZIP-like DNA motifs, while the bZIP transcription factor Hac1 recognizes a motif highly similar to the canonical E-box motif of basic helix-loop-helix (bHLH) transcription factors. We identified several transcription factors with distinct primary and secondary motifs, which might be associated with different regulatory functions. Finally, integrated analysis of in vivo transcription factor binding data with protein binding microarray data lends further support for indirect DNA binding in vivo by sequence-specific transcription factors. 27 protein binding microarray (PBM) experiments of Saccharomyces cerevisiae transcription factors were performed. Briefly, the PBMs involved binding GST-tagged yeast transcription factors to double-stranded 44K Agilent microarrays in order to determine their sequence preferences. The method is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473). A key feature is that the microarrays are composed of de Bruijn sequences that contain each 10-base sequence once and only once, providing an evenly balanced sequence distribution. Individual de Bruijn sequences have different properties, including representation of gapped patterns. The array probe sequences on the custom array design used in this study were reported previously in Berger et al., Cell 2008 (PMID 18585359) and are available via an academic research use license. Here we provide the data transformed into median signal intensities (after normalization and detrending of the original array data) for all 32,896 8-base sequences, Z-scores for these intensities, and E-scores. E-scores are a modified version of AUC and describe how well each 8-mer ranks the intensities of the spots. In general, the E-scores are slightly more reproducible than Z-scores, but contain less information about relative binding affinity. Additional experimental details are found in Berger et al., Nature Biotechnology 2006, Gordan et al., Genome Biology (in press), and the accompanying Supplementary information.
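
One way to see where the figure of 32,896 8-base sequences comes from: on double-stranded DNA an 8-mer and its reverse complement describe the same site, so the 4^8 = 65,536 strings collapse to (4^8 + 4^4)/2 classes, where 4^4 counts the 8-mers that equal their own reverse complement. The sketch below checks this by brute force; the helper names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Representative of a k-mer and its reverse complement (lexicographic minimum)."""
    return min(seq, reverse_complement(seq))


def count_canonical_kmers(k: int) -> int:
    """Number of distinct k-mers when reverse complements are merged."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})


print(count_canonical_kmers(8))
print((4 ** 8 + 4 ** 4) // 2)
```

Both lines print 32896, matching the number of 8-base sequences quoted in the description.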

Project description: The sequence specificity of DNA-binding proteins is the primary mechanism by which the cell recognizes genomic features. Here, we describe systematic determination of yeast transcription factor DNA-binding specificities. We obtained binding specificities for 112 DNA-binding proteins representing 19 distinct structural classes. One-third of the binding specificities have not been previously reported. Several binding sequences have striking genomic distributions relative to transcription start sites, supporting their biological relevance and suggesting a role in promoter architecture. Among these are Rsc3 binding sequences, containing the core CGCG, which are found preferentially ~100 bp upstream of transcription start sites. Mutation of RSC3 results in a dramatic increase in nucleosome occupancy in hundreds of proximal promoters containing a Rsc3 binding element, but has little impact on promoters lacking Rsc3 binding sequences, indicating that Rsc3 plays a broad role in targeting nucleosome exclusion at yeast promoters. Keywords: Protein binding microarrays, DNA, proteins. Protein binding microarray (PBM), ChIP-chip and DIP-chip experiments of yeast transcription factor DNA-binding domains were performed. Briefly, the PBMs involved binding GST-tagged DNA-binding proteins to custom-designed, double-stranded 44K Agilent microarrays in order to determine their sequence preferences. The method is described in Berger et al., Nature Biotechnology 2006. A key feature is that the microarrays are composed of de Bruijn sequences that contain each 10-base sequence once and only once, providing an evenly balanced sequence distribution. Individual de Bruijn sequences have different properties, including representation of gapped patterns. Here we provide the data transformed into median intensities for all 32,896 8-base sequences, Z-scores for these intensities, and E-scores. E-scores are a modified version of AUC, and describe how well each 8-mer ranks the intensities of the spots. In general the E-scores are slightly more reproducible than Z-scores, but contain less information about relative binding affinity. Additional experimental details are found in Berger et al., Nature Biotechnology 2006, Berger et al., Cell 2008, and the accompanying Supplementary information. Raw 35-mer array data is available on the web link provided. For the DIP-chip experiments [GSM345371, GSM345403, GSM345414-GSM345421, GSM345429-GSM345432], genomic DNA isolated from S288C yeast was incubated for 30 minutes with 40 nM of the MBP-tagged DNA binding domain (DBD) of either Cbf1, Pho2, Pho4, Leu3, Rap1, or Swi5 prior to purification of protein-DNA complexes. The bound DNA was then isolated, amplified via Invitrogen's WGA protocol, and hybridized against input DNA on NimbleGen 385k 32bp-tiling whole genome arrays. ChIPOTle was used to identify peaks of binding from the data and motifs were identified by BioProspector and MDScan and then scored for their ability to predict the identified peaks by GOMER. Motifs with the best ROC AUC are reported in the paper. For the ChIP-chip experiments [GSM346493 and GSM346494], isogenic wildtype and rsc3-1 strains carrying Rsc8-TAP were grown in parallel under rsc3-1 restrictive growth conditions (37°C). Following formaldehyde crosslinking, cells were homogenized and extracts were sonicated to shear the chromatin to an average size of ~500 bp. A single pulldown was then performed with IgG sepharose beads and after decrosslinking and LM-PCR amplification of purified IP DNA, samples were labeled and hybridized on NimbleGen 32bp whole genome tiling arrays, comparing the pulled-down DNA to input genomic DNA.
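
The de Bruijn property mentioned in the description (every 10-base sequence present once and only once) can be illustrated with a textbook construction. The sketch below uses the standard Lyndon-word algorithm and a small order of 3 so that the check runs instantly; it is not claimed to be the construction used to design the actual arrays, and the function name is illustrative.

```python
def de_bruijn(alphabet: str, n: int) -> str:
    """Cyclic de Bruijn sequence of order n via the Lyndon-word construction."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence: list[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


order = 3  # the arrays described above use order 10; 3 keeps the demo small
cyclic = de_bruijn("ACGT", order)
linear = cyclic + cyclic[:order - 1]  # unroll the cycle so every window is visible
windows = [linear[i:i + order] for i in range(len(cyclic))]
print(len(cyclic), len(set(windows)))
```

For order 3 the cyclic sequence has 4^3 = 64 bases and its 64 windows are all distinct, which is the "once and only once" property. At order 10 the same construction would yield 4^10 = 1,048,576 bases.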

Project description: Most homeodomains are unique within a genome, yet many are highly conserved across vast evolutionary distances, implying strong selection on their precise DNA-binding specificities. We determined the binding preferences of the majority (168) of mouse homeodomains to all possible 8-base sequences, revealing rich and complex patterns of sequence specificity, and showing for the first time that there are at least 65 distinct homeodomain DNA-binding activities. We developed a computational system that successfully predicts binding sites for homeodomain proteins as distant from mouse as Drosophila and C. elegans, and we infer full 8-mer binding profiles for the majority of known animal homeodomains. Our results provide an unprecedented level of resolution in the analysis of this simple domain structure and suggest that variation in sequence recognition may be a factor in its functional diversity and evolutionary success. Keywords: Mouse homeodomain protein binding microarrays. 178 protein binding microarray (PBM) experiments of mouse homeodomains were performed, with 10 proteins done in replicate. Briefly, the PBMs involved binding GST-tagged mouse homeodomains to custom-designed, double-stranded 44K Agilent microarrays in order to determine their sequence preferences. The method is described in Berger et al., Nature Biotechnology 2006. A key feature is that the microarrays are composed of de Bruijn sequences that contain each 10-base sequence once and only once, providing an evenly balanced sequence distribution. Individual de Bruijn sequences have different properties, including representation of gapped patterns. The array sequences as well as the primary array data are available via a EULA at http://the_brain.bwh.harvard.edu/pbms/webworks2/. Here we provide the data transformed into median intensities (after normalization and detrending of the original array data) for all 32,896 8-base sequences, Z-scores for these intensities, and E-scores. E-scores are a modified version of AUC, and describe how well each 8-mer ranks the intensities of the spots. In general the E-scores are slightly more reproducible than Z-scores, but contain less information about relative binding affinity. Additional experimental details are found in Berger et al., Nature Biotechnology 2006, Berger et al., Cell 2008, and the accompanying Supplementary information.
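
The idea of scoring how well an 8-mer "ranks the intensities of the spots" can be sketched with a plain AUC. This is a simplified stand-in: the E-score is described above only as a modified version of AUC, and those modifications are not reproduced here. The spot data are made up, a 6-base motif is used to keep the toy sequences short, reverse complements are ignored, and subtracting 0.5 to centre the value at zero is an illustrative choice.

```python
def auc(foreground, background):
    """Probability that a random foreground intensity exceeds a random
    background intensity, counting ties as one half."""
    wins = 0.0
    for f in foreground:
        for b in background:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(foreground) * len(background))


def split_by_kmer(spots, kmer):
    """Partition spot intensities by whether the probe contains `kmer`."""
    fg = [intensity for seq, intensity in spots if kmer in seq]
    bg = [intensity for seq, intensity in spots if kmer not in seq]
    return fg, bg


# Hypothetical (probe sequence, intensity) pairs.
spots = [
    ("AACACGTGTT", 900.0),
    ("GGCACGTGCC", 850.0),
    ("AAAAAAAATT", 100.0),
    ("CCCCCCCCGG", 950.0),
    ("TTTTGGGGAA", 120.0),
]
fg, bg = split_by_kmer(spots, "CACGTG")
enrichment = auc(fg, bg) - 0.5  # centred at zero; an assumption for illustration
print(round(enrichment, 3))
```

Because the score depends only on the rank order of intensities and not on their magnitudes, a rank-based measure of this kind carries less information about relative binding affinity than an intensity-based Z-score, consistent with the remark in the description.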

