Unknown,Transcriptomics,Genomics,Proteomics

Dataset Information

0

Stability selection for regression-based models of transcription factor-DNA binding specificity


ABSTRACT: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TF with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed. Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2(Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. Briefly, we represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 celld (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occurr at the same position within the probes on the array. M-bM-^@M-^\UnboundM-bM-^@M-^] probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).

ORGANISM(S): Homo sapiens

SUBMITTER: Raluca Gordan 

PROVIDER: E-GEOD-47026 | biostudies-arrayexpress |

REPOSITORIES: biostudies-arrayexpress

Similar Datasets

2013-03-29 | E-GEOD-44436 | biostudies-arrayexpress
2013-03-29 | E-GEOD-44437 | biostudies-arrayexpress
2014-11-04 | E-GEOD-59845 | biostudies-arrayexpress
2014-10-01 | E-GEOD-61920 | biostudies-arrayexpress
2011-12-10 | E-GEOD-34306 | biostudies-arrayexpress
2014-09-22 | E-GEOD-58570 | biostudies-arrayexpress
2012-12-13 | E-GEOD-42864 | biostudies-arrayexpress
2015-04-23 | E-GEOD-65719 | biostudies-arrayexpress
2012-01-28 | E-GEOD-35380 | biostudies-arrayexpress
2016-07-17 | E-GEOD-83442 | biostudies-arrayexpress