Transcription Factor Binding Sites by ChIP-seq from ENCODE/HAIB
ABSTRACT: These data were generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli, fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with these data, contact ENCODE (genome@soe.ucsc.edu). The ChIP-Seq method was used to assay chromatin fragments bound by specific or general transcription factors as described below. DNA isolated by ChIP-Seq was size-selected (~225 bp) and sequenced. Short reads of 25-36 bp were mapped to the human reference genome, and regions enriched in read density relative to total input chromatin control reads were identified. The sequence reads with quality scores (fastq files) and alignment coordinates (BAM files) from these experiments are available for download. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf. Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). Cross-linked chromatin was immunoprecipitated with an antibody. The protein:DNA crosslinks were then reversed and the DNA fragments were recovered and sequenced. Please see the protocol notes below and go to http://hudsonalpha.org/myers-lab/protocols for the most current version of the protocol. Biological replicates from each experiment were completed. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Sequence data produced by the Illumina data pipeline software were quality filtered and then mapped to NCBI Build37 (hg19) using the integrated Eland software; 32 nt of the sequence reads were used for alignment; up to two mismatches were tolerated; reads that mapped to multiple sites in the genome were discarded.
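The alignment rules stated above (32 nt used per read, at most two mismatches, multi-mapping reads discarded) can be illustrated with a minimal Python sketch. It is not the Eland implementation: the `Hit` record, the function names, and the choice to take the first 32 nt of each read are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

ALIGN_LENGTH = 32     # nt of each read used for alignment (from the text)
MAX_MISMATCHES = 2    # mismatches tolerated (from the text)


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one candidate alignment of a read."""
    chrom: str
    pos: int
    mismatches: int


def trim_read(seq: str) -> str:
    # Assumption: the 32 nt used for alignment are the first 32 nt of the read.
    return seq[:ALIGN_LENGTH]


def unique_hit(hits: List[Hit]) -> Optional[Hit]:
    """Return the single acceptable hit, or None if the read is discarded."""
    acceptable = [h for h in hits if h.mismatches <= MAX_MISMATCHES]
    # Reads with no acceptable site, or with more than one, are dropped.
    return acceptable[0] if len(acceptable) == 1 else None
```

Under these assumptions, a read with one acceptable site is kept, while a read with two acceptable sites, or with only sites exceeding two mismatches, is discarded.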
To identify likely binding sites, peak calling was applied to the aligned sequence data sets using Model-based Analysis of ChIP-Seq (MACS) (Zhang Y, et al., 2008) (http://liulab.dfci.harvard.edu/MACS/00README.html). MACS models the shift size of ChIP-Seq tags empirically, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to capture local biases in the genome, allowing for more robust predictions (Zhang Y, et al., 2008). Protocol Notes: Several changes and improvements were made to the original ChIP-Seq protocol (Johnson et al., 2008). The major differences between protocols are the number of cells and magnetic beads used for IP, the method of sonication used to fragment DNA, and the number of cycles of PCR used to amplify the sequencing library. The most current protocol used by the Myers lab can be found at http://hudsonalpha.org/myers-lab/protocols. The protocol field for each file denotes the version of the protocol used as being PCR1x, PCR2x or a version number (for example, v041610.1). The sequencing libraries labeled as PCR2x were made with two rounds of amplification (25 and 15 cycles) and those labeled as PCR1x were made with one 15-cycle round of amplification. These PCR1x and PCR2x experiments were completed prior to January 2010 and were originally aligned to NCBI Build36 (hg18). They have been re-aligned to NCBI Build37 (hg19) with the Bowtie software (Langmead et al., 2009; http://bowtie-bio.sourceforge.net/index.shtml) for this data release. The libraries labeled with a protocol version number were completed after January 2010 and were only aligned to NCBI Build37 (hg19). Please refer to the Myers Lab website (http://hudsonalpha.org/myers-lab/protocols) for details on each protocol version. Verification: The MACS (http://liulab.dfci.harvard.edu/MACS/00README.html) peak caller was used to call significant peaks on the individual replicates of a ChIP-Seq experiment.
Afterwards, the irreproducible discovery rate (IDR) method, developed by Li et al. (submitted), was used to quantify the consistency between pairs of ranked peak lists from replicates. The IDR method uses a model that assumes that the ranked lists of peaks in a pair of replicates consist of two groups: a reproducible group and an irreproducible group. In general, the signals in the reproducible group are more consistent (i.e., with a larger rank correlation coefficient) and are ranked higher than those in the irreproducible group. The proportion of peaks that belong to the irreproducible component and the correlation of the reproducible component are estimated adaptively from the data. The model also provides an IDR score for each peak, which reflects the posterior probability of the peak belonging to the irreproducible group. The aligned reads were pooled from all replicates and the MACS peak caller was used to call significant peaks on the pooled data. Only datasets containing at least 100 peaks passing the IDR threshold are considered valid and submitted for release.
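The release criterion described above (at least 100 peaks passing the IDR threshold) can be sketched in Python as follows. This is only an illustration: the record does not state the IDR threshold, so it is a required parameter here, and both the function names and the convention that a peak passes when its score is less than or equal to the threshold are assumptions.

```python
from typing import Iterable

MIN_PEAKS = 100  # minimum number of passing peaks (from the text)


def count_passing(idr_scores: Iterable[float], idr_threshold: float) -> int:
    # A lower IDR score means the peak is less likely to be irreproducible.
    # Assumption: a peak passes when its score is <= the threshold.
    return sum(1 for score in idr_scores if score <= idr_threshold)


def is_releasable(idr_scores: Iterable[float], idr_threshold: float) -> bool:
    """True if the dataset has at least MIN_PEAKS peaks passing the threshold."""
    return count_passing(idr_scores, idr_threshold) >= MIN_PEAKS
```

For example, with a hypothetical threshold of 0.01, a dataset with exactly 100 peaks scoring at or below 0.01 would be considered valid, while one with 99 such peaks would not.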
ORGANISM(S): Homo sapiens
SUBMITTER: ENCODE DCC
PROVIDER: E-GEOD-32465 | biostudies-arrayexpress
REPOSITORIES: biostudies-arrayexpress