Dataset Information

Efficiently identifying genome-wide changes with next-generation sequencing data.

ABSTRACT: We propose a new and effective statistical framework for identifying genome-wide differential changes in epigenetic marks with ChIP-seq data or gene expression with mRNA-seq data, and we develop a new software tool EpiCenter that can efficiently perform data analysis. The key features of our framework are: (i) providing multiple normalization methods to achieve appropriate normalization under different scenarios, (ii) using a sequence of three statistical tests to eliminate background regions and to account for different sources of variation and (iii) allowing adjustment for multiple testing to control false discovery rate (FDR) or family-wise type I error. Our software EpiCenter can perform multiple analytic tasks including: (i) identifying genome-wide epigenetic changes or differentially expressed genes, (ii) finding transcription factor binding sites and (iii) converting multiple-sample sequencing data into a single read-count data matrix. By simulation, we show that our framework achieves a low FDR consistently over a broad range of read coverage and biological variation. Through two real examples, we demonstrate the effectiveness of our framework and the usages of our tool. In particular, we show that our novel and robust 'parsimony' normalization method is superior to the widely-used 'tagRatio' method. Our software EpiCenter is freely available to the public.
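Feature (iii) above, adjustment for multiple testing to control the FDR, can be illustrated with the Benjamini-Hochberg procedure. The abstract does not say which FDR procedure EpiCenter uses, so this is only a minimal sketch under that assumption; the function name benjamini_hochberg is hypothetical and is not part of EpiCenter.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative only: assumes BH is an acceptable stand-in for the
    FDR adjustment mentioned in the abstract.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-region p-values.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Regions whose adjusted p-value falls below the chosen FDR level (for example 0.05) would be reported as differential.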

SUBMITTER: Huang W

PROVIDER: S-EPMC3201882 | biostudies-literature | 2011 Oct

REPOSITORIES: biostudies-literature


Publications

Efficiently identifying genome-wide changes with next-generation sequencing data.

Huang W, Umbach DM, Vincent Jordan N, Abell AN, Johnson GL, Li L

Nucleic Acids Research, 2011 Jul 29, issue 19


PMID: 21803788

Similar Datasets

Project description: Currently, there are many publicly available Next Generation Sequencing tools developed for variant annotation and classification. However, as modern sequencing technology produces more and more sequencing data, a more efficient analysis program is desired, especially for variant analysis. In this study, we updated SNPAAMapper, a variant annotation pipeline, by converting its Perl code to Python, generating annotation output with improved computational efficiency and updated information for broader applicability. The new pipeline written in Python can classify variants by region (Coding Sequence, Untranslated Regions, upstream, downstream, intron), predict amino acid change type (missense, nonsense, etc.), and prioritize mutation effects (e.g., synonymous > non-synonymous) while being faster and more efficient. Our new pipeline works in five steps. First, exon annotation files are generated. Next, the exon annotation files are processed, and gene mapping and feature information files are produced. Afterward, the Python scripts classify the variants based on genomic regions and predict the amino acid change category. Lastly, another Python script prioritizes and ranks the mutation effects of variants to output the result file. The Python version of SNPAAMapper achieved an overall speedup by running most annotation steps in substantially less time. The Python script can classify variants by region in 53 s compared to 166 s for the Perl script in a test sample run on a Latitude 7480 Desktop computer with 8 GB RAM and an Intel Core i5-6300 CPU @ 2.4 GHz. Steps of predicting amino acid change type and prioritizing mutation effects of variants were executed within 1 s for both pipelines. SNPAAMapper-Python was developed and tested on the ClinVar database, an NCBI database of information on genomic variation and its relationship to human health.
We believe our Python version of the SNPAAMapper variant annotation pipeline will benefit the community by elucidating variant consequences and speeding up the discovery of causative genetic variants through whole genome/exome sequencing. Source code, test data files, instructions, and further explanations are available on the web at https://github.com/BaiLab/SNPAAMapper-Python.
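The amino acid change prediction step described above can be sketched roughly as follows. This is not SNPAAMapper's actual code: the function classify_codon_change and its category labels are hypothetical, and the sketch assumes the standard genetic code and a single-codon substitution on the coding strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# nuclear standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"     # new premature stop codon
    if ref_aa == "*":
        return "stop-loss"    # original stop codon removed
    return "missense"         # different amino acid


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("TGG", "TGA"), ("GAA", "GTA")]:
        print(ref, alt, classify_codon_change(ref, alt))
```

A real pipeline would also need to handle strand orientation, insertions/deletions, and variants spanning exon boundaries, none of which this sketch covers.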

Project description: Pancreatic adenocarcinoma (PAC) is among the most lethal malignancies. While research has implicated multiple genes in disease pathogenesis, identification of therapeutic leads has been difficult and the majority of currently available therapies provide only marginal benefit. To address this issue, our goal was to genomically characterize individual PAC patients to understand the range of aberrations that are occurring in each tumor. Because our understanding of PAC tumorigenesis is limited, evaluation of separate cases may reveal aberrations that are less common but may provide relevant information on the disease, or that may represent viable therapeutic targets for the patient. We used next generation sequencing to assess global somatic events across 3 PAC patients to characterize each patient and to identify potential targets. This study is the first to report whole genome sequencing (WGS) findings in paired tumor/normal samples collected from 3 separate PAC patients. We generated on average 132 billion mappable bases across all patients using WGS, and identified 142 somatic coding events including point mutations, insertion/deletions, and chromosomal copy number variants. We did not identify any significant somatic translocation events. We also performed RNA sequencing on 2 of these patients' tumors for which tumor RNA was available to evaluate expression changes that may be associated with somatic events, and generated over 100 million mapped reads for each patient. We further performed pathway analysis of all sequencing data to identify processes that may be the most heavily impacted by somatic and expression alterations. As expected, the KRAS signaling pathway was the most heavily impacted pathway (P<0.05), along with tumor-stroma interactions and tumor suppressive pathways.
While sequencing of more patients is needed, the high resolution genomic and transcriptomic information we have acquired here provides valuable information on the molecular composition of PAC and helps to establish a foundation for improved therapeutic selection.

