Dataset Information

Finding the active genes in deep RNA-seq gene expression studies.

ABSTRACT: Early application of second-generation sequencing technologies to transcript quantitation (RNA-seq) has hinted at a vast mammalian transcriptome, including transcripts from nearly all known genes, which might be fully measured only by ultradeep sequencing. Subsequent studies suggested that low-abundance transcripts might be the result of technical or biological noise rather than active transcripts; moreover, most RNA-seq experiments did not provide enough read depth to generate high-confidence estimates of gene expression for low-abundance transcripts. As a result, the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 to 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. We provide an updated view of the accuracy and efficiency of RNA-seq experiments, using genomic data from large-scale studies like the ENCODE project to provide orthogonal information against which to validate our conclusions. We show that a human cell's transcriptome can be divided into active genes carrying out the work of the cell and other genes that are likely the by-products of biological or experimental noise. We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations. The zFPKM normalization method accurately separates the biologically relevant genes in a cell, which are associated with active promoters, from the ultralow-expression noisy genes that have repressed promoters. A read depth of twenty to thirty million mapped reads allows high-confidence quantitation of genes expressed at this threshold, providing important guidance for the design of RNA-seq studies of gene expression. Moreover, we offer an example for using extensive ENCODE chromatin state information to validate RNA-seq analysis pipelines.
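
The abstract names the zFPKM metric but does not describe how it is computed. The following is a minimal Python sketch of one way a zFPKM-style score could be obtained, under assumptions that are not stated in the abstract: the active genes are taken to form a roughly Gaussian peak in the log2(FPKM) distribution, the peak position is taken as the global maximum of a kernel density estimate, and the spread is estimated from the right half of the peak only, so that low-expression noise on the left does not inflate it. The function names `zfpkm` and `is_active`, the bandwidth rule, and the default cutoff of -3 are illustrative choices, not values taken from this record.

```python
import math
from statistics import mean, stdev


def zfpkm(fpkm, grid_points=512):
    """Return a zFPKM-style score per gene for one sample.

    Genes with FPKM <= 0 have no defined log2 value and get None.
    Assumption: the tallest density peak of log2(FPKM) is the active-gene peak.
    """
    logs = [math.log2(x) for x in fpkm if x > 0]
    if len(logs) < 2:
        raise ValueError("need at least two genes with FPKM > 0")

    # Gaussian kernel density estimate on a regular grid.
    # The rule-of-thumb bandwidth below is an assumption of this sketch.
    bandwidth = 1.06 * stdev(logs) * len(logs) ** (-1 / 5)
    if bandwidth == 0:
        raise ValueError("all expressed genes have the same FPKM")
    lo, hi = min(logs), max(logs)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    density = [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in logs)
        for g in grid
    ]

    # Peak position of the estimated density: the centre of the Gaussian fit.
    mu = grid[density.index(max(density))]

    # For a Gaussian, the mean of the values above the centre equals
    # mu + sigma * sqrt(2 / pi); solve that relation for sigma.
    upper = [v for v in logs if v > mu]
    if not upper:
        raise ValueError("no values above the density peak")
    sigma = (mean(upper) - mu) * math.sqrt(math.pi / 2)

    return [(math.log2(x) - mu) / sigma if x > 0 else None for x in fpkm]


def is_active(z_scores, cutoff=-3.0):
    """Flag genes at or above the cutoff (the default is an assumed value)."""
    return [z is not None and z >= cutoff for z in z_scores]
```

Calling `is_active(zfpkm(sample_fpkm))` on a list of per-gene FPKM values for one sample returns one boolean per gene. Real samples in which the noise peak is taller than the active peak would need a more careful peak-selection step than the global maximum used here.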

SUBMITTER: Hart T

PROVIDER: S-EPMC3870982 | biostudies-literature | 2013 Nov

REPOSITORIES: biostudies-literature

Publications

Finding the active genes in deep RNA-seq gene expression studies.

Traver Hart, H. Kiyomi Komori, Sarah LaMere, Katie Podshivalova, Daniel R. Salomon

BMC Genomics, 2013 Nov 11

PMID: 24215113

Similar Datasets

Project description: BACKGROUND: Siraitia grosvenorii (Luohanguo) is an herbaceous perennial plant native to southern China and most prevalent in Guilin city. Its fruit contains a sweet, fleshy, edible pulp that is widely used in traditional Chinese medicine. The major bioactive constituents in the fruit extract are the cucurbitane-type triterpene saponins known as mogrosides. Among them, mogroside V is nearly 300 times sweeter than sucrose. However, little is known about mogroside biosynthesis in S. grosvenorii, especially the late steps of the pathway. RESULTS: In this study, a cDNA library generated from equal amounts of RNA taken from S. grosvenorii fruit at 50 days after flowering (DAF) and 70 DAF was sequenced using the Illumina/Solexa platform. More than 48,755,516 high-quality reads were generated from the cDNA library. De novo assembly and gap-filling produced 43,891 unigenes with an average sequence length of 668 base pairs. A total of 26,308 (59.9%) unique sequences were annotated and 11,476 of the unique sequences were assigned to specific metabolic pathways by the Kyoto Encyclopedia of Genes and Genomes. cDNA sequences for all of the known enzymes involved in mogroside backbone synthesis were identified from our library. Additionally, a total of eighty-five cytochrome P450 (CYP450) and ninety UDP-glucosyltransferase (UDPG) unigenes were identified, some of which appear to encode enzymes responsible for the conversion of the mogroside backbone into the various mogrosides. Digital gene expression profile (DGE) analysis using Solexa sequencing was performed on three important stages of fruit development, and based on their expression pattern, seven CYP450s and five UDPGs were selected as the candidates most likely to be involved in mogroside biosynthesis.
CONCLUSION: A combination of RNA-seq and DGE analysis based on next-generation sequencing technology was shown to be a powerful method for identifying candidate genes encoding enzymes responsible for the biosynthesis of novel secondary metabolites in a non-model plant. Seven CYP450s and five UDPGs were selected as potential candidates involved in mogroside biosynthesis. The transcriptome data from this study provide an important resource for understanding the formation of major bioactive constituents in the fruit extract from S. grosvenorii.

Project description: Background: Neuromuscular junctions (NMJs) are chemical synapses formed between motor neurons and skeletal muscle fibers and are essential for controlling muscle contraction. NMJ dysfunction causes motor disorders, muscle wasting, and even breathing difficulties. Increasing evidence suggests that many NMJ disorders are closely related to alterations in specific gene products that are highly concentrated in the synaptic region of the muscle. However, many of these proteins are still undiscovered. Thus, screening for NMJ-specific proteins is essential for studying NMJ and the pathogenesis of NMJ diseases. Results: In this study, synaptic regions (SRs) and nonsynaptic regions (NSRs) of diaphragm samples from newborn (P0) and adult (3-month-old) mice were used for RNA-seq. A total of 92 and 182 genes were identified as differentially expressed between the SR and NSR in newborn and adult mice, respectively. Meanwhile, a total of 1563 genes were identified as differentially expressed between the newborn SR and adult SR. Gene Ontology (GO) enrichment analyses, Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis and gene set enrichment analysis (GSEA) of the DEGs were performed. Protein-protein interaction (PPI) networks were constructed using STRING and Cytoscape. Further analysis identified some novel proteins and pathways that may be important for NMJ development, maintenance and maturation. Specifically, Sv2b, Ptgir, Gabrb3, P2rx3, Dlgap1 and Rims1 may play roles in NMJ development. Hcn1 may localize to the muscle membrane to regulate NMJ maintenance. Trim63, Fbxo32 and several Asb family proteins may regulate muscle developmental-related processes. Conclusion: Here, we present a complete dataset describing the spatiotemporal transcriptome changes in synaptic genes and important synaptic pathways. The neuronal projection-related pathway, ion channel activity and neuroactive ligand-receptor interaction pathway are important for NMJ development. The myelination and voltage-gated ion channel activity pathway may be important for NMJ maintenance. These data will facilitate the understanding of the molecular mechanisms underlying the development and maintenance of NMJ and the pathogenesis of NMJ disorders.
