Project description:RNA-seq is now widely accepted as a more powerful and adaptable technique than hybridization arrays. Nevertheless, because RNA-seq requires more complex data analysis, it has generated a great deal of research on algorithms and workflows. This has resulted in an exponential increase in the options available at each step of the analysis. Consequently, there is no clear consensus on the appropriate algorithms and pipelines that should be used to analyse RNA-seq data. In the present study, 192 pipelines on 18 samples from 2 human cell lines were evaluated. Absolute gene expression quantification was assessed by non-parametric statistics to measure precision and accuracy. Relative gene expression performance was estimated by testing 19 differential expression methods. These results were contrasted in parallel with the microarray HTA 2.0 data from Affymetrix using the same set of samples. All procedures were validated by qRT-PCR on 32 genes in all samples. In addition, this study proposes a new statistical approach for precision and accuracy evaluation on real RNA-seq data. It also weighs up the advantages and disadvantages of the algorithms and pipelines tested and gives a guide to selecting the appropriate pipeline to analyse RNA-seq and microarray data.
Project description:A large number of computational methods have been recently developed for analyzing differential gene expression (DE) in RNA-seq data. We report on a comprehensive evaluation of the commonly used DE methods using the SEQC benchmark data set and data from the ENCODE project. We evaluated a number of key features including: normalization, accuracy of DE detection and DE analysis when one condition has no detectable expression. We found significant differences among the methods. Furthermore, computational methods designed for DE detection from expression array data perform comparably to methods customized for RNA-seq. Most importantly, our results demonstrate that increasing the number of replicate samples significantly improves detection power over increased sequencing depth. The Sequencing Quality Control Consortium generated two datasets from two reference RNA samples in order to evaluate transcriptome profiling by next-generation sequencing technology. Each sample contains one of the reference RNA sources and a set of synthetic RNAs from the External RNA Control Consortium (ERCC) at known concentrations. Group A contains 5 replicates of the Stratagene Universal Human Reference RNA (UHRR), which is composed of total RNA from 10 human cell lines, with 2% by volume of ERCC mix 1. Group B includes 5 replicate samples of the Ambion Human Brain Reference RNA (HBRR) with 2% by volume of ERCC mix 2. The ERCC spike-in control is a mixture of 92 synthetic polyadenylated oligonucleotides, 250-2000 nucleotides in length, that are meant to resemble human transcripts.
Project description:SnowShoes-FTD, a fusion transcript discovery tool, was used to identify fusions in breast cancer cell lines using RNA-Seq data. Total RNA was extracted from the cell lines and used to construct RNA-Seq libraries for sequencing.
Project description:The ubiquity of RNA-seq has led to many methods that use RNA-seq data to analyze variations in RNA splicing. However, available methods are not well suited for handling heterogeneous and large datasets. Such datasets scale to thousands of samples across dozens of experimental conditions, exhibit increased variability compared to biological replicates, and involve thousands of unannotated splice variants resulting in increased transcriptome complexity. We describe here a suite of algorithms and tools implemented in the MAJIQ v2 package to address challenges in detection, quantification, and visualization of splicing variations from such datasets. Here we created a large, realistic synthetic RNA-seq dataset of 150 simulated cerebellum samples and 150 skeletal muscle samples using BEERS. We use this as a benchmark dataset to assess the advantages of MAJIQ v2 compared to existing methods.
Project description:Here, we have collapsed multiple analysis problems into two coherent categories, signal detection and signal estimation, and adapted linear-optimal solutions from signal processing theory. Our algorithms for detection (DFilter) and estimation (EFilter) extend naturally to integration of multiple datasets. In benchmarking tests, DFilter outperformed assay-specific algorithms at identifying promoters from histone ChIP-seq, binding sites from transcription factor (TF) ChIP-seq and open chromatin regions from DNase- and FAIRE-seq data. EFilter similarly outperformed an existing method for predicting mRNA levels from histone ChIP-seq data (Spearman correlation: 0.81 - 0.89). We performed H3K4me3 and H3K36me3 ChIP-seq on e11.5 mouse forebrain and used DFilter and EFilter to predict promoters and developmental gene expression, uncovering plausible gene targets for SNPs associated with neurodevelopmental disorders. Two histone modification ChIP-seq datasets were generated from developing mouse embryo forebrain and used to make biological inferences.
Project description:Data analysis is a critical part of quantitative proteomics studies in interpreting biological questions. Numerous computational tools for protein quantification, imputation, and differential expression (DE) analysis have been developed in the past decade. However, identifying the optimal tools remains an unsolved issue. Moreover, owing to the rapid development of RNA-Seq technology, a vast number of DE analysis methods have been created. Whether these newly developed RNA-Seq-oriented tools can be applied to proteomics data is still a question that needs to be addressed. In order to benchmark these analysis methods, a proteomics dataset consisting of proteins derived from human, yeast, and Drosophila mixed at different ratios was generated. Based on this dataset, DE analysis tools (both array-based and RNA-Seq-based), imputation algorithms, and protein quantification methods were compared and benchmarked. This study provides useful information on analyzing quantitative proteomics datasets. All the methods used in this study were integrated into Perseus, which is available at https://www.maxquant.org/perseus.
Project description:Advantages of RNA-Seq over array-based platforms are quantitative gene expression and discovery of expressed single nucleotide variants (eSNVs) and fusion transcripts from a single platform, but the sensitivity for each of these characteristics is unknown. We measured gene expression in a set of manually degraded RNAs and in nine pairs of matched fresh-frozen and FFPE RNA isolated from breast tumors, using the hybridization-based NanoString nCounter (226-gene panel) and whole-transcriptome RNA-Seq with RiboZeroGold ScriptSeq V2 library preparation kits. We performed correlation analyses of gene expression between samples and across platforms. We then specifically assessed whole-transcriptome expression of lincRNA and discovery of eSNVs and fusion transcripts in the FFPE RNA-Seq data. For gene expression in the manually degraded samples, we observed Pearson correlations of >0.94 and >0.80 with the NanoString and ScriptSeq protocols, respectively. Gene expression data for matched fresh-frozen and FFPE samples yielded mean Pearson correlations of 0.874 and 0.783 for the NanoString (226 genes) and ScriptSeq whole-transcriptome protocols, respectively. Specifically for lincRNAs, we observed a very high Pearson correlation (0.988) between matched fresh-frozen and FFPE pairs. FFPE samples across NanoString and RNA-Seq platforms gave a mean Pearson correlation of 0.838. In FFPE libraries, we detected 53.4% of high-confidence SNVs and 24% of high-confidence fusion transcripts. Sensitivity of fusion transcript detection was not improved by an increase in sequencing depth of up to 3-fold (from ~56 to ~159 million reads). Both NanoString and ScriptSeq RNA-Seq technologies yield reliable gene expression data for degraded and FFPE material. The high degree of correlation between NanoString and RNA-Seq platforms suggests that discovery-based whole-transcriptome studies from FFPE material will produce reliable expression data.
The RiboZeroGold ScriptSeq protocol performed particularly well for lincRNA expression from FFPE libraries, but detection of eSNVs and fusion transcripts was less sensitive. We performed RNA-Seq on RNA from nine matched pairs of fresh-frozen and FFPE tissues from breast cancer patients. The goal was to test the RiboZeroGold ScriptSeq complete low input library preparation kit for degraded RNA samples.
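The mean Pearson correlation over matched fresh-frozen/FFPE pairs reported above could be computed along the following lines. This is only a minimal sketch: the function names, the log2(x + 1) transform, and the toy expression values are assumptions made for illustration, since the description does not say how the expression values were transformed or which software was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(pairs):
    """Mean Pearson r over (fresh_frozen, ffpe) expression vectors.

    Each vector holds per-gene expression values in the same gene order.
    The log2(x + 1) transform is an assumption, not taken from the study.
    """
    correlations = []
    for fresh_frozen, ffpe in pairs:
        log_ff = [math.log2(v + 1) for v in fresh_frozen]
        log_ffpe = [math.log2(v + 1) for v in ffpe]
        correlations.append(pearson(log_ff, log_ffpe))
    return sum(correlations) / len(correlations)


# Hypothetical toy data: two matched pairs, five genes each (not real values).
toy_pairs = [
    ([10, 200, 35, 0, 80], [12, 150, 40, 1, 60]),
    ([5, 90, 300, 20, 7], [3, 110, 250, 15, 9]),
]
mean_r = mean_pairwise_correlation(toy_pairs)
print(f"mean Pearson r over {len(toy_pairs)} pairs: {mean_r:.3f}")
```

Averaging the per-pair coefficients, rather than pooling all genes from all pairs into one vector, keeps each tissue pair equally weighted, which matches the "mean Pearson correlation" wording of the description.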
Project description:Here, we have collapsed multiple analysis problems into two coherent categories, signal detection and signal estimation and adapted linear-optimal solutions from signal processing theory. Our algorithms for detection (DFilter) and estimation (EFilter) extend naturally to integration of multiple datasets. In benchmarking tests, DFilter outperformed assay-specific algorithms at identifying promoters from histone ChIP-seq, binding sites from transcription factor (TF) ChIP-seq and open chromatin regions from DNase- and FAIRE-seq data. EFilter similarly outperformed an existing method for predicting mRNA levels from histone ChIP-seq data (Spearman correlation: 0.81 - 0.89). We performed H3K4me3 and H3K36me3 ChIP-seq on e11.5 mouse forebrain and used DFilter and EFilter to predict promoters and developmental gene expression, uncovering plausible gene targets for SNPs associated with neurodevelopmental disorders.