Project description: An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? Here we present a mathematical framework which reveals that, for estimating many important gene properties, the optimal allocation is to sequence at a depth of around one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plug-in estimator, but one developed via empirical Bayes.
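To make the budget trade-off concrete, the following minimal sketch turns a fixed read budget into a number of cells at a target depth. It is only an illustration of the arithmetic, not the framework itself: the function name `allocate_budget`, the example budget, and the reading of "one read per cell per gene" as an average over `n_genes` genes are all assumptions of this sketch.

```python
def allocate_budget(total_reads, n_genes, reads_per_cell_per_gene=1.0):
    """Split a fixed read budget into cells at a target mean depth.

    Hypothetical helper: treats the target depth as an average over
    n_genes genes, which is an assumption of this sketch.
    """
    if total_reads <= 0 or n_genes <= 0 or reads_per_cell_per_gene <= 0:
        raise ValueError("all inputs must be positive")
    # Reads spent on each cell at the target depth.
    reads_per_cell = n_genes * reads_per_cell_per_gene
    # Number of whole cells the budget can cover at that depth.
    n_cells = int(total_reads // reads_per_cell)
    return {"reads_per_cell": reads_per_cell, "n_cells": n_cells}


# Example with made-up numbers: 400 million reads, 20,000 genes.
plan = allocate_budget(total_reads=400_000_000, n_genes=20_000)
print(plan["reads_per_cell"], plan["n_cells"])
```

Under these assumptions, a deeper target depth (for example `reads_per_cell_per_gene=10`) would cover ten times fewer cells with the same budget, which is the trade-off the description refers to.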
Project description: Background: Direct cDNA preamplification protocols developed for single-cell RNA-seq have enabled transcriptome profiling of precious clinical samples and rare cell populations without the need for sample pooling or RNA extraction. We term the use of single-cell chemistries for sequencing low numbers of cells limiting-cell RNA-seq (lcRNA-seq). Currently, there is no customized algorithm to select robust/low-noise transcripts from lcRNA-seq data for between-group comparisons. Methods: Herein, we present CLEAR, a workflow that identifies reliably quantifiable transcripts in lcRNA-seq data for differentially expressed gene (DEG) analysis. Total RNA obtained from primary chronic lymphocytic leukemia (CLL) CD5+ and CD5- cells was used to develop the CLEAR algorithm. Once established, the performance of CLEAR was evaluated with FACS-sorted cells enriched from mouse dentate gyrus (DG). Results: When using CLEAR transcripts vs. using all transcripts in CLL samples, downstream analyses revealed a higher proportion of shared transcripts across three input amounts and improved principal component analysis (PCA) separation of the two cell types. In mouse DG samples, CLEAR identifies noisy transcripts, and their removal improves PCA separation of the anticipated cell populations. In addition, CLEAR was applied to two publicly available datasets to demonstrate its utility in lcRNA-seq data from other institutions. If imputation is applied to limit the effect of missing data points, CLEAR can also be used in large clinical trials and in single-cell studies. Conclusions: lcRNA-seq coupled with CLEAR is widely used in our institution for profiling immune cells (circulating or tissue-infiltrating) for its transcript preservation characteristics. CLEAR fills an important niche in pre-processing lcRNA-seq data to facilitate transcriptome profiling and DEG analysis.
We demonstrate the utility of CLEAR in analyzing rare cell populations in clinical samples and in the murine neural DG region without sample pooling.
Project description: Single-cell RNA-seq data contain a large proportion of zeros for expressed genes. Such dropout events present a fundamental challenge for various types of data analyses. Here, we describe the SCRABBLE algorithm to address this problem. SCRABBLE leverages bulk data as a constraint and reduces unwanted bias towards expressed genes during imputation. Using both simulation and several types of experimental data, we demonstrate that SCRABBLE outperforms the existing methods in recovering dropout events, capturing the true distribution of gene expression across cells, and preserving gene-gene and cell-cell relationships in the data.
Project description: We have assessed the performance of seven normalization methods for single-cell RNA-seq using data generated from dilution of RNA samples. Our analyses showed that methods considering spike-in External RNA Control Consortium (ERCC) RNA molecules significantly outperformed those not considering ERCCs. This work provides guidance for selecting normalization methods to remove technical noise in single-cell RNA-seq data.
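As a rough illustration of what "considering ERCCs" can mean, the sketch below scales each cell by its total spike-in counts. This is a generic spike-in scaling scheme chosen for illustration and is not claimed to be one of the seven methods assessed; the function names and the toy counts are assumptions.

```python
def spikein_size_factors(ercc_totals):
    """Per-cell size factors from total ERCC spike-in counts.

    Each factor is the cell's spike-in total divided by the mean
    spike-in total across cells (hypothetical, generic scheme).
    """
    if not ercc_totals or any(t <= 0 for t in ercc_totals):
        raise ValueError("every cell needs a positive spike-in total")
    mean_total = sum(ercc_totals) / len(ercc_totals)
    return [t / mean_total for t in ercc_totals]


def normalize(counts, size_factors):
    """Divide each cell's gene counts by that cell's size factor."""
    return [[c / sf for c in cell] for cell, sf in zip(counts, size_factors)]


# Toy data: three cells, two endogenous genes per cell.
ercc_totals = [100, 200, 300]
counts = [[10, 0], [20, 4], [30, 6]]

factors = spikein_size_factors(ercc_totals)
normalized = normalize(counts, factors)
print(factors)
print(normalized)
```

The design assumption is that the same amount of spike-in RNA was added to every cell, so differences in spike-in totals reflect technical rather than biological variation.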
Project description: We report transcriptomes from 430 single glioblastoma cells isolated from 5 individual tumors and 102 single cells from gliomasphere cell lines generated using SMART-seq. In addition, we report population RNA-seq from the five tumors as well as RNA-seq from cell lines derived from 3 tumors (MGH26, MGH28, MGH31) cultured under serum-free (CSC) and differentiated (FCS) conditions. This dataset highlights intratumoral heterogeneity with regard to the expression of de novo derived transcriptional modules and established subtype classifiers. Operative specimens from five glioblastoma patients (MGH26, MGH28, MGH29, MGH30, MGH31) were acutely dissociated, depleted for CD45+ inflammatory cells and then sorted as single cells (576 samples). Population controls for each tumor were isolated by sorting 2,000-10,000 cells and processed in parallel (5 population control samples). Single cells from two established cell lines, GBM6 and GBM8, were also sorted as single cells (192 samples). The SMART-seq protocol was implemented to generate single-cell full-length transcriptomes (modified from Shalek et al., Nature 2013) and sequenced using 25 bp paired-end reads. Single-cell cDNA libraries for MGH30 were resequenced using 100 bp paired-end reads to allow for isoform and splice junction reconstruction (96 samples, annotated MGH30L). Cells were also cultured in serum-free conditions to generate gliomasphere cell lines for MGH26, MGH28, and MGH31 (CSC), which were then differentiated using 10% serum (FCS). Population RNA-seq was performed on these samples (3 CSC, 3 FCS, 6 total). The initial dataset included 875 RNA-seq libraries (576 single glioblastoma cells, 96 resequenced MGH30L, 192 single gliomasphere cells, 5 tumor population controls, 6 population libraries from CSC and FCS samples). Data were processed as described below using RSEM for quantification of gene expression.
The 5,948 genes with the highest composite expression either across all single cells combined (average log2(TPM)>4.5) or within a single tumor (average log2(TPM)>6 in at least one tumor) were included. Cells expressing fewer than 2,000 of these 5,948 genes were excluded. The final processed dataset then included 430 primary single-cell glioblastoma transcriptomes, 102 single-cell transcriptomes from cell lines (GBM6, GBM8), 5 population controls (1 for each tumor), and 6 population libraries from cell lines derived from the tumors (CSC and FCS for MGH26, MGH28 and MGH31). The final matrix (GBM_data_matrix.txt) therefore contains 5,948 rows (genes) quantified in 543 samples (columns). Please note that the samples which are not included in the data processing are indicated in the sample description field.
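The gene and cell filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy matrix, and in particular the detection rule (`detect_cutoff`, here "log2(TPM) above 0 counts as expressed") are hypothetical, since the description does not say how "expressing" was defined.

```python
from statistics import mean


def select_genes(log2_tpm, tumor_of_cell, all_cutoff=4.5, tumor_cutoff=6.0):
    """Keep genes whose average log2(TPM) exceeds all_cutoff across all
    cells, or exceeds tumor_cutoff within at least one tumor."""
    tumors = sorted(set(tumor_of_cell))
    selected = []
    for gene, values in log2_tpm.items():
        if mean(values) > all_cutoff:
            selected.append(gene)
            continue
        for tumor in tumors:
            in_tumor = [v for v, t in zip(values, tumor_of_cell) if t == tumor]
            if mean(in_tumor) > tumor_cutoff:
                selected.append(gene)
                break
    return selected


def keep_cells(log2_tpm, genes, n_cells, min_genes=2000, detect_cutoff=0.0):
    """Return indices of cells expressing at least min_genes of the
    selected genes. detect_cutoff is an assumed detection threshold."""
    kept = []
    for j in range(n_cells):
        n_detected = sum(1 for g in genes if log2_tpm[g][j] > detect_cutoff)
        if n_detected >= min_genes:
            kept.append(j)
    return kept


# Toy matrix: gene -> log2(TPM) per cell; four cells from two tumors.
tumor_of_cell = ["T1", "T1", "T2", "T2"]
log2_tpm = {
    "A": [5.0, 5.0, 5.0, 5.0],  # passes the all-cells cutoff
    "B": [7.0, 7.0, 0.0, 0.0],  # passes only within tumor T1
    "C": [1.0, 1.0, 1.0, 1.0],  # fails both cutoffs
}

genes = select_genes(log2_tpm, tumor_of_cell)
kept = keep_cells(log2_tpm, genes, n_cells=4, min_genes=2)
print(genes, kept)
```

In the toy example `min_genes` is lowered to 2 so the filter has a visible effect; the default of 2,000 matches the threshold stated in the description.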
Project description: RNA-seq of a single Purkinje cell from an 8-month-old male wild-type C57Bl6 mouse. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODE_Data_Use_Policy_for_External_Users_03-07-14.pdf