Project description:Single-cell RNA-seq data contain a large proportion of zeros for expressed genes. Such dropout events present a fundamental challenge for various types of data analyses. Here, we describe the SCRABBLE algorithm to address this problem. SCRABBLE leverages bulk data as a constraint and reduces unwanted bias towards expressed genes during imputation. Using both simulation and several types of experimental data, we demonstrate that SCRABBLE outperforms the existing methods in recovering dropout events, capturing true distribution of gene expression across cells, and preserving gene-gene relationship and cell-cell relationship in the data.
Project description:Acute myeloid leukemia (AML) is a heterogeneous disease that resides within a complex microenvironment, complicating efforts to understand how different cell types contribute to disease progression. We combined single-cell RNA sequencing and genotyping to profile 38,410 cells from 40 bone marrow aspirates, including 16 AML patients and five healthy donors. We then applied a machine learning classifier to distinguish a spectrum of malignant cell types whose abundances varied between patients and between subclones in the same tumor. Cell type compositions correlated with prototypic genetic lesions, including an association of FLT3-ITD with abundant progenitor-like cells. Primitive AML cells exhibited dysregulated transcriptional programs with co-expression of stemness and myeloid priming genes and had prognostic significance. Differentiated monocyte-like AML cells expressed diverse immunomodulatory genes and suppressed T cell activity in vitro. In conclusion, we provide single-cell technologies and an atlas of AML cell states, regulators, and markers with implications for precision medicine and immune therapies. VIDEO ABSTRACT.
Project description:Tumors are comprised of subpopulations of cancer cells that harbor distinct genetic profiles and phenotypes that evolve over time and during treatment. By reconstructing the course of cancer evolution, we can understand the acquisition of the malignant properties that drive tumor progression. Unfortunately, recovering the evolutionary relationships of individual cancer cells linked to their phenotypes remains a difficult challenge. To address this need, we have developed PhylinSic, a method that reconstructs the phylogenetic relationships among cells linked to their gene expression profiles from single cell RNA-sequencing (scRNA-Seq) data. This method calls nucleotide bases using a probabilistic smoothing approach and then estimates a phylogenetic tree using a Bayesian modeling algorithm. We showed that PhylinSic identified evolutionary relationships underpinning drug selection and metastasis and was sensitive enough to identify subclones from genetic drift. We found that breast cancer tumors resistant to chemotherapies harbored multiple genetic lineages that independently acquired high K-Ras and β-catenin, suggesting that therapeutic strategies may need to control multiple lineages to be durable. These results demonstrated that PhylinSic can reconstruct evolution and link the genotypes and phenotypes of cells across monophyletic tumors using scRNA-Seq.
Project description:Estimating cell type compositions for complex diseases is an important step to investigate the cellular heterogeneity for understanding disease etiology and potentially facilitate early disease diagnosis and prevention. Here, we developed a computationally statistical method, referring to Multi-Omics Matrix Factorization (MOMF), to estimate the cell-type compositions of bulk RNA sequencing (RNA-seq) data by leveraging cell type-specific gene expression levels from single-cell RNA sequencing (scRNA-seq) data. MOMF not only directly models the count nature of gene expression data, but also effectively accounts for the uncertainty of cell type-specific mean gene expression levels. We demonstrate the benefits of MOMF through three real data applications, i.e., Glioblastomas (GBM), colorectal cancer (CRC) and type II diabetes (T2D) studies. MOMF is able to accurately estimate disease-related cell type proportions, i.e., oligodendrocyte progenitor cells and macrophage cells, which are strongly associated with the survival of GBM and CRC, respectively.
Project description:Determination of haplotypes is important for modelling the phenotypic consequences of genetic variation in diploid organisms, including cis-regulatory control and compound heterozygosity. We realized that single-cell RNA-seq (scRNA-seq) data are well suited for phasing genetic variants, since both transcriptional bursts and technical bottlenecks cause pronounced allelic fluctuations in individual single cells. Here we present scphaser, an R package that phases alleles at heterozygous variants to reconstruct haplotypes within transcribed regions of the genome using scRNA-seq data. The devised method efficiently and accurately reconstructed the known haplotype for??93% of phasable genes in both human and mouse. It also enables phasing of rare and de novo variants and variants far apart within genes, which is hard to attain with population-based computational inference.scphaser is implemented as an R package. Tutorial and code are available at https://github.com/edsgard/scphaserrickard.sandberg@ki.seSupplementary data are available at Bioinformatics online.
Project description:Single-cell RNA-seq enables the quantitative characterization of cell types based on global transcriptome profiles. We present single-cell consensus clustering (SC3), a user-friendly tool for unsupervised clustering, which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach (http://bioconductor.org/packages/SC3). We demonstrate that SC3 is capable of identifying subclones from the transcriptomes of neoplastic cells collected from patients.
Project description:UnlabelledAnalysis of the composition of heterogeneous tissue has been greatly enabled by recent developments in single-cell transcriptomics. We present SCell, an integrated software tool for quality filtering, normalization, feature selection, iterative dimensionality reduction, clustering and the estimation of gene-expression gradients from large ensembles of single-cell RNA-seq datasets. SCell is open source, and implemented with an intuitive graphical interface. Scripts and protocols for the high-throughput pre-processing of large ensembles of single-cell, RNA-seq datasets are provided as an additional resource.Availability and implementationBinary executables for Windows, MacOS and Linux are available at http://sourceforge.net/projects/scell, source code and pre-processing scripts are available from https://github.com/diazlab/SCellSupplementary information: Supplementary data are available at Bioinformatics online.Contactaaron.diaz@ucsf.edu.
Project description:The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.
Project description:A key challenge in analyzing single cell RNA-sequencing data is the large number of false zeros, where genes actually expressed in a given cell are incorrectly measured as unexpressed. We present a method based on low-rank matrix approximation which imputes these values while preserving biologically non-expressed genes (true biological zeros) at zero expression levels. We provide theoretical justification for this denoising approach and demonstrate its advantages relative to other methods on simulated and biological datasets.
Project description:Dimensionality reduction is an important first step in the analysis of single-cell RNA-sequencing (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analyses methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and laboratories. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch rather than cell-type specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell-type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different data sets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.