Project description:Droplet-based single cell transcriptome sequencing (scRNA-seq) technology can measure gene expression in tens of thousands of single cells simultaneously. More recently, coupled with the cutting-edge Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), the droplet-based system has allowed for immunophenotyping of single cells based on cell surface expression of specific proteins together with simultaneous transcriptome profiling in the same cell. In this study, we developed BREM-SC, a novel Bayesian Random Effects Mixture model that jointly clusters paired single cell transcriptomic and proteomic data, which will help researchers jointly study the transcriptome and surface proteins at the single cell level to make new biological discoveries.
Project description:Droplet-based single cell transcriptome sequencing (scRNA-seq) technology, largely represented by the 10× Genomics Chromium system, can measure gene expression in tens of thousands of single cells simultaneously. More recently, coupled with the cutting-edge Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), the droplet-based system has allowed for immunophenotyping of single cells based on cell surface expression of specific proteins together with simultaneous transcriptome profiling in the same cell. Despite the rapid advances in technologies, novel statistical methods and computational tools for analyzing multi-modal CITE-seq data are lacking. In this study, we developed BREM-SC, a novel Bayesian Random Effects Mixture model that jointly clusters paired single cell transcriptomic and proteomic data. Through simulation studies and analysis of public and in-house real data sets, we demonstrated the validity and advantages of this method in fully utilizing both types of data to accurately identify cell clusters. In addition, as a probabilistic model-based approach, BREM-SC is able to quantify the clustering uncertainty for each single cell. This new method will help researchers jointly study the transcriptome and surface proteins at the single cell level to make new biological discoveries, particularly in the area of immunology.
Project description:Abstract: The recently developed droplet-based single cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a BAyesian Mixture Model for Single Cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Results from applications of BAMM-SC to in-house experimental scRNA-seq datasets using blood and lung cells from humans or mice demonstrate that BAMM-SC outperformed existing clustering methods with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals. Data purpose: To evaluate the performance of BAMM-SC for clustering droplet-based scRNA-seq data in a population-based study, we performed single cell RNA-seq on peripheral blood mononuclear cells (PBMC) isolated from whole blood obtained from 4 healthy donors, and on lung cells isolated from Streptococcus pneumoniae (SP)-infected and naïve mice.
Project description:Histone modifications are a key epigenetic mechanism to activate or repress the expression of genes. Data sets of matched microarray expression data and histone modification data measured by ChIP-seq exist, but methods for integrative analysis of both data types are still rare. Here, we present a novel bioinformatic approach to detect genes that are differentially expressed between two conditions putatively caused by alterations in histone modification. We introduce a correlation measure for integrative analysis of ChIP-seq and gene expression data and demonstrate that a proper normalization of the ChIP-seq data is crucial. We suggest applying Bayesian mixture models of different distributions to further study the distribution of the correlation measure. The implicit classification of the mixture models is used to detect genes with differences between two conditions in both gene expression and histone modification. The method is applied to different data sets and its superiority to a naive separate analysis of both data types is demonstrated. This GEO series contains the expression data of the Cebpa example data set.
Project description:Histone modifications are a key epigenetic mechanism to activate or repress the expression of genes. Data sets of matched microarray expression data and histone modification data measured by ChIP-seq exist, but methods for integrative analysis of both data types are still rare. Here, we present a novel bioinformatic approach to detect genes that are differentially expressed between two conditions putatively caused by alterations in histone modification. We introduce a correlation measure for integrative analysis of ChIP-seq and gene expression data and demonstrate that a proper normalization of the ChIP-seq data is crucial. We suggest applying Bayesian mixture models of different distributions to further study the distribution of the correlation measure. The implicit classification of the mixture models is used to detect genes with differences between two conditions in both gene expression and histone modification. The method is applied to different data sets and its superiority to a naive separate analysis of both data types is demonstrated. This GEO series contains the expression data of the Cebpa example data set. This data set was derived from sorted Cebpafl/fl and Cebpafl/fl;Mx1Cre murine hematopoietic LSKCD150- 18 post pIpC injections (conditional deletion of Cebpa). The specimens from three Cebpafl/fl and three Cebpafl/fl;Mx1Cre mice were hybridized separately on six Affymetrix Mouse Gene 1.0 ST arrays. Associated histone modification ChIP-seq data is provided by series GSE43007.
Project description:Background: Cluster analysis is an integral part of precision medicine and systems biology, used to define groups of patients or biomolecules. Consensus clustering is an ensemble approach that is widely used in these areas, which combines the output from multiple runs of a non-deterministic clustering algorithm. Here we consider the application of consensus clustering to a broad class of heuristic clustering algorithms that can be derived from Bayesian mixture models (and extensions thereof) by adopting an early stopping criterion when performing sampling-based inference for these models. While the resulting approach is non-Bayesian, it inherits the usual benefits of consensus clustering, particularly in terms of computational scalability and providing assessments of clustering stability/robustness. Results: In simulation studies, we show that our approach can successfully uncover the target clustering structure, while also exploring different plausible clusterings of the data. We show that, when a parallel computation environment is available, our approach offers significant reductions in runtime compared to performing sampling-based Bayesian inference for the underlying model, while retaining many of the practical benefits of the Bayesian approach, such as exploring different numbers of clusters. We propose a heuristic to decide upon ensemble size and the early stopping criterion, and then apply consensus clustering to a clustering algorithm derived from a Bayesian integrative clustering method. We use the resulting approach to perform an integrative analysis of three 'omics datasets for budding yeast and find clusters of co-expressed genes with shared regulatory proteins. We validate these clusters using data external to the analysis. Conclusions: Our approach can be used as a wrapper for essentially any existing sampling-based Bayesian clustering implementation, and enables meaningful clustering analyses to be performed using such implementations, even when computational Bayesian inference is not feasible, e.g. due to poor exploration of the target density (often as a result of increasing numbers of features) or a limited computational budget that does not allow sufficient samples to be drawn from a single chain. This enables researchers to straightforwardly extend the applicability of existing software to much larger datasets, including implementations of sophisticated models such as those that jointly model multiple datasets.
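As a rough illustration of the consensus step described in this record (combining the output of multiple runs of a non-deterministic clustering algorithm), the sketch below builds a consensus matrix: the fraction of runs in which each pair of items lands in the same cluster. The function name `consensus_matrix` and the toy label vectors are hypothetical; this is a generic sketch, not the authors' implementation, and it leaves out the early stopping of the underlying sampler.

```python
def consensus_matrix(label_runs):
    """Return the fraction of runs in which items i and j share a cluster.

    label_runs: list of label lists, one per clustering run, all of equal length.
    Cluster labels only need to be consistent within a run, not across runs.
    """
    n_runs = len(label_runs)
    n_items = len(label_runs[0])
    together = [[0] * n_items for _ in range(n_items)]
    for labels in label_runs:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Convert co-clustering counts to proportions.
    return [[count / n_runs for count in row] for row in together]


# Three hypothetical runs over four items; label names differ between runs.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
cm = consensus_matrix(runs)
```

Entries near 1 or 0 indicate pairs that are stably grouped or stably separated across runs; intermediate values flag pairs whose assignment is unstable.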
Project description:Spatial transcriptomics has been emerging as a powerful technique for resolving gene expression profiles while retaining tissue spatial information. These spatially resolved transcriptomics make it feasible to examine the complex multicellular systems of different microenvironments. To answer scientific questions with spatial transcriptomics and expand our understanding of how cell types and states are regulated by microenvironment, the first step is to identify cell clusters by integrating the available spatial information. Here, we introduce SC-MEB, an empirical Bayes approach for spatial clustering analysis using a hidden Markov random field. We have also derived an efficient expectation-maximization algorithm based on an iterative conditional mode for SC-MEB. In contrast to BayesSpace, a recently developed method, SC-MEB is not only computationally efficient and scalable to large sample sizes but is also capable of choosing the smoothness parameter and the number of clusters. We performed comprehensive simulation studies to demonstrate the superiority of SC-MEB over some existing methods. We applied SC-MEB to analyze the spatial transcriptome of human dorsolateral prefrontal cortex tissues and mouse hypothalamic preoptic region. Our analysis results showed that SC-MEB can achieve clustering performance similar to or better than that of BayesSpace, which uses the true number of clusters and a fixed smoothness parameter. Moreover, SC-MEB is scalable to large sample sizes. We then employed SC-MEB to analyze a colon dataset from a patient with colorectal cancer (CRC) and COVID-19, and further performed differential expression analysis to identify signature genes related to the clustering results. The heatmap of identified signature genes showed that the clusters identified using SC-MEB were more separable than those obtained with BayesSpace. Using pathway analysis, we identified three immune-related clusters, and in a further comparison, found that the mean expression of COVID-19 signature genes was greater in immune than non-immune regions of colon tissue. SC-MEB provides a valuable computational tool for investigating the structural organizations of tissues from spatial transcriptomic data.
Project description:Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Results: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods. Availability and implementation: DIMM-SC has been implemented in a user-friendly R package with a detailed tutorial available on www.pitt.edu/~wec47/singlecell.html. Contact: wei.chen@chp.edu or hum@ccf.org. Supplementary information: Supplementary data are available at Bioinformatics online.
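To make the idea of per-cell clustering uncertainty in this record more concrete, here is a minimal sketch under stated assumptions: each cell's UMI count vector is treated as multinomial with a cluster-specific Dirichlet prior (a standard Dirichlet-multinomial construction), and the cluster parameters and mixing weights are taken as already known. The function names and toy numbers are hypothetical, and this is not the DIMM-SC package's code or API.

```python
from math import exp, lgamma, log


def dirichlet_multinomial_loglik(counts, alpha):
    """Log marginal likelihood of one count vector under Dirichlet(alpha).

    The multinomial coefficient is omitted: it depends only on the counts,
    so it cancels when clusters are compared for the same cell.
    """
    total_alpha = sum(alpha)
    total_count = sum(counts)
    loglik = lgamma(total_alpha) - lgamma(total_alpha + total_count)
    for x, a in zip(counts, alpha):
        loglik += lgamma(a + x) - lgamma(a)
    return loglik


def cluster_posterior(counts, alphas, weights):
    """Posterior probability of each cluster for one cell."""
    scores = [
        log(w) + dirichlet_multinomial_loglik(counts, a)
        for a, w in zip(alphas, weights)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalised = [exp(s - top) for s in scores]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]


# Toy example: three genes, two clusters with assumed Dirichlet parameters.
alphas = [[10.0, 1.0, 1.0], [1.0, 1.0, 10.0]]
weights = [0.5, 0.5]
posterior = cluster_posterior([9, 1, 0], alphas, weights)
```

The resulting probabilities sum to one across clusters; a cell whose largest probability is well below 1 is one whose cluster assignment is uncertain.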
Project description:Piecewise growth mixture models (PGMMs) are a flexible and useful class of methods for analyzing segmented trends in individual growth trajectory over time, where the individuals come from a mixture of two or more latent classes. These models allow each segment of the overall developmental process within each class to have a different functional form; examples include two linear phases of growth, or a quadratic phase followed by a linear phase. The changepoint (knot) is the time of transition from one developmental phase (segment) to another. Inferring the location of the changepoint(s) is often of practical interest, along with inference for other model parameters. A random changepoint allows for individual differences in the transition time within each class. The primary objectives of our study are as follows: (1) to develop a PGMM using a Bayesian inference approach that allows the estimation of multiple random changepoints within each class; (2) to develop a procedure to empirically detect the number of random changepoints within each class; and (3) to empirically investigate the bias and precision of the estimation of the model parameters, including the random changepoints, via a simulation study. We have developed the user-friendly package BayesianPGMM for R to facilitate the adoption of this methodology in practice, which is available at https://github.com/lockEF/BayesianPGMM. We describe an application to mouse-tracking data for a visual recognition task.
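The "two linear phases of growth" example in this record can be written as a small mean function with a single changepoint. The sketch below is only illustrative: the parameter names (`intercept`, `slope_before`, `slope_after`, `changepoint`) are hypothetical, and it shows one individual's trajectory rather than the mixture over latent classes or the Bayesian estimation of random changepoints.

```python
def two_phase_linear(t, intercept, slope_before, slope_after, changepoint):
    """Mean trajectory with two linear phases joined at a changepoint (knot).

    Before the changepoint the curve grows at slope_before; afterwards it
    continues from the same value at slope_after, so it stays continuous.
    """
    time_before = min(t, changepoint)
    time_after = max(t - changepoint, 0.0)
    return intercept + slope_before * time_before + slope_after * time_after


# Hypothetical individual: fast growth until t = 4, slower growth afterwards.
trajectory = [two_phase_linear(t, 1.0, 2.0, 0.5, 4.0) for t in range(7)]
```

In a random-changepoint model, `changepoint` would vary between individuals within a class instead of being fixed.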