Dataset Information

Comparison and evaluation of statistical error models for scRNA-seq.

ABSTRACT:

Background

Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate.

Results

Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that a Poisson error model appears appropriate for sparse datasets, but we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation.
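The contrast between the two error models can be illustrated with a small moment-based check. This is only a sketch and not the estimation procedure used in the study: it assumes the common negative binomial parameterization in which the variance is mu + mu^2 / theta (a Poisson model corresponds to theta going to infinity), and the function name `overdispersion_summary` and the toy counts are hypothetical.

```python
from statistics import mean, variance


def overdispersion_summary(counts):
    """Summarize one gene's counts across cells (illustrative helper).

    Under a Poisson model the variance equals the mean. Under a negative
    binomial model (assumed parameterization) the variance is
    mu + mu**2 / theta, so a method-of-moments estimate of theta is
    mu**2 / (variance - mu) whenever the variance exceeds the mean.
    """
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    excess = var - mu
    # No excess variance: the data give no moment-based evidence against Poisson.
    theta = mu * mu / excess if excess > 0 else float("inf")
    return {"mean": mu, "variance": var, "theta": theta}


# Toy counts, invented for illustration only.
deep_gene = [0, 1, 0, 3, 8, 0, 2, 10, 0, 6]    # variance well above the mean
sparse_gene = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # variance close to the mean

deep_summary = overdispersion_summary(deep_gene)
sparse_summary = overdispersion_summary(sparse_gene)
```

A small finite `theta` indicates strong overdispersion, while an infinite `theta` means the moments are compatible with a Poisson model. Moment estimates of this kind are noisy for single genes, which is one reason likelihood-based, data-driven estimation is generally preferred in practice.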

Conclusions

Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.

SUBMITTER: Choudhary S

PROVIDER: S-EPMC8764781 | biostudies-literature | 2022 Jan

REPOSITORIES: biostudies-literature

Publications

Comparison and evaluation of statistical error models for scRNA-seq.

Saket Choudhary, Rahul Satija

Genome Biology, 2022 Jan 18, issue 1

PMID: 35042561

Similar Datasets

Project description: Lung adenocarcinoma (LUAD) is one of the main causes of death in lung cancer patients. This study used single-cell RNA-seq analysis to identify tumor stemness-related prognostic models that predict prognosis, response to chemotherapy agents, and immunotherapy efficacy in LUAD. The mRNA expression-based stemness index (mRNAsi) was determined by one-class logistic regression (OCLR). Differentially expressed genes (DEGs) were detected with the limma package. Single-cell RNA-seq analysis of the GSE123902 dataset was performed using the Seurat package. Weighted co-expression network analysis (WGCNA) was built with the rms package. Cell differentiation ability was determined by CytoTRACE. Cell communication analysis was performed with the CellCall and CellChat packages. Prognostic models were constructed from 10 machine learning algorithms in 101 combinations. Drug response prediction was conducted with the pRRophetic package. The immune microenvironment landscape was determined by ESTIMATE, MCP-Counter, and ssGSEA analysis. Tumor samples had higher mRNAsi, and the high-mRNAsi group had a worse prognosis. The turquoise module was highly correlated with mRNAsi in the TCGA-LUAD dataset. scRNA-seq analysis identified 22 epithelial cell clusters; malignant epithelial cells with high cancer stem cell (CSC) characteristics had more complex cellular communication with other cells and showed dedifferentiation. Cellular senescence and the Hippo signaling pathway were the major pathways that differed between high- and low-CSC malignant epithelial cells. Pseudo-temporal analysis showed that clusters 1 and 2, the high-CSC epithelial cells, were concentrated at the end of the differentiation trajectory. Finally, 13 genes were obtained by intersecting the genes in the turquoise module, the top 200 genes in hdWGCNA, the DEGs between the high- and low-mRNAsi groups, and the DEGs between tumor and normal samples. Among the 101 prognostic models, the CoxBoost + RSF model had the highest average c-index (0.71). The high-risk group samples had an immunosuppressive status, higher tumor malignancy, and low benefit from immunotherapy. This work found that malignant tumors and malignant epithelial cells have high CSC characteristics, and identified a model based on CSC-related genes that could predict the prognosis, immune microenvironment, and immunotherapy response of LUAD. These results provide reference value for the clinical diagnosis and treatment of LUAD.

Project description: Background: Breast cancer (BC) is the most common malignancy in women with high heterogeneity. The heterogeneity of cancer cells from different BC subtypes has not been thoroughly characterized and there is still no valid biomarker for predicting the prognosis of BC patients in clinical practice. Methods: Cancer cells were identified by calculating single cell copy number variation using the inferCNV algorithm. SCENIC was utilized to infer gene regulatory networks. CellPhoneDB software was used to analyze the intercellular communications in different cell types. Survival analysis, univariate Cox, least absolute shrinkage and selection operator (LASSO) regression and multivariate Cox analysis were used to construct subtype specific prognostic models. Results: Triple-negative breast cancer (TNBC) has a higher proportion of cancer cells than subtypes of HER2+ BC and luminal BC, and the specifically upregulated genes of the TNBC subtype are associated with antioxidant and chemical stress resistance. Key transcription factors (TFs) of tumor cells for three subtypes varied, and most of the TF-target genes are specifically upregulated in corresponding BC subtypes. The intercellular communications mediated by different receptor-ligand pairs lead to an inflammatory response with different degrees in the three BC subtypes. We establish a prognostic model containing 10 genes (risk genes: ATP6AP1, RNF139, BASP1, ESR1 and TSKU; protective genes: RPL31, PAK1, STARD10, TFPI2 and SIAH2) for luminal BC, seven genes (risk genes: ACTR6 and C2orf76; protective genes: DIO2, DCXR, NDUFA8, SULT1A2 and AQP3) for HER2+ BC, and seven genes (risk genes: HPGD, CDC42 and PGK1; protective genes: SMYD3, LMO4, FABP7 and PRKRA) for TNBC. Three prognostic models can distinguish high-risk patients from low-risk patients and accurately predict patient prognosis. Conclusions: Comparative analysis of the three BC subtypes based on cancer cell heterogeneity in this study will be of great clinical significance for the diagnosis, prognosis and targeted therapy for BC patients.
