Dataset Information

Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge.

ABSTRACT: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge. We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally, we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells. Source code is available at https://github.com/yjzhang/uncurl_python. Supplementary data are available at Bioinformatics online.

SUBMITTER: Mukherjee S

PROVIDER: S-EPMC6022691 | biostudies-other | 2018 Jul

REPOSITORIES: biostudies-other

Publications

Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge.

Sumit Mukherjee, Yue Zhang, Joshua Fan, Georg Seelig, Sreeram Kannan

Bioinformatics (Oxford, England), 2018-07-01, issue 13

PMID: 29949988

Similar Datasets

Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. | S-EPMC7822509 | biostudies-literature

Ultra-fast scalable estimation of single-cell differentiation potency from scRNA-Seq data. | S-EPMC8275983 | biostudies-literature

RCA2: a scalable supervised clustering algorithm that reduces batch effects in scRNA-seq data. | S-EPMC8344557 | biostudies-literature

Optimization of miRNA-seq data preprocessing. | S-EPMC4652620 | biostudies-literature

Cerebro: interactive visualization of scRNA-seq data. | S-EPMC7141853 | biostudies-literature

An Experiment on Ab Initio Discovery of Biological Knowledge from scRNA-Seq Data Using Machine Learning. | S-EPMC7660369 | biostudies-literature

Contrastive self-supervised clustering of scRNA-seq data. | S-EPMC8157426 | biostudies-literature

FastqPuri: high-performance preprocessing of RNA-seq data. | S-EPMC6500068 | biostudies-literature

dropClust: efficient clustering of ultra-large scRNA-seq data. | S-EPMC5888655 | biostudies-literature

Iterative point set registration for aligning scRNA-seq data. | S-EPMC7647120 | biostudies-literature