Browse
Submit Data
Databases
API
Help

Dataset Information

20 Views

0 Connections

0 Citations

0 Reanalyses

0 Downloads

Omics score: 0

Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing.

ABSTRACT: Tremendous advances in next-generation sequencing technology have enabled the accumulation of large amounts of omics data in various research areas over the past decade. However, study limitations due to small sample sizes, especially in rare disease clinical research, technological heterogeneity and batch effects limit the applicability of traditional statistics and machine learning analysis. Here, we present a meta-transfer learning approach to transfer knowledge from big data and reduce the search space in data with small sample sizes. Few-shot learning algorithms integrate meta-learning to overcome data scarcity and data heterogeneity by transferring molecular pattern recognition models from datasets of unrelated domains. We explore few-shot learning models with large scale public dataset, TCGA (The Cancer Genome Atlas) and GTEx dataset, and demonstrate their potential as pre-training dataset in other molecular pattern recognition tasks. Our results show that meta-transfer learning is very effective for datasets with a limited sample size. Furthermore, we show that our approach can transfer knowledge across technological heterogeneity, for example, from bulk cell to single-cell data. Our approach can overcome study size constraints, batch effects and technical limitations in analyzing single-cell data by leveraging existing bulk-cell sequencing data.

SUBMITTER: Park Y

PROVIDER: S-EPMC8598306 | biostudies-literature |

REPOSITORIES: biostudies-literature

ACCESS DATA

Json Xml

Similar Datasets

scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data.

Project description:Single-cell RNA-sequencing (scRNA-seq) has been widely used for disease studies, where sample batches are collected from donors under different conditions including demographical groups, disease stages, and drug treatments. It is worth noting that the differences among sample batches in such a study are a mixture of technical confounders caused by batch effect and the biological variations caused by condition effect. However, current batch effect removal methods often eliminate both technical batch effects and meaningful condition effects, while perturbation prediction methods solely focus on condition effects, resulting in inaccurate gene expression predictions due to unaccounted batch effects. Here we introduce scDisInFact, a deep learning framework that models both batch effect and condition effect in scRNA-seq data. scDisInFact learns latent factors that disentangle condition effects from batch effects, enabling it to simultaneously perform three tasks: batch effect removal, condition-associated key gene detection, and perturbation prediction. We evaluated scDisInFact on both simulated and real datasets, and compared its performance to baseline methods for each task. Our results demonstrate that scDisInFact outperforms existing methods that focus on individual tasks, providing a more comprehensive and accurate approach for integrating and predicting multi-batch multi-condition single-cell RNA-sequencing data.

| S-EPMC10187255 | biostudies-literature

scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data.

Project description:Single-cell RNA-sequencing (scRNA-seq) has been widely used for disease studies, where sample batches are collected from donors under different conditions including demographic groups, disease stages, and drug treatments. It is worth noting that the differences among sample batches in such a study are a mixture of technical confounders caused by batch effect and biological variations caused by condition effect. However, current batch effect removal methods often eliminate both technical batch effect and meaningful condition effect, while perturbation prediction methods solely focus on condition effect, resulting in inaccurate gene expression predictions due to unaccounted batch effect. Here we introduce scDisInFact, a deep learning framework that models both batch effect and condition effect in scRNA-seq data. scDisInFact learns latent factors that disentangle condition effect from batch effect, enabling it to simultaneously perform three tasks: batch effect removal, condition-associated key gene detection, and perturbation prediction. We evaluate scDisInFact on both simulated and real datasets, and compare its performance with baseline methods for each task. Our results demonstrate that scDisInFact outperforms existing methods that focus on individual tasks, providing a more comprehensive and accurate approach for integrating and predicting multi-batch multi-condition single-cell RNA-sequencing data.

| S-EPMC10827746 | biostudies-literature

BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes.

Project description:To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments. We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.

| S-EPMC6691531 | biostudies-literature

Transfer learning for versatile plant disease recognition with limited data.

Project description:Deep learning has witnessed a significant improvement in recent years to recognize plant diseases by observing their corresponding images. To have a decent performance, current deep learning models tend to require a large-scale dataset. However, collecting a dataset is expensive and time-consuming. Hence, the limited data is one of the main challenges to getting the desired recognition accuracy. Although transfer learning is heavily discussed and verified as an effective and efficient method to mitigate the challenge, most proposed methods focus on one or two specific datasets. In this paper, we propose a novel transfer learning strategy to have a high performance for versatile plant disease recognition, on multiple plant disease datasets. Our transfer learning strategy differs from the current popular one due to the following factors. First, PlantCLEF2022, a large-scale dataset related to plants with 2,885,052 images and 80,000 classes, is utilized to pre-train a model. Second, we adopt a vision transformer (ViT) model, instead of a convolution neural network. Third, the ViT model undergoes transfer learning twice to save computations. Fourth, the model is first pre-trained in ImageNet with a self-supervised loss function and with a supervised loss function in PlantCLEF2022. We apply our method to 12 plant disease datasets and the experimental results suggest that our method surpasses the popular one by a clear margin for different dataset settings. Specifically, our proposed method achieves a mean testing accuracy of 86.29over the 12 datasets in a 20-shot case, 12.76 higher than the current state-of-the-art method's accuracy of 73.53. Furthermore, our method outperforms other methods in one plant growth stage prediction and the one weed recognition dataset. To encourage the community and related applications, we have made public our codes and pre-trained model.

| S-EPMC9726777 | biostudies-literature

Predicting reaction conditions from limited data through active transfer learning.

Project description:Transfer and active learning have the potential to accelerate the development of new chemical reactions, using prior data and new experiments to inform models that adapt to the target area of interest. This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model. First, model transfer is shown to be effective when reaction mechanisms and substrates are closely related, even when models are trained on relatively small numbers of data points. Then, a model simplification scheme is tested and found to provide comparative predictivity on reactions of new nucleophiles that include unseen reagent combinations. Lastly, for a challenging target where model transfer only provides a modest benefit over random selection, an active transfer learning strategy is introduced to improve model predictions. Simple models, composed of a small number of decision trees with limited depths, are crucial for securing generalizability, interpretability, and performance of active transfer learning.

| S-EPMC9172577 | biostudies-literature

Batch correction of single-cell sequencing data via an autoencoder architecture.

Project description:MotivationTechnical differences between gene expression sequencing experiments can cause variations in the data in the form of batch effect biases. These do not represent true biological variations between samples and can lead to false conclusions or hinder the ability to integrate multiple datasets. Since there is a growing need for the joint analysis of single-cell sequencing datasets from different sources, there is also a need to correct the resulting batch effects while maintaining the true biological variations in the data.ResultsWe developed a semi-supervised deep learning architecture called Autoencoder-based Batch Correction (ABC) for integrating single-cell sequencing datasets. Our method removes batch effects through a guided process of data compression using supervised cell type classifier branches for biological signal retention. It aligns the different batches using an adversarial training approach. We comprehensively evaluate the performance of our method using four single-cell sequencing datasets and multiple measures for batch effect removal and biological variation conservation. ABC outperforms 10 state-of-the-art methods for this task including Seurat, scGen, ComBat, scanorama, scVI, scANVI, AutoClass, Harmony, scDREAMER, and CLEAR, correcting various types of batch effects while preserving intricate biological variations.

| S-EPMC10781938 | biostudies-literature

Single-cell RNA sequencing of batch Chlamydomonas cultures reveals heterogeneity in their diurnal cycle phase.

Project description:The photosynthetic unicellular alga Chlamydomonas (Chlamydomonas reinhardtii) is a versatile reference for algal biology because of its ease of culture in the laboratory. Genomic and systems biology approaches have previously described transcriptome responses to environmental changes using bulk data, thus representing the average behavior from pools of cells. Here, we apply single-cell RNA sequencing (scRNA-seq) to probe the heterogeneity of Chlamydomonas cell populations under three environments and in two genotypes differing by the presence of a cell wall. First, we determined that RNA can be extracted from single algal cells with or without a cell wall, offering the possibility to sample natural algal communities. Second, scRNA-seq successfully separated single cells into nonoverlapping cell clusters according to their growth conditions. Cells exposed to iron or nitrogen deficiency were easily distinguished despite a shared tendency to arrest photosynthesis and cell division to economize resources. Notably, these groups of cells not only recapitulated known patterns observed with bulk RNA-seq but also revealed their inherent heterogeneity. A substantial source of variation between cells originated from their endogenous diurnal phase, although cultures were grown in constant light. We exploited this result to show that circadian iron responses may be conserved from algae to land plants. We document experimentally that bulk RNA-seq data represent an average of typically hidden heterogeneity in the population.

| S-EPMC8226295 | biostudies-literature

Fast intratumor heterogeneity inference from single-cell sequencing data.

Project description:We introduce HUNTRESS, a computational method for mutational intratumor heterogeneity inference from noisy genotype matrices derived from single-cell sequencing data, the running time of which is linear with the number of cells and quadratic with the number of mutations. We prove that, under reasonable conditions, HUNTRESS computes the true progression history of a tumor with high probability. On simulated and real tumor sequencing data, HUNTRESS is demonstrated to be faster than available alternatives with comparable or better accuracy. Additionally, the progression histories of tumors inferred by HUNTRESS on real single-cell sequencing datasets agree with the best known evolution scenarios for the associated tumors.

| S-EPMC10765963 | biostudies-literature

A benchmark of batch-effect correction methods for single-cell RNA sequencing data.

Project description:BACKGROUND:Large-scale single-cell transcriptomic datasets generated using different technologies contain batch-specific systematic variations that present a challenge to batch-effect removal and data integration. With continued growth expected in scRNA-seq data, achieving effective batch integration with available computational resources is crucial. Here, we perform an in-depth benchmark study on available batch correction methods to determine the most suitable method for batch-effect removal. RESULTS:We compare 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity. Five scenarios are designed for the study: identical cell types with different technologies, non-identical cell types, multiple batches, big data, and simulated data. Performance is evaluated using four benchmarking metrics including kBET, LISI, ASW, and ARI. We also investigate the use of batch-corrected data to study differential gene expression. CONCLUSION:Based on our results, Harmony, LIGER, and Seurat 3 are the recommended methods for batch integration. Due to its significantly shorter runtime, Harmony is recommended as the first method to try, with the other methods as viable alternatives.

| S-EPMC6964114 | biostudies-literature

Data denoising with transfer learning in single-cell transcriptomics.

Project description:Single-cell RNA sequencing (scRNA-seq) data are noisy and sparse. Here, we show that transfer learning across datasets remarkably improves data quality. By coupling a deep autoencoder with a Bayesian model, SAVER-X extracts transferable gene-gene relationships across data from different labs, varying conditions and divergent species, to denoise new target datasets.

| S-EPMC7781045 | biostudies-literature

OmicsDI is part of the ELIXIR infrastructure

OmicsDI is an Elixir interoperability service. Learn more ›

Tweets

OmicsDI Databases

PRIDE
PeptideAtlas
MassIVE
JPOST Repository
Physiome Model Repository

EGA
EVA
ENA
LINCS
PAXDB
Cell Collective

MetaboLights
Metabolomics Workbench
MetabolomeExpress
GNPS
BioModels
FAIRDOMHub

ArrayExpress
dbGaP
ExpressionAtlas
GEO
NODE

Information

Databases
Help
API
Contact us
Code on GitHub
Terms of use
Submit Data