
Dataset Information


Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data.


ABSTRACT:

Motivation

Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a cornerstone of biomedical research and clinical application. Current annotation tools usually assume that all well-annotated data are acquired at once and cannot expand their knowledge as new data arrive. Such tools are therefore at odds with the continuous emergence of scRNA-seq data, calling for a continual cell-type annotation model. In addition, with their powerful capacity for information integration and their model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology. Systematically combining continual learning with pre-trained language models for cell-type annotation is therefore a natural next step.

Results

We herein propose a universal cell-type annotation tool, called CANAL, that continually fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data as new well-labeled data emerge. CANAL alleviates catastrophic forgetting with respect to both model inputs and outputs. For model inputs, we introduce an experience replay scheme that repeatedly revisits vital past examples during the current training stage, implemented as a dynamic example bank with a fixed buffer size. The example bank is class-balanced and retains cell-type-specific information, which particularly helps consolidate patterns associated with rare cell types. For model outputs, we employ representation knowledge distillation to regularize the divergence between the previous and current models, preserving the knowledge learned in past training stages. Moreover, our universal annotation framework accommodates new cell types throughout the fine-tuning and testing stages: the cell-type annotation library can be expanded continuously by absorbing new cell types from newly arrived, well-annotated training datasets, and novel cells in unlabeled datasets are identified automatically (see the sketches below). Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL.
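
To make the replay mechanism concrete, here is a minimal Python sketch of a class-balanced example bank with a fixed buffer size. The equal-quota selection rule and all names (update_example_bank, the pair layout of examples) are illustrative assumptions, not CANAL's actual implementation; the repository linked under Availability contains the real code.

import random
from collections import defaultdict

def update_example_bank(bank, new_data, buffer_size, rng=random.Random(0)):
    # `bank` and `new_data` are lists of (expression_vector, cell_type) pairs.
    by_type = defaultdict(list)
    for example in bank + new_data:
        by_type[example[1]].append(example)
    # An equal share of the buffer per cell type keeps the bank class-balanced
    # and prevents abundant types from crowding out rare ones.
    quota = max(1, buffer_size // len(by_type))
    new_bank = []
    for examples in by_type.values():
        rng.shuffle(examples)
        new_bank.extend(examples[:quota])
    return new_bank

After each training stage, the bank would be rebuilt from the old bank plus the newly arrived labeled cells, so its size never exceeds the fixed buffer.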
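On the output side, representation knowledge distillation can be sketched as a penalty between the frozen previous model's cell embeddings and the current model's. The encode method and the mean-squared-error penalty below are assumptions for illustration; CANAL's actual regularizer may use a different divergence.

import torch
import torch.nn.functional as F

def representation_distillation_loss(current_model, previous_model, cells, weight=1.0):
    # Embeddings from the frozen model trained in earlier stages.
    with torch.no_grad():
        old_repr = previous_model.encode(cells)
    # Embeddings from the model being fine-tuned on the new dataset.
    new_repr = current_model.encode(cells)
    # Penalizing divergence keeps the representation space stable,
    # preserving knowledge learned in past training stages.
    return weight * F.mse_loss(new_repr, old_repr)

In each fine-tuning stage, a term like this would be added to the classification loss computed on the union of the current dataset and the replayed example bank.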
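Finally, a common way to realize automatic identification of novel cells is a confidence threshold on the classifier's output. The abstract does not state CANAL's criterion, so the rule below is purely hypothetical.

import torch

def flag_novel_cells(logits, threshold=0.5):
    # Cells whose maximum predicted probability over the known cell types
    # falls below the threshold are flagged as candidate novel cell types.
    probs = torch.softmax(logits, dim=-1)
    max_prob, predicted_type = probs.max(dim=-1)
    return predicted_type, max_prob < threshold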

Availability

An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch.

Contact

dengmh@pku.edu.cn.

Supplementary information

Supplementary data are available at Briefings in Bioinformatics online.

SUBMITTER: Wan H 

PROVIDER: S-EPMC10883808 | biostudies-literature | 2024 Jan

REPOSITORIES: biostudies-literature


Publications

Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data.

Wan Hui, Yuan Musu, Fu Yiwei, Deng Minghua

Briefings in Bioinformatics, 2024 Jan, Issue 2



Similar Datasets

| S-EPMC11444397 | biostudies-literature
| S-EPMC5857950 | biostudies-literature
| S-EPMC9528981 | biostudies-literature
| S-EPMC7235421 | biostudies-literature
| S-EPMC10357883 | biostudies-literature
| S-EPMC11924033 | biostudies-literature
| S-EPMC11185591 | biostudies-literature
| S-EPMC3530905 | biostudies-literature
| S-EPMC8602772 | biostudies-literature
| S-EPMC2928502 | biostudies-literature