The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.
ABSTRACT: Computationally annotating proteins with a molecular function is a difficult problem, made even harder by the limited amount of labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information about amino acids, thereby modeling the underlying principles of protein sequences independently of the species they come from. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas for improvement. We then applied the model in a transfer-learning task by training a function predictor on the embeddings of annotated protein sequences of a single training species and making predictions on the proteins of several test species at varying evolutionary distances. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
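The abstract describes a transfer-learning workflow: fixed-length embeddings from a pre-trained protein language model serve as features, a multi-label function classifier is trained on the annotated proteins of one species, and predictions are made for proteins of other species. The sketch below illustrates that setup only in outline; the random vectors, the 1024-dimensional embedding size, the logistic-regression classifier, and the micro-F1 evaluation are illustrative assumptions, not details taken from this record, and real embeddings would come from a pre-trained protein language model.

```python
# Minimal sketch of the cross-species transfer-learning setup (assumptions noted above).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
emb_dim, n_terms = 1024, 50  # embedding size and number of GO terms (illustrative)

# "Training species": protein embeddings and multi-label GO annotations.
# Random vectors stand in for embeddings produced by a pre-trained model.
X_train = rng.normal(size=(500, emb_dim))
y_train = rng.integers(0, 2, size=(500, n_terms))

# "Test species": embeddings of proteins from a different, possibly distant species.
X_test = rng.normal(size=(100, emb_dim))
y_test = rng.integers(0, 2, size=(100, n_terms))

# One binary classifier per GO term, trained only on the training species.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Cross-species prediction and a simple multi-label evaluation.
y_pred = clf.predict(X_test)
print("micro-F1:", f1_score(y_test, y_pred, average="micro"))
```

Because the embedding space is shared across species, the same classifier can be applied unchanged to proteins of any test species, which is what allows the comparison against alignment-based and supervised-learning baselines described in the abstract.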
SUBMITTER: van den Bent I
PROVIDER: S-EPMC8647222 | biostudies-literature
REPOSITORIES: biostudies-literature