Dataset Information

Evaluating Protein Transfer Learning with TAPE.


ABSTRACT: Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

SUBMITTER: Rao R 

PROVIDER: S-EPMC7774645 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature

Publications

Evaluating Protein Transfer Learning with TAPE.

Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, Abbeel P, Song YS

Advances in Neural Information Processing Systems, 2019-12-01



Similar Datasets

| S-EPMC8452401 | biostudies-literature
| S-EPMC4325334 | biostudies-literature
| S-EPMC6094388 | biostudies-literature
| S-EPMC4639845 | biostudies-literature
| S-EPMC4730153 | biostudies-literature
| S-EPMC7179114 | biostudies-literature
| S-EPMC3374840 | biostudies-literature
| S-EPMC10311290 | biostudies-literature
| S-EPMC6227951 | biostudies-other
2021-04-07 | GSE171636 | GEO