Deciphering the language of antibodies using self-supervised learning
ABSTRACT:

Summary
An individual's B cell receptor (BCR) repertoire encodes information about past immune responses and the potential for future disease protection. Deciphering the information stored in BCR sequence datasets will transform our understanding of disease and enable the discovery of novel diagnostics and antibody therapeutics. A key challenge of BCR sequence analysis is predicting BCR properties from amino acid sequence alone. Here, we present an antibody-specific language model, Antibody-specific Bidirectional Encoder Representation from Transformers (AntiBERTa), which provides a contextualized representation of BCR sequences. Following pre-training, we show that AntiBERTa embeddings capture biologically relevant information that generalizes to a range of applications. As a case study, we fine-tune AntiBERTa to predict paratope positions from an antibody sequence, outperforming public tools across multiple metrics (see the code sketches below). To our knowledge, AntiBERTa is the deepest protein-family-specific language model, providing a rich representation of BCRs. AntiBERTa embeddings are primed for multiple downstream tasks and can improve our understanding of the language of antibodies.

Highlights
• AntiBERTa is an antibody-specific transformer model for representation learning
• AntiBERTa embeddings capture aspects of antibody function
• Attention maps of AntiBERTa correspond to structural contacts and binding sites
• AntiBERTa can be fine-tuned for state-of-the-art paratope prediction

The bigger picture
Understanding antibody function is critical for deciphering the biology of disease and for discovering novel therapeutic antibodies. The challenge lies in the vast diversity of antibody variants compared with the limited labeled data available. We overcome this challenge by using self-supervised learning to train a large antibody-specific language model, followed by transfer learning to fine-tune the model for predicting information related to antibody function. We first demonstrate the model's success by delivering leading results in antibody binding site prediction. The model is amenable to further fine-tuning for diverse applications, improving our understanding of antibody function.

Antibodies are guardians of the adaptive immune system, with over one billion variants in a single individual. Understanding antibody function is critical for deciphering the biology of disease and for discovering novel therapeutics. Here, we present AntiBERTa, a deep language model that learns the features and syntax, or "language," of antibodies. We demonstrate the model's capacity through a range of tasks, such as tracing the B cell origin of an antibody, quantifying immunogenicity, and predicting the antibody's binding site.
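The abstract describes AntiBERTa as a BERT-style encoder whose contextualized per-residue embeddings feed downstream analyses. The following is a minimal sketch of how such embeddings could be extracted with the Hugging Face transformers API; it is not the authors' released code. The checkpoint name "antiberta-checkpoint" is a hypothetical placeholder, and we assume a tokenizer that maps each amino acid to one token and adds one special token at each end of the sequence.

import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint path; a real AntiBERTa-like model would be
# a masked-language-model encoder pre-trained on BCR sequences.
tokenizer = AutoTokenizer.from_pretrained("antiberta-checkpoint")
model = AutoModel.from_pretrained("antiberta-checkpoint")
model.eval()

# A heavy-chain variable-region fragment, spaced so each residue is a token.
sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden_dim); dropping the
# assumed start/end special tokens leaves one embedding per residue.
residue_embeddings = outputs.last_hidden_state[0, 1:-1]
print(residue_embeddings.shape)  # (len(sequence), hidden_dim)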
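The paratope-prediction case study frames binding-site prediction as labeling each residue as paratope or non-paratope. Below is a minimal sketch of that fine-tuning setup treated as binary token classification, again assuming a hypothetical "antiberta-checkpoint" and placeholder labels rather than the paper's training data.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("antiberta-checkpoint")
model = AutoModelForTokenClassification.from_pretrained(
    "antiberta-checkpoint", num_labels=2  # 0 = non-paratope, 1 = paratope
)

sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
labels = torch.zeros(len(sequence), dtype=torch.long)  # placeholder labels

inputs = tokenizer(" ".join(sequence), return_tensors="pt")
# Mask special tokens out of the loss with the ignore index -100.
token_labels = torch.full(inputs["input_ids"].shape, -100)
token_labels[0, 1:len(sequence) + 1] = labels

outputs = model(**inputs, labels=token_labels)
outputs.loss.backward()  # an optimizer step would follow in a training loop

In this framing, the pre-trained encoder weights are reused and only a light per-token classification head is learned from the comparatively scarce labeled paratope data, which is the transfer-learning strategy the abstract describes.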
SUBMITTER: Leem J
PROVIDER: S-EPMC9278498 | biostudies-literature
REPOSITORIES: biostudies-literature