Dataset Information

Embeddings from protein language models predict conservation and variant effects.

ABSTRACT: The emergence of SARS-CoV-2 variants stressed the demand for tools that can interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others consistently or with statistical significance, regardless of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
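
The abstract reports conservation accuracy as a two-state Matthews Correlation Coefficient (MCC). As a reading aid, the following minimal sketch shows how a two-state MCC is computed from paired binary labels. It is not taken from the VESPA code base: the function name, the label encoding (1 = conserved, 0 = variable), and the toy labels are assumptions made for illustration only.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Two-state MCC for paired binary labels.

    Assumed encoding (illustrative only): 1 = conserved, 0 = variable.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    # Count the four cells of the 2x2 confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, invented for this example (not data from the study).
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(observed, predicted))  # prints 0.5
```

MCC ranges from -1 (every label wrong) through 0 (no better than chance) to 1 (every label right), which makes it a reasonable summary when the two states are unevenly represented.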

SUBMITTER: Marquet C

PROVIDER: S-EPMC8716573 | biostudies-literature | 2022 Oct

REPOSITORIES: biostudies-literature

Publications

Embeddings from protein language models predict conservation and variant effects.

Céline Marquet, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev, Burkhard Rost

Human Genetics, 2021-12-30, issue 10

PMID: 34967936

Similar Datasets

Project description: A mosaic of intact native and human-modified vegetation can provide important habitat for top predators such as the puma (Puma concolor), and thereby avoid negative effects on other species and ecological processes caused by trophic cascades. This study investigates the effects of restoration scenarios on the puma's habitat suitability in the most developed Brazilian region (São Paulo State). Species Distribution Models incorporating restoration scenarios were developed using the species' occurrence information to (1) map habitat suitability of pumas in São Paulo State, Southeast Brazil; (2) test the relative contribution of environmental variables ecologically relevant to the species' habitat suitability; and (3) project the predicted habitat suitability to future native vegetation restoration scenarios. The Maximum Entropy algorithm was used (test AUC of 0.84 ± 0.0228) based on seven non-correlated environmental variables and non-autocorrelated presence-only records (n = 342). The percentage of native vegetation (positive influence), elevation (positive influence) and density of roads (negative influence) were the most important environmental variables for the model. Model projections to restoration scenarios reflected the strong positive relationship between pumas and native vegetation. These projections identified new high suitability areas for pumas (probability of presence >0.5) in highly deforested regions. High suitability areas increased from 5.3% to 8.5% of the total State extension when the landscapes were restored to at least the minimum native vegetation cover (20%) that the Brazilian Forest Code requires on private lands. This study highlights the importance of a landscape planning approach to improve the conservation outlook for pumas and other species, including not only the establishment and management of protected areas but also habitat restoration on private lands. Importantly, the results may inform environmental policies and land use planning in São Paulo State, Brazil.
