
Dataset Information


The generative capacity of probabilistic protein sequence models.


ABSTRACT: Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
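To illustrate why covariation statistics can separate these model classes, here is a minimal sketch in Python with NumPy. It is not the authors' code or their specific metrics. It fits a site-independent model to a toy alignment, samples from it, and compares pairwise covariances C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b) between the data and the samples. The function names, the toy alignment, and the alphabet size `q = 2` are all assumptions made for illustration.

```python
import numpy as np

def site_frequencies(msa, q):
    """Per-site frequencies f_i(a); msa is an (N, L) integer array with values in [0, q)."""
    one_hot = np.eye(q)[msa]             # shape (N, L, q)
    return one_hot.mean(axis=0)          # shape (L, q)

def pair_covariances(msa, q):
    """Pairwise covariances C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b), shape (L, L, q, q)."""
    n = msa.shape[0]
    one_hot = np.eye(q)[msa]             # shape (N, L, q)
    f_i = one_hot.mean(axis=0)
    f_ij = np.einsum("nia,njb->ijab", one_hot, one_hot) / n
    return f_ij - np.einsum("ia,jb->ijab", f_i, f_i)

def sample_site_independent(freqs, n_samples, rng):
    """Draw sequences with every position sampled independently from its own frequencies."""
    length, q = freqs.shape
    cols = [rng.choice(q, size=n_samples, p=freqs[i]) for i in range(length)]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)

# Hypothetical toy "natural" alignment: positions 0 and 1 are perfectly
# coupled, position 2 varies independently of both.
q = 2
coupled = rng.integers(0, q, size=5000)
toy_msa = np.stack([coupled, coupled, rng.integers(0, q, size=5000)], axis=1)

# Fit the site-independent model and generate a synthetic alignment from it.
freqs = site_frequencies(toy_msa, q)
generated = sample_site_independent(freqs, 5000, rng)

c_data = pair_covariances(toy_msa, q)
c_model = pair_covariances(generated, q)

print("data  C_01(0,0):", c_data[0, 1, 0, 0])
print("model C_01(0,0):", c_model[0, 1, 0, 0])
```

In this toy setting the site-independent samples match the single-site frequencies of the data but lose the coupling between positions 0 and 1, so their covariance at that pair is close to zero. A model with pairwise terms, such as a Potts model, is the kind of model that could in principle reproduce it.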

SUBMITTER: McGee F 

PROVIDER: S-EPMC8563988 | biostudies-literature | 2021 Nov

REPOSITORIES: biostudies-literature


Publications

The generative capacity of probabilistic protein sequence models.

McGee F, Hauri S, Novinger Q, Vucetic S, Levy RM, Carnevale V, Haldane A

Nature Communications, 2021-11-02, issue 1



Similar Datasets

| S-EPMC3117378 | biostudies-literature
| S-EPMC9258900 | biostudies-literature
| S-EPMC11870403 | biostudies-literature
| S-EPMC2440424 | biostudies-literature
| S-EPMC11912200 | biostudies-literature
| S-EPMC10787843 | biostudies-literature
| S-EPMC5387160 | biostudies-literature
| S-EPMC7829634 | biostudies-literature
| S-EPMC10160065 | biostudies-literature
2023-07-10 | GSE221870 | GEO