Dataset Information

Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences


ABSTRACT: More than 50 human diseases are characterized by the deposition of specific protein aggregates in the form of insoluble amyloid fibrils. However, only a very small number of proteins are known to form amyloids with high propensity, limiting our ability to understand, predict and engineer amyloid aggregation from sequence. Here we use a massively parallel assay to quantify the amyloid nucleation propensity of >100,000 random 20-amino-acid sequences. Approximately 5% of assayed random sequences nucleate the formation of aggregates, generating a very large and diverse dataset from which to train models to predict amyloid nucleation. We use this dataset to train CANYA, a convolution-attention hybrid neural network that predicts the propensity of any primary sequence to form amyloids. CANYA outperforms previous predictors of protein aggregation on additional random sequences and out-of-sample datasets including human disease-causing amyloids, with very stable performance across diverse prediction tasks. We adapt and extend recent advances in interpretability of genomic neural networks to elucidate CANYA’s decision-making process and learned grammar and to provide mechanistic insights into amyloid formation. Our results demonstrate the power of massive experimental random sequence-space exploration and provide an interpretable and robust neural network model for understanding, predicting and designing amyloid-forming proteins.

ORGANISM(S): Saccharomyces cerevisiae

PROVIDER: GSE268261 | GEO | 2024/07/17

REPOSITORIES: GEO

Similar Datasets

2023-10-09 | GSE244612 | GEO
2024-08-02 | GSE270792 | GEO
2015-04-15 | MODEL1410310000 | BioModels
2022-02-28 | PXD006640 | Pride
2024-07-26 | GSE269461 | GEO
2022-10-27 | GSE193837 | GEO
2022-07-16 | PXD019498 | Pride
2022-04-01 | MSV000089188 | MassIVE
2017-08-09 | PXD006835 | Pride
2024-07-09 | PXD046280 | Pride