Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences
ABSTRACT: More than 50 human diseases are characterized by the deposition of specific protein aggregates in the form of insoluble amyloid fibrils. However, only a very small number of proteins are known to form amyloids with high propensity, limiting our ability to understand, predict and engineer amyloid aggregation from sequence. Here we use a massively parallel assay to quantify the amyloid nucleation propensity of >100,000 random 20-amino-acid sequences. Approximately 5% of assayed random sequences nucleate the formation of aggregates, yielding a large and diverse dataset on which to train models that predict amyloid nucleation. We use this dataset to train CANYA, a convolution-attention hybrid neural network that predicts the propensity of any primary sequence to form amyloids. CANYA outperforms previous predictors of protein aggregation on additional random sequences and on out-of-sample datasets, including human disease-causing amyloids, with very stable performance across diverse prediction tasks. We adapt and extend recent advances in the interpretability of genomic neural networks to elucidate CANYA’s decision-making process and learned grammar and to provide mechanistic insights into amyloid formation. Our results demonstrate the power of massive experimental exploration of random sequence space and provide an interpretable and robust neural network model for understanding, predicting and designing amyloid-forming proteins.
ORGANISM(S): Saccharomyces cerevisiae
PROVIDER: GSE268261 | GEO | 2024/07/17
REPOSITORIES: GEO