Dataset Information

Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention.

ABSTRACT: The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment. Increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and showing competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which, in a certain limit, recovers a Potts model, and use it to contrast Potts and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases not captured by single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.
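To make the relation between a Potts model and a position-only attention layer concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`potts_energy`, `factored_couplings`), the sign convention of the energy, and the exact way the attention weights and value matrices are combined into couplings are all hypothetical choices made for this example. The idea shown is that the attention weights depend only on positions, never on the residues at those positions, so they can be folded into a set of pairwise couplings that a Potts energy can use directly.

```python
import math


def potts_energy(seq, h, J):
    """Potts energy E(x) = -sum_i h[i][x_i] - sum_{i<j} J[i][j][x_i][x_j].

    seq: list of integer residue indices, one per position.
    h:   per-position fields, h[i][a].
    J:   pairwise couplings, J[i][j][a][b] (only i < j is read here).
    """
    length = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(length))
    for i in range(length):
        for j in range(i + 1, length):
            energy -= J[i][j][seq[i]][seq[j]]
    return energy


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [v / z for v in exps]


def factored_couplings(Q, K, V):
    """Build couplings from position-only attention (assumed form).

    J[i][j][a][b] = sum over heads of softmax_j(Q[head][i] . K[head][j]) * V[head][a][b]

    Q, K: per-head lists of position vectors (hypothetical learned parameters).
    V:    per-head alphabet-by-alphabet value matrices.
    """
    heads, length, alphabet = len(Q), len(Q[0]), len(V[0])
    J = [[[[0.0] * alphabet for _ in range(alphabet)]
          for _ in range(length)] for _ in range(length)]
    for head in range(heads):
        for i in range(length):
            scores = [sum(q * k for q, k in zip(Q[head][i], K[head][j]))
                      for j in range(length)]
            attention = softmax(scores)  # depends on positions only
            for j in range(length):
                for a in range(alphabet):
                    for b in range(alphabet):
                        J[i][j][a][b] += attention[j] * V[head][a][b]
    return J


# Toy example: 2 positions, 2-letter alphabet, 1 head, zero fields.
Q = [[[0.0], [0.0]]]            # zero queries give uniform attention
K = [[[0.0], [0.0]]]
V = [[[1.0, 0.0], [0.0, 1.0]]]  # rewards identical residues at a pair
h = [[0.0, 0.0], [0.0, 0.0]]
J = factored_couplings(Q, K, V)
e_same = potts_energy([0, 0], h, J)
e_diff = potts_energy([0, 1], h, J)
```

In this toy setting the sequence with matching residues gets the lower energy, because the single value matrix rewards identical pairs. With enough heads, each concentrating its attention on a single position pair, a construction of this kind can represent an arbitrary coupling tensor, which is the sense in which such a layer could recover a Potts model in a limit.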

SUBMITTER: Bhattacharya N

PROVIDER: S-EPMC8752338 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

S-Swin Transformer: simplified Swin Transformer model for offline handwritten Chinese character recognition.

Project description:The Transformer shows good prospects in computer vision. However, the Swin Transformer model has the disadvantage of a large number of parameters and high computational effort. To effectively solve these problems of the model, a simplified Swin Transformer (S-Swin Transformer) model was proposed in this article for handwritten Chinese character recognition. The model simplifies the initial four hierarchical stages into three hierarchical stages. In addition, the new model increases the size of the window in the window attention; the number of patches in the window is larger; and the perceptual field of the window is increased. As the network model deepens, the size of patches becomes larger, and the perceived range of each patch increases. Meanwhile, the purpose of shifting the window's attention is to enhance the information interaction between the window and the window. Experimental results show that the verification accuracy improves slightly as the window becomes larger. The best validation accuracy of the simplified Swin Transformer model on the dataset reached 95.70%. The number of parameters is only 8.69 million, and FLOPs are 2.90G, which greatly reduces the number of parameters and computation of the model and proves the correctness and validity of the proposed model.

| S-EPMC9575930 | biostudies-literature

Interpreting models interpreting brain dynamics.

Project description:Brain dynamics are highly complex and yet hold the key to understanding brain function and dysfunction. The dynamics captured by resting-state functional magnetic resonance imaging data are noisy, high-dimensional, and not readily interpretable. The typical approach of reducing this data to low-dimensional features and focusing on the most predictive features comes with strong assumptions and can miss essential aspects of the underlying dynamics. In contrast, introspection of discriminatively trained deep learning models may uncover disorder-relevant elements of the signal at the level of individual time points and spatial locations. Yet, the difficulty of reliable training on high-dimensional low sample size datasets and the unclear relevance of the resulting predictive markers prevent the widespread use of deep learning in functional neuroimaging. In this work, we introduce a deep learning framework to learn from high-dimensional dynamical data while maintaining stable, ecologically valid interpretations. Results successfully demonstrate that the proposed framework enables learning the dynamics of resting-state fMRI directly from small data and capturing compact, stable interpretations of features predictive of function and dysfunction.

| S-EPMC9304350 | biostudies-literature

Remote homology search with hidden Potts models.

Project description:Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.

| S-EPMC7728182 | biostudies-literature

Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness.

Project description:Potts Hamiltonian models of protein sequence co-variation are statistical models constructed from the pair correlations observed in a multiple sequence alignment (MSA) of a protein family. These models are powerful because they capture higher order correlations induced by mutations evolving under constraints and help quantify the connections between protein sequence, structure, and function maintained through evolution. We review recent work with Potts models to predict protein structure and sequence-dependent conformational free energy landscapes, to survey protein fitness landscapes and to explore the effects of epistasis on fitness. We also comment on the numerical methods used to infer these models for each application.

| S-EPMC5869684 | biostudies-literature

Diabetes through a 3D lens: organoid models.

Project description:Diabetes is one of the most challenging health concerns facing society. Available drugs treat the symptoms but there is no cure. This presents an urgent need to better understand human diabetes in order to develop improved treatments or target remission. New disease models need to be developed that more accurately describe the pathology of diabetes. Organoid technology provides an opportunity to fill this knowledge gap. Organoids are 3D structures, established from pluripotent stem cells or adult stem/progenitor cells, that recapitulate key aspects of the in vivo tissues they mimic. In this review we briefly introduce organoids and their benefits; we focus on organoids generated from tissues important for glucose homeostasis and tissues associated with diabetic complications. We hope this review serves as a touchstone to demonstrate how organoid technology extends the research toolbox and can deliver a step change of discovery in the field of diabetes.

| S-EPMC7228904 | biostudies-literature

Decomposing protein-DNA binding and recognition using simplified protein models.

Project description:We analyze the role of different physicochemical factors in protein/DNA binding and recognition by comparing the results from all-atom molecular dynamics simulations with simulations using simplified protein models. These models enable us to separate the role of specific amino acid side chains, formal amino acid charges and hydrogen bonding from the effects of the low-dielectric volume occupied by the protein. Comparisons are made on the basis of the conformation of DNA after protein binding, the ionic distribution around the complex and the sequence specificity. The results for four transcription factors, binding in either the minor or major grooves of DNA, show that the protein volume and formal charges, with one exception, play a predominant role in binding. Adding hydrogen bonding and a very small number of key amino acid side chains at the all-atom level yields results in DNA conformations and sequence recognition close to those seen in the reference all-atom simulations.

| S-EPMC5622342 | biostudies-literature

Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data

Project description:Global coevolutionary models of protein families have become increasingly popular due to their capacity to predict residue–residue contacts from sequence information, but also to predict fitness effects of amino acid substitutions or to infer protein–protein interactions. The central idea in these models is to construct a probability distribution, a Potts model, that reproduces single and pairwise frequencies of amino acids found in natural sequences of the protein family. This approach treats sequences from the family as independent samples, completely ignoring phylogenetic relations between them. This simplification is known to lead to potentially biased estimates of the parameters of the model, decreasing their biological relevance. Current workarounds for this problem, such as reweighting sequences, are poorly understood and not principled. Here, we propose an inference scheme that takes the phylogeny of a protein family into account in order to correct biases in estimating the frequencies of amino acids. Using artificial data, we show that a Potts model inferred using these corrected frequencies performs better in predicting contacts and fitness effect of mutations. First, only partially successful tests on real protein data are presented, too.

| S-EPMC7514434 | biostudies-literature

Interpreting mosquito feeding patterns in Australia through an ecological lens: an analysis of blood meal studies.

Project description:Background: Mosquito-borne pathogens contribute significantly to the global burden of disease, infecting millions of people each year. Mosquito feeding is critical to the transmission dynamics of pathogens, and thus it is important to understand and interpret mosquito feeding patterns. In this paper we explore mosquito feeding patterns and their implications for disease ecology through a meta-analysis of published blood meal results collected across Australia from more than 12,000 blood meals from 22 species. To assess mosquito-vertebrate associations and identify mosquitoes on a spectrum of generalist or specialist feeders, we analysed blood meal data in two ways: first using a novel odds ratio analysis, and secondly by calculating Shannon's diversity scores. Results: We find that each mosquito species had a unique feeding association with different vertebrates, suggesting species-specific feeding patterns. Broadly, mosquito species could be grouped into those that were primarily ornithophilic and those that fed more often on livestock. Aggregated feeding patterns observed across Australia were not explained by intrinsic variables such as mosquito genetics or larval habitats. We discuss the implications for disease transmission by vector mosquito species classified as generalist-feeders (such as Aedes vigilax and Culex annulirostris), or specialists (such as Aedes aegypti) in light of potential influences on mosquito host choice. Conclusions: Overall, we find that whilst existing blood meal studies in Australia are useful for investigating mosquito feeding patterns, standardisation of blood meal study methodologies and analyses, including the incorporation of vertebrate surveys, would improve predictions of the impact of vector-host interactions on disease ecology. Our analysis can also be used as a framework to explore mosquito-vertebrate associations, in which host availability data is unavailable, in other global systems.

| S-EPMC6448275 | biostudies-literature

How good are simplified models for protein structure prediction?

Project description:Protein structure prediction (PSP) has been one of the most challenging problems in computational biology for several decades. The challenge is largely due to the complexity of the all-atomic details and the unknown nature of the energy function. Researchers have therefore used simplified energy models that consider interaction potentials only between the amino acid monomers in contact on discrete lattices. The restricted nature of the lattices and the energy models poses a twofold concern regarding the assessment of the models. Can a native or a very close structure be obtained when structures are mapped to lattices? Can the contact based energy models on discrete lattices guide the search towards the native structures? In this paper, we use the protein chain lattice fitting (PCLF) problem to address the first concern; we developed a constraint-based local search algorithm for the PCLF problem for cubic and face-centered cubic lattices and found very close lattice fits for the native structures. For the second concern, we use a number of techniques to sample the conformation space and find correlations between energy functions and root mean square deviation (RMSD) distance of the lattice-based structures with the native structures. Our analysis reveals weakness of several contact based energy models used that are popular in PSP.

| S-EPMC4022063 | biostudies-literature

Interpreting protein abundance in Saccharomyces cerevisiae through relational learning.

Project description:Motivation: Proteomic profiles reflect the functional readout of the physiological state of an organism. An increased understanding of what controls and defines protein abundances is of high scientific interest. Saccharomyces cerevisiae is a well-studied model organism, and there is a large amount of structured knowledge on yeast systems biology in databases such as the Saccharomyces Genome Database, and highly curated genome-scale metabolic models like Yeast8. These datasets, the result of decades of experiments, are abundant in information, and adhere to semantically meaningful ontologies. Results: By representing this knowledge in an expressive Datalog database we generated data descriptors using relational learning that, when combined with supervised machine learning, enables us to predict protein abundances in an explainable manner. We learnt predictive relationships between protein abundances, function and phenotype; such as α-amino acid accumulations and deviations in chronological lifespan. We further demonstrate the power of this methodology on the proteins His4 and Ilv2, connecting qualitative biological concepts to quantified abundances. Availability and implementation: All data and processing scripts are available at the following Github repository: https://github.com/DanielBrunnsaker/ProtPredict.

| S-EPMC10868306 | biostudies-literature
