Dataset Information

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

ABSTRACT: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.

SUBMITTER: Rives A

PROVIDER: S-EPMC8053943 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

DeLUCS: Deep learning for unsupervised clustering of DNA sequences.

Project description: We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it can cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.

| S-EPMC8782307 | biostudies-literature

Zymogenic latency in an ∼250-million-year-old astacin metallopeptidase.

Project description: The horseshoe crab Limulus polyphemus is one of few extant Limulus species, which date back to ∼250 million years ago under the conservation of a common Bauplan documented by fossil records. It possesses the only proteolytic blood-coagulation and innate immunity system outside vertebrates and is a model organism for the study of the evolution and function of peptidases. The astacins are a family of metallopeptidases that share a central ∼200-residue catalytic domain (CD), which is found in >1000 species across holozoans and, sporadically, bacteria. Here, the zymogen of an astacin from L. polyphemus was crystallized and its structure was solved. A 34-residue, mostly unstructured pro-peptide (PP) traverses, and thus blocks, the active-site cleft of the CD in the opposite direction to a substrate. A central 'PP motif' (F35-E-G-D-I39) adopts a loop structure which positions Asp38 to bind the catalytic metal, replacing the solvent molecule required for catalysis in the mature enzyme according to an 'aspartate-switch' mechanism. Maturation cleavage of the PP liberates the cleft and causes the rearrangement of an 'activation segment'. Moreover, the mature N-terminus is repositioned to penetrate the CD moiety and is anchored to a buried 'family-specific' glutamate. Overall, this mechanism of latency is reminiscent of that of the other three astacins with known zymogenic and mature structures, namely crayfish astacin, human meprin β and bacterial myroilysin, but each shows specific structural characteristics. Remarkably, myroilysin lacks the PP motif and employs a cysteine instead of the aspartate to block the catalytic metal.

| S-EPMC9629494 | biostudies-literature

Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences.

Project description: Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked by the fact that most proteins have only been functionally validated in model organisms, limiting our understanding of how function varies with gene sequence diversity. Thus, accuracy of inferences in clades without model representatives is questionable. Unsupervised learning may help to ameliorate this bias by identifying highly complex patterns and structure from large datasets without external labels. Here we present DeepSeqProt, an unsupervised deep learning program for exploring large protein sequence datasets. DeepSeqProt is a clustering tool capable of distinguishing between broad classes of proteins while learning local and global structure of functional space. DeepSeqProt is capable of learning salient biological features from unaligned, unannotated sequences. DeepSeqProt is more likely to capture complete protein families and statistically significant shared ontologies within proteomes than other clustering methods. We hope this framework will prove of use to researchers and provide a preliminary step in further developing unsupervised deep learning in molecular biology.

| S-EPMC10231473 | biostudies-literature

Anomaly Detection in Biological Early Warning Systems Using Unsupervised Machine Learning.

Project description: The use of bivalve mollusks as bioindicators in automated monitoring systems can provide real-time detection of emergency situations associated with the pollution of aquatic environments. The behavioral reactions of Unio pictorum (Linnaeus, 1758) were employed in the development of a comprehensive automated monitoring system for aquatic environments by the authors. The study used experimental data obtained by an automated system from the Chernaya River in the Sevastopol region of the Crimean Peninsula. Four traditional unsupervised machine learning techniques were implemented to detect emergency signals in the activity of bivalves: elliptic envelope, isolation forest (iForest), one-class support vector machine (SVM), and local outlier factor (LOF). The results showed that the use of the elliptic envelope, iForest, and LOF methods with proper hyperparameter tuning can detect anomalies in mollusk activity data without false alarms, with an F1 score of 1. A comparison of anomaly detection times revealed that the iForest method is the most efficient. These findings demonstrate the potential of using bivalve mollusks as bioindicators in automated monitoring systems for the early detection of pollution in aquatic environments.

| S-EPMC10007031 | biostudies-literature

pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks.

Project description: Summary: Convolutional neural networks (CNNs) have been shown to perform exceptionally well in a variety of tasks, including biological sequence classification. Available implementations, however, are usually optimized for a particular task and difficult to reuse. To enable researchers to utilize these networks more easily, we implemented pysster, a Python package for training CNNs on biological sequence data. Sequences are classified by learning sequence and structure motifs and the package offers an automated hyper-parameter optimization procedure and options to visualize learned motifs along with information about their positional and class enrichment. The package runs seamlessly on CPU and GPU and provides a simple interface to train and evaluate a network with a handful of lines of code. Using an RNA A-to-I editing dataset and cross-linking immunoprecipitation (CLIP)-seq binding site sequences, we demonstrate that pysster classifies sequences with higher accuracy than previous methods, such as GraphProt or ssHMM, and is able to recover known sequence and structure motifs. Availability and implementation: pysster is freely available at https://github.com/budach/pysster. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC6129303 | biostudies-literature

Unsupervised-learning-based method for chest MRI-CT transformation using structure constrained unsupervised generative attention networks.

Project description: The integrated positron emission tomography/magnetic resonance imaging (PET/MRI) scanner simultaneously acquires metabolic information via PET and morphological information using MRI. However, attenuation correction, which is necessary for quantitative PET evaluation, is difficult as it requires the generation of attenuation-correction maps from MRI, which has no direct relationship with the gamma-ray attenuation information. MRI-based bone tissue segmentation is potentially available for attenuation correction in relatively rigid and fixed organs such as the head and pelvis regions. However, this is challenging for the chest region because of respiratory and cardiac motions in the chest, its anatomically complicated structure, and the thin bone cortex. We propose a new method using unsupervised generative attentional networks with adaptive layer-instance normalisation for image-to-image translation (U-GAT-IT), which specialises in unpaired image transformation based on attention maps. We added the modality-independent neighbourhood descriptor (MIND) to the loss of U-GAT-IT to guarantee anatomical consistency in the image transformation between different domains. Our proposed method obtained a synthesised computed tomography of the chest. Experimental results showed that our method outperforms current approaches. The study findings suggest the possibility of synthesising clinically acceptable computed tomography images from chest MRI with minimal changes in anatomical structures without human annotation.

| S-EPMC9247083 | biostudies-literature

Unsupervised Learning and Pattern Recognition of Biological Data Structures with Density Functional Theory and Machine Learning.

Project description: By introducing the methods of machine learning into density functional theory, we made a detour for the construction of the most probable density function, which can be estimated by learning relevant features from the system of interest. Using the properties of the universal functional, the vital core of density functional theory, the most probable cluster numbers and the corresponding cluster boundaries in a system under study can be simultaneously and automatically determined, and the plausibility rests on the Hohenberg-Kohn theorems. For method validation and pragmatic applications, interdisciplinary problems from physical to biological systems were enumerated. The amalgamation of uncharged atomic clusters validated the unsupervised searching process of the cluster numbers, and the corresponding cluster boundaries were exhibited likewise. Highly accurate clustering results on Fisher's iris dataset showed the feasibility and the flexibility of the proposed scheme. Brain tumor detections from low-dimensional magnetic resonance imaging datasets and segmentations of high-dimensional neural network imageries in the Brainbow system were also used to inspect the practicality of the method. The experimental results exhibit the successful connection between the physical theory and the machine learning methods and will benefit clinical diagnoses.

| S-EPMC5765025 | biostudies-literature

adabmDCA: adaptive Boltzmann machine learning for biological sequences.

Project description: Background: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionary-related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has been also assessed in terms of their ability in predicting mutational effects and generating in silico functional sequences. Results: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA . As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. Conclusions: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.

| S-EPMC8555268 | biostudies-literature

Ancient Rhamnaceae flowers impute an origin for flowering plants exceeding 250-million-years ago.

Project description: Setting the molecular clock to newly described 100-million-year-old flowering shoots of Phylica in Burmese amber enabled us to recalibrate the phylogenetic history of Rhamnaceae. We traced its origin to ∼260 million years ago (Ma) that can explain its migration within and beyond Gondwana since that time and implies an origin for flowering plants that stretches well beyond 290 Ma. Ancestral trait assignments also revealed that hard-seededness, fire-proneness, and to a lesser extent, heat-released seed dormancy, have a similarly long history in this clade.

| S-EPMC9254029 | biostudies-literature

A nature inspired modularity function for unsupervised learning involving spatially embedded networks.

Project description: The quality of network clustering is often measured in terms of a commonly used metric known as "modularity". Modularity compares the clusters found in a network to those present in a random graph (a "null model"). Unfortunately, modularity is somewhat ill suited for studying spatially embedded networks, since a random graph contains no basic geometrical notions. Regardless of their distance, the null model assigns a nonzero probability for an edge to appear between any pair of nodes. Here, we propose a variant of modularity that does not rely on the use of a null model. To demonstrate the essentials of our method, we analyze networks generated from granular ensemble. We show that our method performs better than the most commonly used Newman-Girvan (NG) modularity in detecting the best (physically transparent) partitions in those systems. Our measure further properly detects hierarchical structures, whenever these are present.

| S-EPMC6385190 | biostudies-literature
