
Dataset Information

A novel sequence alignment algorithm based on deep learning of the protein folding code.

ABSTRACT:

Motivation

From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the 'twilight zone' of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent 'd'). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments using experimentally determined protein structures.

Results

To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure α-helical proteins successfully recognizes pairs of structurally related pure β-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is ∼150% better than HHsearch for generating pairwise alignments and ∼50% better for identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration.

Availability and implementation

Datasets and source codes of SAdLSA are available free of charge for academic users at http://sites.gatech.edu/cssb/sadlsa/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Gao M

PROVIDER: S-EPMC8599902 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Pairwise Heuristic Sequence Alignment Algorithm Based on Deep Reinforcement Learning.

Project description:Goal: Various methods have been developed to analyze the association between organisms and their genomic sequences. Among them, sequence alignment is the most frequently used method for comparative analysis of biological genomes. We propose a novel pairwise sequence alignment method that uses deep reinforcement learning to move beyond conventional pairwise alignment algorithms. Methods: We defined the environment and agent to enable reinforcement learning in the sequence alignment system. This novel method, named DQNalign, can immediately determine the next direction by observing the subsequences within the moving window. Results: DQNalign performs better on dissimilar sequence pairs that have low identity values. We also confirm theoretically that the complexity of DQNalign is of low order in the sequence length. Conclusions: This research shows how deep reinforcement learning can be applied to the sequence alignment system and how it can improve conventional sequence alignment methods.

| S-EPMC8901008 | biostudies-literature
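
The moving-window idea in the DQNalign description can be illustrated with a minimal sketch. This is not the published method: the learned deep Q-network is replaced here by a hand-written match count over the window (an assumption made purely for illustration), and the names `window_score` and `greedy_window_align` are hypothetical. At each step the walker looks only at the next few residues of each sequence and picks one of three directions, so the work per step depends on the window size rather than on the full sequence length.

```python
def window_score(a, b, i, j, window):
    """Count identical positions between a[i:i+window] and b[j:j+window]."""
    return sum(1 for x, y in zip(a[i:i + window], b[j:j + window]) if x == y)


def greedy_window_align(a, b, window=4):
    """Align a and b by repeatedly choosing the locally best direction.

    Stand-in for a learned agent: the 'policy' is a plain match count
    over the window, not a trained Q-network.
    """
    i = j = 0
    out_a, out_b = [], []
    while i < len(a) and j < len(b):
        # Score each direction by what the window would show after taking it.
        moves = [
            ("match", window_score(a, b, i, j, window)),
            ("gap_in_b", window_score(a, b, i + 1, j, window)),  # consume a[i] only
            ("gap_in_a", window_score(a, b, i, j + 1, window)),  # consume b[j] only
        ]
        best = max(moves, key=lambda m: m[1])[0]  # ties go to the first entry, "match"
        if best == "match":
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif best == "gap_in_b":
            out_a.append(a[i])
            out_b.append("-")
            i += 1
        else:
            out_a.append("-")
            out_b.append(b[j])
            j += 1
    # One sequence is exhausted: pad the remainder of the other with gaps.
    out_a.append(a[i:] + "-" * (len(b) - j))
    out_b.append("-" * (len(a) - i) + b[j:])
    return "".join(out_a), "".join(out_b)
```

Because each decision is local and greedy, this sketch can place gaps earlier than an optimal dynamic-programming alignment would; a trained agent is meant to make better local decisions than this simple count.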

Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

Project description:Background: Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. Results: We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. Conclusions: To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

| S-EPMC5445391 | biostudies-literature

Distance-based protein folding powered by deep learning.

Project description:Direct coupling analysis (DCA) for protein folding has made very good progress, but it is not effective for proteins that lack many sequence homologs, even coupled with time-consuming conformation sampling with fragments. We show that we can accurately predict interresidue distance distribution of a protein by deep learning, even for proteins with ∼60 sequence homologs. Using only the geometric constraints given by the resulting distance matrix we may construct 3D models without involving extensive conformation sampling. Our method successfully folded 21 of the 37 CASP12 hard targets with a median family size of 58 effective sequence homologs within 4 h on a Linux computer of 20 central processing units. In contrast, DCA-predicted contacts cannot be used to fold any of these hard targets in the absence of extensive conformation sampling, and the best CASP12 group folded only 11 of them by integrating DCA-predicted contacts into fragment-based conformation sampling. Rigorous experimental validation in CASP13 shows that our distance-based folding server successfully folded 17 of 32 hard targets (with a median family size of 36 sequence homologs) and obtained 70% precision on the top L/5 long-range predicted contacts. The latest experimental validation in CAMEO shows that our server predicted correct folds for 2 membrane proteins while all of the other servers failed. These results demonstrate that it is now feasible to predict the correct fold for many more proteins lacking similar structures in the Protein Data Bank, even on a personal computer.

| S-EPMC6708335 | biostudies-literature
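
The "precision on the top L/5 long-range predicted contacts" quoted above is a standard contact-prediction metric. The sketch below shows one common way to compute it, under assumptions not stated in the abstract: a contact is taken to be a residue pair closer than 8 Å, "long-range" is taken to mean a sequence separation of at least 24 residues, and the function name and parameters are hypothetical.

```python
def top_l5_long_range_precision(pred_prob, true_dist, min_sep=24, cutoff=8.0):
    """Precision of the L/5 highest-scoring long-range contact predictions.

    pred_prob: L x L matrix (list of lists) of predicted contact scores.
    true_dist: L x L matrix of native inter-residue distances in angstroms.
    min_sep and cutoff are the assumed long-range and contact thresholds.
    """
    length = len(pred_prob)
    # Collect every residue pair separated by at least min_sep positions.
    candidates = [
        (pred_prob[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    candidates.sort(key=lambda t: t[0], reverse=True)
    top = candidates[:max(1, length // 5)]
    if not top:
        return 0.0
    # A prediction counts as correct if the native distance is below the cutoff.
    hits = sum(1 for _, i, j in top if true_dist[i][j] < cutoff)
    return hits / len(top)
```

For a protein of length L = 30, for example, the six highest-scoring long-range pairs are checked against the native distances, and the returned value is the fraction of those six that are true contacts.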

Predicting protein-protein interactions through sequence-based deep learning.

Project description:Motivation: High-throughput experimental techniques have produced a large amount of protein-protein interaction (PPI) data, but their coverage is still low and the PPI data is also very noisy. Computational prediction of PPIs can be used to discover new PPIs and identify errors in the experimental PPI data. Results: We present a novel deep learning framework, DPPI, to model and predict PPIs from sequence information alone. Our model efficiently applies a deep, Siamese-like convolutional neural network combined with random projection and data augmentation to predict PPIs, leveraging existing high-quality experimental PPI data and evolutionary information of a protein pair under prediction. Our experimental results show that DPPI outperforms the state-of-the-art methods on several benchmarks in terms of area under precision-recall curve (auPR), and computationally is more efficient. We also show that DPPI is able to predict homodimeric interactions where other methods fail to work accurately, and the effectiveness of DPPI in specific applications such as predicting cytokine-receptor binding affinities. Availability and implementation: https://github.com/hashemifar/DPPI/. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC6129267 | biostudies-literature

Robust deep learning-based protein sequence design using ProteinMPNN.

Project description:Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.

| S-EPMC9997061 | biostudies-literature

R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm.

Project description:We present a fast pairwise RNA sequence alignment method using structural information, named R-PASS (RNA Pairwise Alignment of Structure and Sequence), which shows good accuracy on sequences with low sequence identity and is significantly faster than alternative methods. The method begins by representing RNA secondary structure as a set of structure motifs. The motifs from two RNAs are then used as input into a bipartite graph-matching algorithm, which determines the structure matches. The matches are then used as constraints in a constrained dynamic programming sequence alignment procedure. The R-PASS method has an O(nm) complexity. We compare our method with two other structure-based alignment methods, LARA and ExpaLoc, and with a sequence-based alignment method, MAFFT, across three benchmarks and obtain favorable accuracy while running orders of magnitude faster.

| S-EPMC3999979 | biostudies-literature
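
The last step in the R-PASS description, a dynamic programming alignment constrained by previously determined matches, can be sketched as follows. This is an illustrative simplification, not the R-PASS implementation: the constraints are assumed to be single residue pairs (`anchors`) rather than structure motifs, the scoring values are arbitrary, and both function names are hypothetical. The sequences are cut at each anchor and every segment between anchors is aligned with ordinary Needleman-Wunsch, so the total work stays within O(nm).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Plain global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_align(a, b, anchors, match=1, mismatch=-1, gap=-1):
    """Align a and b so that every (i, j) in anchors ends up in one column.

    anchors: 0-based index pairs, strictly increasing in both coordinates.
    """
    out_a, out_b = [], []
    prev_i = prev_j = 0
    for i, j in anchors:
        if i < prev_i or j < prev_j:
            raise ValueError("anchors must be strictly increasing")
        # Align the unconstrained stretch before this anchor, then pin the anchor.
        seg_a, seg_b = needleman_wunsch(a[prev_i:i], b[prev_j:j], match, mismatch, gap)
        out_a += [seg_a, a[i]]
        out_b += [seg_b, b[j]]
        prev_i, prev_j = i + 1, j + 1
    seg_a, seg_b = needleman_wunsch(a[prev_i:], b[prev_j:], match, mismatch, gap)
    out_a.append(seg_a)
    out_b.append(seg_b)
    return "".join(out_a), "".join(out_b)
```

Calling `anchored_align("ACGUACGU", "ACGACGU", [(3, 3)])` forces position 3 of each sequence into the same column even though the residues differ, which is how a structural match would override a purely sequence-based choice.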

Development and validation of a deep learning-based protein electrophoresis classification algorithm.

Project description:Background: Protein electrophoresis (PEP) is an important tool in supporting the analytical characterization of protein status in diseases related to monoclonal components, inflammation, and antibody deficiency. Here, we developed a deep learning-based PEP classification algorithm to supplement the labor-intensive PEP interpretation and enhance inter-observer reliability. Methods: A total of 2,578 gel images and densitogram PEP images from January 2018 to July 2019 were split into training (80%), validation (10%), and test (10%) sets. The PEP images were assessed based on six major findings (acute-phase protein, monoclonal gammopathy, polyclonal gammopathy, hypoproteinemia, nephrotic syndrome, and normal). The images underwent processing, including color-to-grayscale and histogram equalization, and were input into neural networks. Results: Using densitogram PEP images, the area under the receiver operating characteristic curve (AUROC) for each diagnosis ranged from 0.873 to 0.989, and the accuracy for classifying all the findings ranged from 85.2% to 96.9%. For gel images, the AUROC ranged from 0.763 to 0.965, and the accuracy ranged from 82.0% to 94.5%. Conclusions: The deep learning algorithm demonstrated good performance in classifying PEP images. It is expected to be useful as an auxiliary tool for screening the results and helpful in environments where specialists are scarce.

| S-EPMC9401151 | biostudies-literature

Evaluation and optimization of sequence-based gene regulatory deep learning models

Project description:As training dataset for Random Promoter DREAM Challenge 2022, we generated ~6.7 million synthetic promoters (in yeast) comprised of random DNA (N80) and measured their expression by FACS (sorting into 18 bins). As test dataset for Random Promoter DREAM Challenge 2022, we generated ~71k synthetic promoters (in yeast) comprised of designed DNA (NBT) and measured their expression by FACS (sorting into 18 bins).

2024-02-03 | GSE254493 | GEO

Learning the sequence code for mRNA and protein abundance in human immune cells

Project description:mRNA and protein abundance are defined by transcriptional and post-transcriptional regulatory mechanisms. Here, we develop a machine learning pipeline, termed SONAR, to decipher the endogenous sequence code that determines mRNA and protein abundance in human cells. SONAR models predict up to 62% of mRNA and 63% of protein abundance independent of promoter or enhancer information, and reveal a strong—yet dynamic—cell-type specific sequence code. We also find that the effect of sequence features is dependent on their location within the mRNA transcript. Using SONAR, we design synthetic 3’UTRs, with which protein expression levels can be manipulated and tailored to a specific cell-type. Beyond its fundamental findings, our work provides novel means to improve immunotherapies and biotechnology applications.

2023-09-20 | GSE240919 | GEO

Unified rational protein engineering with sequence-based deep representation learning.

Project description:Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.

| S-EPMC7067682 | biostudies-literature
