Project description:We describe a fully data-driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture consisting of two recurrent neural networks, a design that has previously shown great success in other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step toward solving the challenging problem of computational retrosynthetic analysis.
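To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how a reaction could be turned into a source/target token pair, assuming molecules are written as SMILES strings. The abstract does not state the tokenization used, so the regular expression, the function names `tokenize_smiles` and `build_pair`, and the esterification example are all illustrative assumptions rather than the authors' pipeline.

```python
import re

# Simplified SMILES token pattern (an assumption, not the paper's tokenizer):
# bracket atoms first, then two-letter halogens, two-digit ring closures,
# and finally any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens suitable for a sequence model."""
    return SMILES_TOKEN.findall(smiles)


def build_pair(product: str, reactants: list[str]) -> tuple[list[str], list[str]]:
    """Retrosynthesis pair: source = product tokens, target = reactant tokens.

    Multiple reactants are joined with '.', the SMILES separator for
    disconnected molecules.
    """
    return tokenize_smiles(product), tokenize_smiles(".".join(reactants))


# Hypothetical example: ethyl acetate as the product, acetic acid and
# ethanol as the reactants the model would be asked to predict.
source, target = build_pair("CC(=O)OCC", ["CC(=O)O", "OCC"])
print(source)
print(target)
```

An encoder would consume `source` and a decoder would be trained to emit `target` one token at a time; joining the predicted tokens gives back a SMILES string that can then be checked for validity.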
Project description:Magnetic resonance imaging offers unrivaled visualization of the fetal brain, forming the basis for establishing age-specific morphologic milestones. However, gauging age-appropriate neural development remains a difficult task due to the constantly changing appearance of the fetal brain, variable image quality, and frequent motion artifacts. Here we present an end-to-end, attention-guided deep learning model that predicts gestational age with an $R^2$ score of 0.945, a mean absolute error of 6.7 days, and a concordance correlation coefficient of 0.970. The convolutional neural network was trained on a heterogeneous dataset of 741 developmentally normal fetal brain images ranging from 19 to 39 weeks in gestational age. We also demonstrate model performance and generalizability using independent datasets from four academic institutions across the U.S. and Turkey with $R^2$ scores of 0.81-0.90 after minimal fine-tuning. The proposed regression algorithm provides an automated machine-enabled tool with the potential to better characterize in utero neurodevelopment and guide real-time gestational age estimation after the first trimester.
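The three reported regression metrics can be computed as in the sketch below. It uses the standard definitions of the coefficient of determination, mean absolute error, and Lin's concordance correlation coefficient (with population variances); the abstract does not say which variant the authors used, and the function name `regression_metrics` and the sample ages are hypothetical.

```python
from statistics import fmean


def regression_metrics(y_true, y_pred):
    """Return MAE, R^2, and Lin's concordance correlation coefficient."""
    n = len(y_true)
    mean_t, mean_p = fmean(y_true), fmean(y_pred)

    # Mean absolute error, in the same units as the inputs.
    mae = fmean([abs(t - p) for t, p in zip(y_true, y_pred)])

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Lin's CCC penalizes both poor correlation and systematic offset.
    var_t = ss_tot / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    ccc = 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

    return {"mae": mae, "r2": r2, "ccc": ccc}


# Hypothetical gestational ages in weeks (not data from the study).
true_weeks = [20.0, 25.0, 30.0, 35.0]
pred_weeks = [21.0, 24.0, 31.0, 34.0]
print(regression_metrics(true_weeks, pred_weeks))
```

Unlike the Pearson correlation, the concordance coefficient drops when predictions are consistently shifted from the true ages, which is why it is a useful complement to $R^2$ for age estimation.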
Project description:With the rapid improvement of machine translation approaches, neural machine translation has started to play an important role in retrosynthesis planning, which finds reasonable synthetic pathways for a target molecule. Previous studies showed that utilizing the sequence-to-sequence frameworks of neural machine translation is a promising approach to tackle the retrosynthetic planning problem. In this work, we recast the retrosynthetic planning problem as a language translation problem using a template-free sequence-to-sequence model. The model is trained in an end-to-end and a fully data-driven fashion. Unlike previous models translating the SMILES strings of reactants and products, we introduced a new way of representing a chemical reaction based on molecular fragments. It is demonstrated that the new approach yields better prediction results than current state-of-the-art computational methods. The new approach resolves the major drawbacks of existing retrosynthetic methods such as generating invalid SMILES strings. Specifically, our approach predicts highly similar reactant molecules with an accuracy of 57.7%. In addition, our method yields more robust predictions than existing methods.
Project description:Despite recent advances in data acquisition and algorithm development, machine learning (ML) faces tremendous challenges in being adopted for practical catalyst design, largely due to its limited generalizability and poor explainability. Herein, we develop a theory-infused neural network (TinNet) approach that integrates deep learning algorithms with the well-established d-band theory of chemisorption for reactivity prediction of transition-metal surfaces. With simple adsorbates (e.g., *OH, *O, and *N) at active site ensembles as representative descriptor species, we demonstrate that TinNet is on par with purely data-driven ML methods in prediction performance while being inherently interpretable. Incorporation of scientific knowledge of physical interactions into learning from data sheds further light on the nature of chemical bonding and opens up new avenues for ML discovery of novel motifs with desired catalytic properties.
Project description:Background: Accurate prediction of protein-ligand binding affinity is important for lowering the overall cost of drug discovery in structure-based drug design. For accurate predictions, many classical scoring functions and machine learning-based methods have been developed. However, these techniques tend to have limitations, mainly resulting from a lack of sufficient energy terms to describe the complex interactions between proteins and ligands. Recent deep-learning techniques can potentially solve this problem. However, the search for more efficient and appropriate deep-learning architectures and methods to represent protein-ligand complexes is ongoing. Results: In this study, we proposed a deep neural network model to improve the prediction accuracy of protein-ligand complex binding affinity. The proposed model has two important features: descriptor embeddings with information on the local structures of a protein-ligand complex, and an attention mechanism to highlight important descriptors for binding affinity prediction. The proposed model performed better than existing binding affinity prediction models on most benchmark datasets. Conclusions: We confirmed that an attention mechanism can capture the binding sites in a protein-ligand complex to improve prediction performance. Our code is available at https://github.com/Blue1993/BAPA .
Project description:Background: Due to the complexity of biological systems, the prediction of potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide information about the affinity and accessibility of the genome and are commonly used features in binding site prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and speech. Until now, no study has applied this approach to binding site inference from massively parallel sequencing data. The successful applications of the attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors. Results: In this study, we propose a novel tool (named DeepGRN) for transcription factor binding site prediction based on the combination of two components: a single attention module and a pairwise attention module. The performance of our method is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potentially informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN. Conclusions: DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
Project description:Background: Low-dose computed tomography (LDCT) scans can effectively reduce the radiation damage to patients, but the reduced dose is highly detrimental to CT image quality. Deep convolutional neural networks (CNNs) have shown their potential in improving LDCT image quality. However, conventional CNN-based approaches rely fundamentally on convolution operations, which are ineffective for modeling the correlations among nonlocal similar structures and the regionally distinct statistical properties in CT images. This modeling deficiency hampers the denoising performance of such approaches on CT images. Methods: In this paper, we propose an adaptive global context (AGC) modeling scheme to describe the nonlocal correlations and the regionally distinct statistics in CT images with negligible computation load. We further propose an AGC-based long-short residual encoder-decoder (AGC-LSRED) network for efficient LDCT image noise and artifact suppression. Specifically, stacks of residual AGC attention blocks (RAGCBs) with long and short skip connections are constructed in the AGC-LSRED network, which allows valuable structural and positional information to be bypassed through these identity-based skip connections and thus eases the training of the deep denoising network. For training the AGC-LSRED network, we propose a compound loss that combines the L1 loss, adversarial loss, and self-supervised multi-scale perceptual loss. Results: Quantitative and qualitative experimental studies were performed to verify and validate the effectiveness of the proposed method. The simulation experiments demonstrated that the proposed method achieved the best results in terms of noise suppression [root-mean-square error (RMSE) = 9.02; peak signal-to-noise ratio (PSNR) = 33.17] and fine structure preservation [structural similarity index (SSIM) = 0.925] compared with other competitive CNN-based methods. The experiments on real data illustrated that the proposed method has advantages over other methods in terms of radiologists' subjective assessment scores (averaged score = 4.34). Conclusions: With the use of the AGC modeling scheme to characterize the structural information in CT images and of residual AGC attention blocks with long and short skip connections to ease the network training, the proposed AGC-LSRED method achieves satisfactory results in preserving fine anatomical structures and suppressing noise in LDCT images.
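Two of the image-quality metrics named above, RMSE and PSNR, can be sketched as follows using their standard definitions. Images are treated as flat lists of pixel intensities, and `data_range` (the maximum possible intensity span) is an assumption the caller must supply, since the abstract does not state the intensity scale used. SSIM is omitted because it requires windowed local statistics.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equally sized pixel lists."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB: 20 * log10(data_range / RMSE)."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf  # identical images have unbounded PSNR
    return 20.0 * math.log10(data_range / err)


# Hypothetical 2x2 image patches flattened to lists (not data from the study).
clean = [10.0, 20.0, 30.0, 40.0]
denoised = [12.0, 18.0, 33.0, 37.0]
print(rmse(clean, denoised))
print(psnr(clean, denoised, data_range=255.0))
```

Lower RMSE and higher PSNR both indicate that the denoised image is closer to the reference; because PSNR depends on `data_range`, values are only comparable across methods evaluated on the same intensity scale.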
Project description:Drug-target interaction (DTI) prediction has drawn increasing interest due to its important role in the drug discovery process. Many studies have introduced computational models that treat DTI prediction as a regression task and directly predict the binding affinity of drug-target pairs. However, existing studies (i) ignore the essential correlations between atoms when encoding drug compounds and (ii) model the interaction of drug-target pairs simply by concatenation. Based on those observations, in this study, we propose an end-to-end model with multiple attention blocks to predict the binding affinity scores of drug-target pairs. Our proposed model can (i) encode the correlations between atoms with a relation-aware self-attention block and (ii) model the interaction of drug representations and target representations with a multi-head attention block. Experimental results of DTI prediction on two benchmark datasets show that our approach outperforms existing methods, benefiting from the correlation information encoded by the relation-aware self-attention block and the interaction information extracted by the multi-head attention block. Moreover, we conduct experiments on the effect of the maximum relative position length and find that the best values are $k \in \{3, 5\}$. Furthermore, we apply our model to predict the binding affinity of coronavirus disease 2019 (COVID-19)-related genome sequences and $3137$ FDA-approved drugs.
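As a rough illustration of how attention can model drug-target interaction instead of plain concatenation, the sketch below implements generic single-head scaled dot-product attention in pure Python. It is not the authors' relation-aware or multi-head block: the learned projections, multiple heads, and relative-position terms are left out, and treating drug atoms as queries and target residues as keys and values is an assumption made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns (outputs, weights); weights[i][j] is how strongly query i
    attends to key j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical embeddings: two drug atoms (queries), three target residues.
drug_atoms = [[1.0, 0.0], [0.0, 1.0]]
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, attn = attention(drug_atoms, residues, residues)
print(attn)
```

Each row of `attn` sums to one, so every drug atom receives a weighted mixture of residue features; a multi-head variant would run several such maps in parallel on different learned projections and concatenate the results.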
Project description:Quality assessment is essential for the computational prediction and design of RNA tertiary structures. To date, several knowledge-based statistical potentials have been proposed and proved to be effective in identifying native and near-native RNA structures. All these potentials are based on the inverse Boltzmann formula, while differing in the choice of the geometrical descriptor, reference state, and training dataset. Via an approach that diverges completely from the conventional statistical potentials, our work explored the power of a 3D convolutional neural network (CNN)-based approach as a quality evaluator for RNA 3D structures, which used a 3D grid representation of the structure as input without extracting features manually. The RNA structures were evaluated by examining each nucleotide, so our method can also provide local quality assessment. Two sets of training samples were built. The first one included 1 million samples generated by high-temperature molecular dynamics (MD) simulations and the second one included 1 million samples generated by Monte Carlo (MC) structure prediction. Both MD and MC procedures were performed for a non-redundant set of 414 RNAs. For two training datasets (one including only MD training samples and the other including both MD and MC training samples), we trained two neural networks, named RNA3DCNN_MD and RNA3DCNN_MDMC, respectively. The former is suitable for assessing near-native structures, while the latter is suitable for assessing structures covering large structural space. We tested the performance of our method and made comparisons with four other traditional scoring functions. On two of three test datasets, our method performed similarly to the state-of-the-art traditional scoring function, and on the third test dataset, our method was far superior to other scoring functions. Our method can be downloaded from https://github.com/lijunRNA/RNA3DCNN.
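For context on the conventional potentials that this work departs from, the inverse Boltzmann formula assigns an energy $E(r) = -k_B T \ln\left(p_{\text{obs}}(r) / p_{\text{ref}}(r)\right)$ to each bin of a geometrical descriptor. The sketch below applies that formula to histogram counts; the function name, the pseudocount, the choice of $k_B T = 1$, and the sample counts are illustrative assumptions, and it does not represent the CNN-based method described in the abstract.

```python
import math


def inverse_boltzmann(obs_counts, ref_counts, kT=1.0, pseudocount=1e-6):
    """Convert observed and reference histograms into per-bin energies.

    E = -kT * ln(p_obs / p_ref). The pseudocount (an assumption here)
    avoids log(0) and division by zero for empty bins.
    """
    obs_total = sum(obs_counts)
    ref_total = sum(ref_counts)
    energies = []
    for obs, ref in zip(obs_counts, ref_counts):
        p_obs = obs / obs_total + pseudocount
        p_ref = ref / ref_total + pseudocount
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Hypothetical distance histograms over three bins (not real RNA statistics).
observed = [30, 50, 20]
reference = [40, 40, 20]
print(inverse_boltzmann(observed, reference))
```

Bins that occur more often in native structures than in the reference state get negative (favorable) energies. The choice of descriptor, reference state, and training set is exactly where the existing potentials differ, whereas a grid-based CNN learns its scoring directly from the 3D input.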
Project description:Protein Blocks (PBs) are a widely used structural alphabet describing local protein backbone conformation in terms of 16 possible conformational states, adopted by five consecutive amino acids. The representation of complex protein 3D structures as 1D PB sequences was previously successfully applied to protein structure alignment and protein structure prediction. In the current study, we present a new model, PYTHIA (predicting any conformation at high accuracy), for the prediction of protein local conformations in terms of PBs directly from the amino acid sequence. PYTHIA is based on a deep residual inception-inside-inception neural network with convolutional block attention modules, predicting one of 16 PB classes from evolutionary information combined with physicochemical properties of individual amino acids. PYTHIA clearly outperforms the LOCUSTRA reference method for all PB classes and demonstrates great performance for PB prediction on particularly challenging proteins from the CASP14 free modelling category.