Dataset Information

Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier.

ABSTRACT: We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine's online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.
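The character-by-character processing described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the special tokens, the helper names (`build_vocab`, `encode`, `decode`), and the single methane example pair are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the paper's actual vocabulary is not given here.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings: Iterable[str]) -> Dict[str, int]:
    """Map every distinct character (plus special tokens) to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Split text into single characters and wrap it in start/end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode(), dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, SOS, EOS))


# One illustrative source/target pair: the standard InChI and IUPAC name of methane.
inchi = "InChI=1S/CH4/h1H4"
name = "methane"

source_vocab = build_vocab([inchi])
target_vocab = build_vocab([name])

source_ids = encode(inchi, source_vocab)  # input to the encoder stack
target_ids = encode(name, target_vocab)   # sequence the decoder learns to emit
```

In a setup like the one described, the encoder would consume `source_ids` and the decoder would be trained to produce `target_ids` one character at a time. Working at the character level keeps the vocabulary small, at the cost of longer sequences than word or sub-word tokenization would give.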

SUBMITTER: Handsel J

PROVIDER: S-EPMC8496104 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Exploratory machine-learned theoretical chemical shifts can closely predict metabolic mixture signals.

Project description:Various chemical shift predictive methodologies have been studied and developed, but there remains the problem of prediction accuracy. Assigning the NMR signals of metabolic mixtures requires high predictive performance owing to the complexity of the signals. Here we propose a new predictive tool that combines quantum chemistry and machine learning. A scaling factor as the objective variable to correct the errors of 2355 theoretical chemical shifts was optimized by exploring 91 machine learning algorithms and using the partial structure of 150 compounds as explanatory variables. The optimal predictive model gave RMSDs between experimental and predicted chemical shifts of 0.2177 ppm for δ 1H and 3.3261 ppm for δ 13C in the test data; thus, better accuracy was achieved compared with existing empirical and quantum chemical methods. The utility of the predictive model was demonstrated by applying it to assignments of experimental NMR signals of a complex metabolic mixture.

| S-EPMC6240814 | biostudies-other

Neural machine translation of chemical nomenclature between English and Chinese.

Project description:Machine translation of chemical nomenclature has considerable potential for application in processing chemical text data across languages. However, rule-based machine translation tools face significant complications in building rule sets, especially for translating chemical names between English and Chinese, the two most widely used languages of chemical nomenclature in the world. We applied two types of neural networks to the task of translating chemical nomenclature between English and Chinese, and compared them with an existing rule-based machine translation tool. The results show that deep-learning-based approaches have a good chance of surpassing rule-based tools in machine translation of chemical nomenclature between English and Chinese.

| S-EPMC7460765 | biostudies-literature

Translating Translation to Mechanisms of Cardiac Hypertrophy.

Project description:Cardiac hypertrophy in response to chronic pathological stress is a common feature occurring with many forms of heart disease. This pathological hypertrophic growth increases the risk for arrhythmias and subsequent heart failure. While several factors promoting cardiac hypertrophy are known, the molecular mechanisms governing the progression to heart failure are incompletely understood. Recent studies on altered translational regulation during pathological cardiac hypertrophy are contributing to our understanding of disease progression. In this brief review, we describe how the translational machinery is modulated for enhanced global and transcript selective protein synthesis, and how alternative modes of translation contribute to the disease state. Attempts at controlling translational output through targeting of mTOR and its regulatory components are detailed, as well as recently emerging targets for pre-clinical investigation.

| S-EPMC7151157 | biostudies-literature

CHEMDNER: The drugs and chemical names extraction challenge.

Project description:Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improve the access and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). 27 teams (23 academic and 4 commercial, a total of 87 researchers) returned results for the CHEMDNER tasks: 26 teams for CEM and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact in future developments of chemical text mining applications and will form the basis to find related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.

| S-EPMC4331685 | biostudies-literature

UniChem: a unified chemical structure cross-referencing and identifier tracking system.

Project description:UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

| S-EPMC3616875 | biostudies-literature

A Meta-Model to Predict the Drag Coefficient of a Particle Translating in Viscoelastic Fluids: A Machine Learning Approach.

Project description:This study presents a framework based on Machine Learning (ML) models to predict the drag coefficient of a spherical particle translating in viscoelastic fluids. For the purpose of training and testing the ML models, two datasets were generated using direct numerical simulations (DNSs) for the viscoelastic unbounded flow of Oldroyd-B (OB-set containing 12,120 data points) and Giesekus (GI-set containing 4950 data points) fluids past a spherical particle. The kinematic input features were selected to be Reynolds number, 0<Re≤50, Weissenberg number, 0≤Wi≤10, polymeric retardation ratio, 0<ζ<1, and shear thinning mobility parameter, 0<α<1. The ML models, specifically Random Forest (RF), Deep Neural Network (DNN) and Extreme Gradient Boosting (XGBoost), were all trained, validated, and tested, and their best architecture was obtained using a 10-Fold cross-validation method. All the ML models presented remarkable accuracy on these datasets; however the XGBoost model resulted in the highest R2 and the lowest root mean square error (RMSE) and mean absolute percentage error (MAPE) measures. Additionally, a blind dataset was generated using DNSs, where the input feature coverage was outside the scope of the training set or interpolated within the training sets. The ML models were tested against this blind dataset, to further assess their generalization capability. The DNN model achieved the highest R2 and the lowest RMSE and MAPE measures when inferred on this blind dataset. Finally, we developed a meta-model using stacking technique to ensemble RF, XGBoost and DNN models and output a prediction based on the individual learner's predictions and a DNN meta-regressor. The meta-model consistently outperformed the individual models on all datasets.

| S-EPMC8838701 | biostudies-literature

DRDB: A Machine Learning Platform to Predict Chemical-Protein Interactions towards Diabetic Retinopathy.

Project description:Diabetic retinopathy (DR), a diabetic microangiopathy caused by diabetes, affects approximately 93 million people worldwide. However, the drugs used to treat DR have limited efficacy and a variety of side effects. This is possibly because the complicated pathogenesis of DR is associated with multiple proteins. In this work, we attempted to identify potential drugs against DR-associated proteins and predict potential targets for drugs using in silico prediction of chemical-protein interactions (CPI) based on the multitarget quantitative structure-activity relationship (mt-QSAR) method. To this end, we developed 128 binary classifiers to predict the CPI for 15 DR targets using random forest (RF), k-nearest neighbours (KNN), support vector machine (SVM), and neural network (NN) algorithms with MACCS and extended connectivity (ECFP6) fingerprints, and protein descriptors. To facilitate the discovery of novel drugs and target identification using the 128 binary classifiers, a free web server (DRDB) was developed. Compound Danshen Dripping Pills (CDDP), composed of Salvia miltiorrhiza, Panax notoginseng, and borneol, is commonly used in the treatment of cardiovascular diseases. To explore the applicability of DRDB, the potential CPIs of CDDP in the treatment of DR were investigated based on DRDB. In vitro experimental validation demonstrated that cryptotanshinone and protocatechuic acid, two key components of CDDP, are capable of targeting ICAM-1, which is one of the key targets of DR. We hope that this work can facilitate the development of more effective clinical strategies for the treatment of DR.

| S-EPMC9329024 | biostudies-literature

Adapting machine-learning algorithms to design gene circuits.

Project description:Background: Gene circuits are important in many aspects of biology, and perform a wide variety of different functions. For example, some circuits oscillate (e.g. the cell cycle), some are bistable (e.g. as cells differentiate), some respond sharply to environmental signals (e.g. ultrasensitivity), and some pattern multicellular tissues (e.g. Turing's model). Often, one starts from a given circuit, and using simulations, asks what functions it can perform. Here we want to do the opposite: starting from a prescribed function, can we find a circuit that executes this function? Whilst simple in principle, this task is challenging from a computational perspective, since gene circuit models are complex systems with many parameters. In this work, we adapted machine-learning algorithms to significantly accelerate gene circuit discovery. Results: We use gradient-descent optimization algorithms from machine learning to rapidly screen and design gene circuits. With this approach, we found that we could rapidly design circuits capable of executing a range of different functions, including those that: (1) recapitulate important in vivo phenomena, such as oscillators, and (2) perform complex tasks for synthetic biology, such as counting noisy biological events. Conclusions: Our computational pipeline will facilitate the systematic study of natural circuits in a range of contexts, and allow the automatic design of circuits for synthetic biology. Our method can be readily applied to biological networks of any type and size, and is provided as an open-source and easy-to-use python module, GeneNet.

| S-EPMC6487017 | biostudies-literature

Translating Metaphtonymy: Exploring Trainee Translators' Translation Approaches and Underlying Factors.

Project description:Metaphtonymy is a special rhetorical figure that specifies the interaction between metaphor and metonymy and is pervasive in literary works. How and why do trainee translators translate metaphtonymy? Using task analysis, semi-structured discourse-based interviews, and a questionnaire survey among 30 master of translation and interpreting (MTI) trainee translators, this study investigates the approaches they adopted when translating metaphtonymies in extracted Chinese prose and explores the effects of their choices. It is found that they mainly employed three approaches: omission, modification, and retainment, with omission being the most frequent and retainment the least. The main factors contributing to each approach range from the prominence degrees and cross-cultural adaptation abilities of the metaphtonymies, the rhetorical awareness of translators, and transference competence to their translation knowledge sub-competence. This study suggests that trainee translators should be instructed to systematically construct rhetorical knowledge, and that teaching design should emphasize trainees' competence in identifying rhetorical devices and in shifting rhetoric between languages.

| S-EPMC8278523 | biostudies-literature

PSBP-SVM: A Machine Learning-Based Computational Identifier for Predicting Polystyrene Binding Peptides.

Project description:Polystyrene binding peptides (PSBPs) play a key role in the immobilization process. The correct identification of PSBPs is the first step of all related work. In this paper, we propose a novel support vector machine-based bioinformatic identification model. This model contains four machine learning steps: feature extraction, feature selection, model training, and optimization. In a five-fold cross-validation test, this model achieves a sensitivity (SN) of 90.38%, a specificity (SP) of 84.62%, an accuracy (ACC) of 87.50%, and an AUC of 0.90. The model outperforms the state-of-the-art identifier in terms of SN and ACC with a smaller feature set. Furthermore, we constructed a web server that includes the proposed model, which is freely accessible at http://server.malab.cn/PSBP-SVM/index.jsp.

| S-EPMC7137786 | biostudies-literature
