Dataset Information

State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis.

ABSTRACT: We investigated the effect of different training scenarios on predicting the (retro)synthesis of chemical compounds using a text-like representation of chemical reactions (SMILES) and the Transformer neural network architecture from Natural Language Processing (NLP). We showed that data augmentation, a powerful method used in image processing, eliminated the effect of data memorization by neural networks and improved their performance in predicting new sequences. This effect was observed when augmentation was applied simultaneously to the input and the target data. A top-5 accuracy of 84.8% for the prediction of the largest fragment (thus identifying the principal transformation for classical retrosynthesis) on the USPTO-50k test dataset was achieved by a combination of SMILES augmentation and a beam search algorithm. The same approach provided significantly better results for the prediction of direct reactions from the single-step USPTO-MIT test set. Our model achieved 90.6% top-1 and 96.1% top-5 accuracy for its challenging mixed set and 97% top-5 accuracy for the USPTO-MIT separated set. It also significantly improved results for USPTO-full set single-step retrosynthesis for both top-1 and top-10 accuracies. The appearance frequency of the most abundantly generated SMILES was well correlated with the prediction outcome and can be used as a measure of the quality of reaction prediction.
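
The frequency-based quality measure mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `rank_by_frequency` and `top_k_hit` and the example SMILES are hypothetical, and the sketch assumes the generated SMILES have already been canonicalized by a cheminformatics toolkit so that identical molecules compare equal as strings.

```python
from collections import Counter


def rank_by_frequency(predictions):
    """Rank candidate SMILES by how often they were generated.

    `predictions` is a flat list of SMILES strings collected over all
    augmented inputs of one reaction (assumed already canonicalized).
    Returns (smiles, relative_frequency) pairs, most frequent first.
    """
    if not predictions:
        return []
    counts = Counter(predictions)
    total = len(predictions)
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [(smi, n / total) for smi, n in counts.most_common()]


def top_k_hit(ranked, reference, k):
    """Return True if `reference` is among the k most frequent candidates."""
    return reference in [smi for smi, _ in ranked[:k]]


# Hypothetical model outputs for five augmented inputs of one reaction.
generated = ["CCO", "CCO", "CC=O", "CCO", "CCN"]
ranked = rank_by_frequency(generated)

# The relative frequency of the top candidate serves as a rough
# confidence score for the prediction.
best_smiles, confidence = ranked[0]
```

Averaging `top_k_hit` over a test set would give a top-k accuracy in the sense used in the abstract, under the assumption that candidates are ranked by frequency alone.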

SUBMITTER: Tetko IV

PROVIDER: S-EPMC7643129 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature

Publications

State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis.

Igor V. Tetko, Pavel Karpov, Ruud Van Deursen, Guillaume Godin

Nature Communications, 2020 Nov 4, issue 1

PMID: 33149154

Similar Datasets

Chemoenzymatic multistep retrosynthesis with transformer loops.
| S-EPMC11474389 | biostudies-literature

Single-step retrosynthesis prediction by leveraging commonly preserved substructures.
| S-EPMC10147675 | biostudies-literature

Single-step retrosynthesis prediction via multitask graph representation learning.
| S-EPMC11742932 | biostudies-literature

County augmented transformer for COVID-19 state hospitalizations prediction.
| S-EPMC10282074 | biostudies-literature

Improving the performance of models for one-step retrosynthesis through re-ranking.
| S-EPMC8922884 | biostudies-literature

SiGra: single-cell spatial elucidation through an image-augmented graph transformer.
| S-EPMC10497630 | biostudies-literature

G²Retro as a two-step graph generative models for retrosynthesis prediction.
| S-EPMC10229662 | biostudies-literature

Node-Aligned Graph-to-Graph: Elevating Template-free Deep Learning Approaches in Single-Step Retrosynthesis.
| S-EPMC10976575 | biostudies-literature

Unbiasing Retrosynthesis Language Models with Disconnection Prompts.
| S-EPMC10390024 | biostudies-literature

Mouse models of ciliopathies: the state of the art.
| S-EPMC3339824 | biostudies-literature