Dataset Information

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus.

ABSTRACT:

Background

The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that can successfully distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.

Results

The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.

Conclusions

Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

Graphical Abstract

Multidimensional scaling analysis applied to document vectors derived from titles and abstracts in different corpora. Notably, there is large overlap between the documents in the different ChEMBL versions and BindingDB, while the background MEDLINE set is largely divergent.

SUBMITTER: Papadatos G

PROVIDER: S-EPMC4158272 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Fragment-to-Lead Medicinal Chemistry Publications in 2020. | S-EPMC8762670 | biostudies-literature

What's in a Name? Drug Nomenclature and Medicinal Chemistry Trends using INN Publications. | S-EPMC8154580 | biostudies-literature

Epigenetic Medicinal Chemistry. | S-EPMC4753546 | biostudies-literature

UPCLASS: a deep learning-based classifier for UniProtKB entry publications. | S-EPMC7198315 | biostudies-literature

Medicinal chemistry of cannabinoids. | S-EPMC4918805 | biostudies-literature

Foldamers in Medicinal Chemistry | S-EPMC7271180 | biostudies-literature

Assessing citation integrity in biomedical publications: corpus annotation and NLP models. | S-EPMC11231046 | biostudies-literature

Building a semantically annotated corpus for chronic disease complications using two document types. | S-EPMC7971867 | biostudies-literature

Essential Medicinal Chemistry of Essential Medicines. | S-EPMC8007110 | biostudies-literature