Dataset Information

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

ABSTRACT:

BACKGROUND: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.

RESULTS: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.

CONCLUSIONS: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
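
The F-scores quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes the balanced F-score (harmonic mean of precision and recall); the abstract does not state the exact variant, so that is an assumption, and the function and variable names are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set figures quoted in the abstract, in percent.
reported = {
    "CEM": {"precision": 85.8, "recall": 71.2, "f_score": 77.8},
    "CDI": {"precision": 84.6, "recall": 71.7, "f_score": 77.6},
}

for task, r in reported.items():
    computed = round(f_score(r["precision"], r["recall"]), 1)
    print(f"{task}: computed {computed}, reported {r['f_score']}")
```

Under this assumption, the computed values round to the reported 77.8% (CEM) and 77.6% (CDI).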

SUBMITTER: Akhondi SA

PROVIDER: S-EPMC4331686 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature

Publications

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Akhondi SA, Hettne KM, van der Horst E, van Mulligen EM, Kors JA

Journal of Cheminformatics, 2015-01-19, Suppl 1: Text mining for chemistry and the CHEMDNER trac

PMID: 25810767

Similar Datasets

Chemical named entities recognition: a review on approaches and applications.
| S-EPMC4022577 | biostudies-literature

LeadMine: a grammar and dictionary driven approach to entity recognition.
| S-EPMC4331695 | biostudies-literature

Improving dictionary-based named entity recognition with deep learning.
| S-EPMC11373323 | biostudies-literature

Combining chemical and genetic approaches to increase drought resistance in plants
2017-08-14 | GSE101488 | GEO

Identification of Novel Chemical Entities for Adenosine Receptor Type 2A Using Molecular Modeling Approaches.
| S-EPMC7179438 | biostudies-literature

Combining chemical and genetic approaches to increase drought resistance in plants.
| S-EPMC5662759 | biostudies-literature

Chemical Entities of Biological Interest: an update.
| S-EPMC2808869 | biostudies-literature

Integrating nursing diagnostic concepts into the medical entities dictionary using the ISO Reference Terminology Model for Nursing Diagnosis.
| S-EPMC181989 | biostudies-literature

Context-aware multi-token concept recognition of biological entities.
| S-EPMC8529713 | biostudies-literature

Combining agent-based, trait-based and demographic approaches to model coral-community dynamics.
| S-EPMC7473774 | biostudies-literature