Dataset Information

Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics.


ABSTRACT:

Background

The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.
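The three kinds of customisation named above (pre-processing, knowledge-rich features, post-processing rules) can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenisation pattern, the suffix list `CHEMICAL_SUFFIXES`, the lexicon `TOY_LEXICON`, and the bracket-trimming rule are all hypothetical stand-ins chosen for illustration. The feature dictionaries it produces are of the general kind a conditional random fields toolkit could consume, but no particular toolkit is assumed.

```python
import re

# Hypothetical resources: illustrative stand-ins, not the authors' actual lists.
CHEMICAL_SUFFIXES = ("ol", "ane", "ene", "ine", "ate", "ide", "one")
TOY_LEXICON = {"ethanol", "aspirin", "glucose"}

# Letter runs, digit runs, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenize(text):
    """Pre-processing: fine-grained split so that '2-propanol' becomes 3 tokens."""
    return TOKEN_PATTERN.findall(text)


def word_shape(token):
    """Map characters to classes (A, a, 0, symbol kept) and collapse repeats."""
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "0"
        else:
            cls = ch
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def token_features(tokens, i):
    """Features for token i: surface form, shape, and toy chemistry cues."""
    token = tokens[i]
    lower = token.lower()
    suffix = next(
        (s for s in CHEMICAL_SUFFIXES if lower.endswith(s) and len(lower) > len(s)),
        "",
    )
    return {
        "lower": lower,
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "chem_suffix": suffix,              # assumed knowledge-rich cue
        "in_lexicon": lower in TOY_LEXICON,  # assumed dictionary lookup
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def trim_unbalanced_brackets(mention):
    """Post-processing heuristic (assumed): drop a bracket that has no partner."""
    while mention.endswith(")") and mention.count("(") < mention.count(")"):
        mention = mention[:-1].rstrip()
    while mention.startswith("(") and mention.count("(") > mention.count(")"):
        mention = mention[1:].lstrip()
    return mention


tokens = tokenize("2-propanol (IPA)")
features = [token_features(tokens, i) for i in range(len(tokens))]
```

In a full pipeline, one such feature dictionary per token would be passed to the sequence labeller, and a rule like `trim_unbalanced_brackets` would be applied to each predicted mention afterwards.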

Results

Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. Compared with state-of-the-art methods under comparable experimental settings, our solution achieves a competitive advantage. We also show that our recogniser, using a model trained on the CHEMDNER corpus, is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.

Conclusion

The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology whose demonstrated performance is competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.

SUBMITTER: Batista-Navarro R 

PROVIDER: S-EPMC4331696 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature

Publications

Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics.

Riza Batista-Navarro, Rafal Rak, Sophia Ananiadou

Journal of Cheminformatics, 2015-01-19, Suppl 1, Text mining for chemistry and the CHEMDNER trac



Similar Datasets

| S-EPMC7872256 | biostudies-literature
| S-EPMC6913757 | biostudies-literature
| S-EPMC3066171 | biostudies-literature
| S-EPMC6956779 | biostudies-literature
| S-EPMC7485218 | biostudies-literature
| S-EPMC5737072 | biostudies-literature
| S-EPMC6247938 | biostudies-literature
| S-EPMC8242017 | biostudies-literature
| S-EPMC6798575 | biostudies-literature
| S-EPMC5018376 | biostudies-literature