ABSTRACT:
Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant mentions in natural language (NL) largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6').
Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28-77% of all articles contained mentions only available in NL. Our new method, nala, captured NL and ST mentions by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions, the corresponding value shot up to 100% nala-only.
Availability and implementation: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+
Contact: nala@rostlab.org
Supplementary information: Supplementary data are available at Bioinformatics online.
SUBMITTER: Cejuela JM
PROVIDER: S-EPMC5870606 | biostudies-literature | 2017 Jun
REPOSITORIES: biostudies-literature
Cejuela, Juan Miguel; Bojchevski, Aleksandar; Uhlig, Carsten; Bekmukhametov, Rustem; Kumar Karn, Sanjeev; Mahmuti, Shpend; Baghudana, Ashish; Dubey, Ankit; Satagopam, Venkata P; Rost, Burkhard
Bioinformatics (Oxford, England), 2017-06-01, issue 12