Dataset Information

Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods.


ABSTRACT: The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.
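For readers unfamiliar with the F-score reported in the abstract, the following is a minimal sketch of how precision, recall and F1 can be computed for location mentions against a gold standard. It assumes exact-match scoring over character-offset spans; the abstract does not state which matching criterion the authors used, and the function name `span_f_score` and the example offsets are hypothetical.

```python
def span_f_score(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end) spans.

    Assumption: a predicted mention counts as correct only if both
    offsets match a gold mention exactly. The original study may have
    used a different (e.g. overlapping) criterion.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical character offsets of location mentions in one article.
gold_spans = [(10, 17), (42, 50), (88, 95), (120, 131)]
predicted_spans = [(10, 17), (42, 50), (60, 66)]

precision, recall, f1 = span_f_score(gold_spans, predicted_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a system is rewarded only when it both finds most gold mentions and avoids spurious ones.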

SUBMITTER: Weissenbacher D 

PROVIDER: S-EPMC5543364 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

Publications

Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods.

Davy Weissenbacher, Abeed Sarker, Tasnia Tahsin, Matthew Scotch, Graciela Gonzalez

AMIA Joint Summits on Translational Science Proceedings. 2017-07-26


Similar Datasets

| S-EPMC5338769 | biostudies-literature
| S-EPMC3465209 | biostudies-literature
| S-EPMC7148018 | biostudies-literature
| S-EPMC7415240 | biostudies-literature
| S-EPMC4150992 | biostudies-literature
| S-EPMC8138883 | biostudies-literature
| S-EPMC7787447 | biostudies-literature
| S-EPMC7602664 | biostudies-literature
| S-EPMC7480871 | biostudies-literature
| S-EPMC7302801 | biostudies-literature