Dataset Information

Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model.


ABSTRACT: Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Regularly disclosed case reports are a major source of COVID-19 data, but they were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that can estimate key epidemiological statistics in real time, with much less effort for data curation.

SUBMITTER: Wang Z 

PROVIDER: S-EPMC9441477 | biostudies-literature | 2022 Oct

REPOSITORIES: biostudies-literature

Publications

Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model.

Wang Zhizheng, Liu Xiao Fan, Du Zhanwei, Wang Lin, Wu Ye, Holme Petter, Lachmann Michael, Lin Hongfei, Wong Zoie S Y, Xu Xiao-Ke, Sun Yuanyuan

iScience, 2022-09-05, 10


Similar Datasets

| S-EPMC9280463 | biostudies-literature
| S-EPMC4926810 | biostudies-literature
| S-EPMC10791738 | biostudies-literature
| S-EPMC10792687 | biostudies-literature
| S-EPMC9795558 | biostudies-literature
| S-EPMC11913902 | biostudies-literature
| S-EPMC11339500 | biostudies-literature
| S-EPMC5536731 | biostudies-other
| S-EPMC2268932 | biostudies-literature
| S-EPMC8096411 | biostudies-literature