
Dataset Information


Domain specific word embeddings for natural language processing in radiology.


ABSTRACT:

Background

There has been increasing interest in machine learning-based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora because no radiology-specific corpus has been available.

Purpose

We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task involving radiological text.

Materials and methods

Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and the Benjamini-Hochberg correction, and a 5×2 cross-validation paired two-tailed t-test, were used to assess statistical significance.
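The two labeling metrics named above can be illustrated with a minimal sketch. It assumes multi-label predictions encoded as equal-length 0/1 vectors; the function names and the toy labels are hypothetical and are not taken from the study's code or data.

```python
from typing import Sequence


def exact_match_accuracy(y_true: Sequence[Sequence[int]],
                         y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose whole predicted label vector equals the truth."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    wrong = 0
    total = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("each sample needs the same number of labels")
        wrong += sum(1 for a, b in zip(t, p) if a != b)
        total += len(t)
    return wrong / total


# Toy example (hypothetical labels): 4 articles, 3 possible labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]

print(exact_match_accuracy(y_true, y_pred))  # 2 of 4 vectors match exactly
print(hamming_loss(y_true, y_pred))          # 3 of 12 label slots are wrong
```

Exact match accuracy is the stricter metric, since one wrong label makes the whole sample count as a miss, whereas Hamming loss gives partial credit for each correctly predicted label.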

Results

For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50-D, 100-D, 200-D, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively).
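As an illustration of how paired p-values of this kind can be obtained, the following is a minimal sketch of McNemar's test with continuity correction followed by a Benjamini-Hochberg adjustment. The abstract does not state which comparisons each test was applied to, so the pairing here (two models scored correct/incorrect on the same items) is an assumption, and the function names and counts are hypothetical.

```python
import math
from typing import List, Sequence


def mcnemar_continuity(b: int, c: int) -> float:
    """Two-sided p-value for McNemar's test with continuity correction.

    b: items model A got right and model B got wrong (assumed pairing).
    c: items model A got wrong and model B got right.
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    if b + c == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical discordant counts for three analogy subcategories.
raw = [mcnemar_continuity(10, 2), mcnemar_continuity(15, 1), mcnemar_continuity(6, 5)]
print(benjamini_hochberg(raw))
```

The adjustment matters when several subcategories are tested at once, because it controls the expected share of false discoveries across the whole family of comparisons.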

Conclusion

We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.

SUBMITTER: Chen TL 

PROVIDER: S-EPMC7856086 | biostudies-literature | 2021 Jan

REPOSITORIES: biostudies-literature


Publications

Domain specific word embeddings for natural language processing in radiology.

Chen Timothy L, Emerling Max, Chaudhari Gunvant R, Chillakuru Yeshwant R, Seo Youngho, Vu Thienkhai H, Sohn Jae Ho

Journal of Biomedical Informatics, 2020-12-15


Similar Datasets

| S-EPMC6585427 | biostudies-literature
| S-EPMC3811072 | biostudies-literature
| S-EPMC8176715 | biostudies-literature
| S-EPMC9986939 | biostudies-literature
| S-EPMC7686874 | biostudies-literature
| S-EPMC10130381 | biostudies-literature
| S-EPMC6659158 | biostudies-literature
| S-EPMC4586346 | biostudies-literature
| S-EPMC9391313 | biostudies-literature
| S-EPMC7437899 | biostudies-literature