Dataset Information

UMLS-based data augmentation for natural language processing of clinical research literature.


ABSTRACT:

Objective

The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

Materials and methods

We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating knowledge from the Unified Medical Language System (UMLS) and called the resulting method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
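To make the idea of knowledge-based augmentation for NER concrete, the following is a minimal sketch of a single operation, synonym replacement on labeled entity spans. It is not the authors' implementation: the `TOY_SYNONYMS` dictionary is a hypothetical stand-in for a UMLS lookup, the `augment` function and the `Condition` label are invented for illustration, and BIO tagging is assumed.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (assumption, not real UMLS data access).
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def augment(tokens, labels, synonyms, rng):
    """Replace each BIO-labeled entity mention that has a known synonym.

    Returns new token and label lists; the replaced span is relabeled so
    that the BIO tags stay aligned with the new tokens.
    """
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            # Find the end of the entity span (B-X followed by I-X tags).
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            mention = " ".join(tokens[i:j]).lower()
            options = synonyms.get(mention)
            # Keep the original span when no synonym is known.
            new_span = rng.choice(options).split() if options else tokens[i:j]
            out_tokens.extend(new_span)
            out_labels.extend(["B-" + etype] + ["I-" + etype] * (len(new_span) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels


tokens = ["Patients", "with", "heart", "attack", "were", "excluded"]
labels = ["O", "O", "B-Condition", "I-Condition", "O", "O"]
new_tokens, new_labels = augment(tokens, labels, TOY_SYNONYMS, random.Random(0))
print(new_tokens)
print(new_labels)
```

Replacing whole entity spans rather than single tokens keeps the augmented sentence consistent with its labels, which is the main constraint when augmenting NER data as opposed to classification data.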

Results

UMLS-EDA substantially improves NER performance over the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 9%, from 0.75 to 0.84, which is better than classifiers with BERT pretraining (0.82).
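For readers unfamiliar with the metric reported above, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below illustrates that calculation; the `micro_f1` function name and the per-class counts are made up for the example and are not taken from the study.

```python
def micro_f1(counts):
    """Compute micro-averaged F1 from per-class (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: class -> (true positives, false positives, false negatives).
toy_counts = {"Condition": (8, 2, 2), "Drug": (1, 1, 3)}
print(round(micro_f1(toy_counts), 4))
```

Because the counts are pooled, frequent classes dominate the score, which is worth keeping in mind when comparing micro-F1 values across datasets with different label distributions.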

Conclusions

This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.

SUBMITTER: Kang T 

PROVIDER: S-EPMC7973470 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC1805499 | biostudies-literature
| S-EPMC6894100 | biostudies-literature
| S-EPMC7771189 | biostudies-literature
| S-EPMC2890296 | biostudies-literature
| S-EPMC7526926 | biostudies-literature
| S-EPMC7797509 | biostudies-literature
| S-EPMC9298308 | biostudies-literature