Dataset Information

A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation.

ABSTRACT:

Background

Medical terms require specialized expertise and are becoming increasingly complex, which makes it difficult to apply natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, very few reference standards exist for non-English languages. In addition, because the existing reference standards were developed long ago, an updated standard is needed to reflect recent findings in the medical sciences.

Objective

We propose a new Korean word pair reference set for evaluating word embedding models.

Methods

We collected 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles dated from January 2010 to December 2020, and selected the top 10,000 medical terms to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set of 607 word pairs.

Results

The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness scores of the word pairs were highly correlated (ρ=0.70, P<.001). The intraclass correlation coefficients used to assess interrater agreement on the word pair sets were 0.47 for the similarity task and 0.53 for the relatedness task. After excluding word pairs with outlier answers and word pairs answered by fewer than 50% of the respondents, the final reference standard comprised 604 word pairs for the similarity task and 599 word pairs for the relatedness task. When FastText models were applied to the final reference standard word pair sets, the models trained on medical documents showed a higher correlation between their cosine similarity scores and the human-judged similarity and relatedness scores than the namu model (similarity task: namu, ρ=0.12 vs medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs medical text, ρ=0.30).
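The evaluation described above, correlating a model's cosine similarity scores with human-judged scores, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy vectors, the placeholder terms, and the ratings are all hypothetical, and a plain dictionary of vectors stands in for a trained FastText model. Pairs with a word missing from the dictionary are skipped here, which is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(embeddings, rated_pairs):
    """Correlate model cosine similarities with human scores for rated pairs."""
    model_scores, human_scores = [], []
    for word1, word2, human_score in rated_pairs:
        # Assumption of this sketch: skip pairs the model has no vector for.
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(human_score)
    return spearman(model_scores, human_scores)


# Hypothetical toy data: placeholder terms, made-up vectors and ratings.
toy_embeddings = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.8, 0.2, 0.1],
    "term_c": [0.1, 0.9, 0.2],
    "term_d": [0.0, 0.2, 0.9],
}
toy_pairs = [
    ("term_a", "term_b", 3.6),
    ("term_a", "term_c", 1.4),
    ("term_a", "term_d", 0.3),
    ("term_c", "term_d", 1.1),
]
print(f"Spearman rho on toy data: {evaluate(toy_embeddings, toy_pairs):.2f}")
```

In practice the dictionary lookup would be replaced by the trained model's own vector lookup, and the rated pairs by the reference standard's word pairs with their mean human scores.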

Conclusions

Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. We expect that our word pair reference sets will be actively used in the development of medical and multilingual natural language processing technology.

SUBMITTER: Yum Y

PROVIDER: S-EPMC8277378 | biostudies-literature | 2021 Jun

REPOSITORIES: biostudies-literature

Publications

A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation.

Yum Y, Lee JM, Jang MJ, Kim Y, Kim JH, Kim S, Shin U, Song S, Joo HJ

JMIR Medical Informatics, 2021-06-24, issue 6

PMID: 34185005
