Dataset Information

PubMed Phrases, an open set of coherent phrases for searching biomedical literature.


ABSTRACT: In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.
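As a rough illustration of the hypergeometric test mentioned in the abstract, the sketch below computes the probability that two terms co-occur in at least `k` documents by chance. The function name `hypergeom_sf`, the document-level counting scheme, and the toy counts are assumptions made for this example; the record does not state how the authors formulated or thresholded the test.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total number of documents (hypothetical corpus size)
    K: documents containing the first term
    n: documents containing the second term
    k: documents in which both terms appear together
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the upper tail of the distribution using exact integer arithmetic.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy counts (invented for illustration, not taken from PubMed):
# 1000 documents, "zinc" in 50, "finger" in 40, both together in 30.
p_value = hypergeom_sf(30, 1000, 50, 40)
print(f"p-value for the toy counts: {p_value:.3e}")
```

A small p-value suggests the terms appear together far more often than independent occurrence would predict, which is the kind of signal a phrase-detection step could use before any retrieval-based filtering.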

SUBMITTER: Kim S 

PROVIDER: S-EPMC5996850 | biostudies-literature | 2018 Jun

REPOSITORIES: biostudies-literature

Publications

PubMed Phrases, an open set of coherent phrases for searching biomedical literature.

Sun Kim, Lana Yeganova, Donald C. Comeau, W. John Wilbur, Zhiyong Lu

Scientific Data, 2018-06-12


Similar Datasets

| S-EPMC3173492 | biostudies-other
| S-EPMC3278758 | biostudies-literature
| S-EPMC10850402 | biostudies-literature
| S-EPMC2850925 | biostudies-literature
| S-EPMC11229257 | biostudies-literature
| S-EPMC3465642 | biostudies-literature
| S-EPMC7951980 | biostudies-literature
| S-EPMC5845379 | biostudies-literature
| S-EPMC8138883 | biostudies-literature
2013-12-23 | E-GEOD-53091 | biostudies-arrayexpress