Dataset Information

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

ABSTRACT: Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.

SUBMITTER: Smalheiser NR

PROVIDER: S-EPMC6557457 | biostudies-literature | 2019 Feb

REPOSITORIES: biostudies-literature

Publications

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Smalheiser Neil R, Cohen Aaron M, Bonifield Gary

Journal of Biomedical Informatics, 2019-01-14

PMID: 30654030

Similar Datasets

Zipf's law holds for phrases, not words. | S-EPMC4531284 | biostudies-literature

Retrofitting Embeddings for Unsupervised User Identity Linkage | S-EPMC7206306 | biostudies-literature

Testing the stem dominance hypothesis: meaning analysis of inflected words and prepositional phrases. | S-EPMC3968051 | biostudies-literature

Brain-to-text: decoding spoken phrases from phone representations in the brain. | S-EPMC4464168 | biostudies-literature

Generalized Shape Metrics on Neural Representations. | S-EPMC10760997 | biostudies-literature

Social media and bitcoin metrics: which words matter. | S-EPMC6837202 | biostudies-literature

Phe2vec: Automated disease phenotyping based on unsupervised embeddings from electronic health records. | S-EPMC8441576 | biostudies-literature

Scalable representations of diseases in biomedical ontologies. | S-EPMC3102895 | biostudies-literature

Neural representation of words within phrases: Temporal evolution of color-adjectives and object-nouns during simple composition. | S-EPMC7932185 | biostudies-literature

Scalable aesthetic transparent wood for energy efficient buildings. | S-EPMC7395769 | biostudies-literature