Dataset Information

Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding.

ABSTRACT: In recent years, data-driven methods and artificial intelligence have been widely used in chemoinformatics and materials informatics, where success depends critically on the availability of large quantities of high-quality training data. A potential approach to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulations. Compared to other domains where natural language processing techniques have established successes, the chemical literature contains a large proportion of multiword phrases, which create additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain that identifies multiword chemical terms and trains word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method effectively and accurately identifies multiword chemical terms from 119,166 chemical patents and preserves the semantic meaning of chemical phrases more robustly and precisely than the conventional approach, which represents the constituent single words first and combines them afterward. Because the accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for using the large volume of chemical literature in future data-driven studies.
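
The phrase-level idea in the abstract can be illustrated with a minimal sketch. It assumes a list of multiword terms is already known and merges each occurrence into a single token before any embedding model is trained. It is not the authors' identification method: the term list PHRASES, the function merge_phrases, and the underscore joining convention are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: the term list and helper below are assumptions,
# not taken from the paper or its dataset.

def merge_phrases(tokens, phrases, max_len=4):
    """Greedy longest-match merge of known multiword terms into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No multiword term starts here; keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical term list for the example.
PHRASES = {
    ("sodium", "chloride"),
    ("acetic", "acid"),
    ("glacial", "acetic", "acid"),
}

sentence = "the salt was dissolved in glacial acetic acid with sodium chloride"
tokens = merge_phrases(sentence.split(), PHRASES)
print(tokens)
```

An embedding model trained on the merged tokens would then treat a phrase such as glacial_acetic_acid as one vocabulary item, rather than composing it from the vectors of its constituent words.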

SUBMITTER: Huang L

PROVIDER: S-EPMC6854573 | biostudies-literature | 2019 Nov

REPOSITORIES: biostudies-literature

Publications

Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding.

Huang Liyuan, Ling Chen

ACS Omega, 2019 Oct 31, issue 20

PMID: 31737809

Similar Datasets

Word and Sentence Embedding Tools to Measure Semantic Similarity of Gene Ontology Terms by Their Definitions.
| S-EPMC6350067 | biostudies-literature

Measuring novelty in science with word embedding.
| S-EPMC8253414 | biostudies-literature

Identify novel elements of knowledge with word embedding.
| S-EPMC10281565 | biostudies-literature

FrameAxis: characterizing microframe bias and intensity with word embedding.
| S-EPMC8323720 | biostudies-literature

Joint embedding VQA model based on dynamic word vector.
| S-EPMC7959642 | biostudies-literature

Impact analysis of keyword extraction using contextual word embedding.
| S-EPMC9202614 | biostudies-literature

Unsupervised Word Embedding Learning by Incorporating Local and Global Contexts.
| S-EPMC7931948 | biostudies-literature

Integrating topic modeling and word embedding to characterize violent deaths.
| S-EPMC8915886 | biostudies-literature

Speech Development Between 30 and 119 Months in Typical Children I: Intelligibility Growth Curves for Single-Word and Multiword Productions.
| S-EPMC9132140 | biostudies-literature