Dataset Information

A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach.


ABSTRACT:

Motivation

A fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction task is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.

Results

We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combining abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach, used as representation learning, to recognize the broad variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated OLLIE, a well-known information extraction system, to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we set up two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR manual gene-phenotype relationship dataset to test its validity. The results showed that the pipeline covers 70.94% of the original dataset and adds 373 new relations to expand it.

Availability and implementation

The source code is available at http://www.wutbiolab.cn:82/Gene-Phenotype-Relation-Extraction-Pipeline.zip.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Xing W 

PROVIDER: S-EPMC6022650 | biostudies-literature | 2018 Jul

REPOSITORIES: biostudies-literature

Publications

A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach.

Xing Wenhui, Qi Junsheng, Yuan Xiaohui, Li Lin, Zhang Xiaoyu, Fu Yuhua, Xiong Shengwu, Hu Lun, Peng Jing

Bioinformatics (Oxford, England), 2018-07-01, issue 13


Similar Datasets

| S-EPMC4926749 | biostudies-literature
| S-EPMC5086401 | biostudies-literature
| S-EPMC5181565 | biostudies-literature
| S-EPMC6535764 | biostudies-literature
| S-EPMC6061985 | biostudies-literature
| S-EPMC3681788 | biostudies-literature
| S-EPMC2955645 | biostudies-literature
| S-EPMC9263533 | biostudies-literature
| S-EPMC8371836 | biostudies-literature
2013-12-23 | E-GEOD-53091 | biostudies-arrayexpress