Dataset Information

Large-scale event extraction from literature with multi-level gene normalization.

ABSTRACT: Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-Share Alike (CC BY-SA) license.

SUBMITTER: Van Landeghem S

PROVIDER: S-EPMC3629104 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Large-scale event extraction from literature with multi-level gene normalization.

Van Landeghem S, Björne J, Wei CH, Hakala K, Pyysalo S, Ananiadou S, Kao HY, Lu Z, Salakoski T, Van de Peer Y, Ginter F

PLoS One, 2013-04-17 (issue 4)

PMID: 23613707

Similar Datasets

Project description: Background: Entity relation extraction technology can be used to extract entities and relations from medical literature, and automatically establish professional mapping knowledge domains. The classical text classification model, convolutional neural networks for sentence classification (TEXTCNN), has been shown to have good classification performance, but also has a long-distance dependency problem, which is a common problem of convolutional neural networks (CNNs). Recurrent neural networks (RNNs) address the long-distance dependency problem but cannot capture text features at a specific scale in the text. Methods: To solve these problems, this study sought to establish a model with a multi-scale convolutional recurrent neural network for sentence classification (TEXTCRNN) to address the deficiencies in the 2 neural network structures. In entity relation extraction, the entity pair is generally composed of a subject and an object, but as the subject in the entity pair of medical literature is always omitted, it is difficult to use this coding method to obtain general entity position information. Thus, we proposed a new coding method to obtain entity position information to re-establish the relationship between subject and object and complete the entity relation extraction. Results: By comparing the benchmark neural network model and 2 typical multi-scale TEXTCRNN models, the TEXTCRNN [bidirectional long- and short-term memory (BiLSTM)] and TEXTCRNN [double-layer stacking gated recurrent unit (GRU)], the results showed that the multi-scale CRNN model had the best F1 value performance, and the TEXTCRNN (double-layer stacking GRU) was more capable of entity relation classification when the same entity word did not belong to the same entity relation. Conclusions: The experimental results of the entity relation extraction from Pharmacopoeia of the People's Republic of China-Guidelines for Clinical Drug Use-Volume of Chemical Drugs and Biological Products showed that entity relation extraction could effectively proceed using the new labeling method. Additionally, compared to typical neural network models, including the TEXTCNN, GRU, and BiLSTM, the multi-scale convolutional recurrent neural network structure had advantages across several evaluation indicators.

Project description: Many contemporary neuroscience experiments utilize high-throughput approaches to simultaneously collect behavioural data from many animals. The resulting data are often complex in structure and are subject to systematic biases, which require new approaches for analysis and normalization. This study addressed the normalization need by establishing an approach based on linear-regression modeling. The model was established using a dataset of visual motor response (VMR) obtained from several strains of wild-type (WT) zebrafish collected at multiple stages of development. The VMR is a locomotor response triggered by a drastic light change, and is commonly measured repeatedly from multiple larvae arrayed in 96-well plates. This assay is subject to several systematic variations. For example, the light emitted by the machine varies slightly from well to well. In addition to the light-intensity variation, biological replication also created batch-to-batch variation. These systematic variations may result in differences in the VMR and must be normalized. Our normalization approach explicitly modeled the effect of these systematic variations on VMR. It also normalized the activity profiles of different conditions to a common baseline. Our approach is versatile, as it can incorporate different normalization needs as separate factors. The versatility was demonstrated by an integrated normalization of three factors: light-intensity variation, batch-to-batch variation and baseline. After normalization, new biological insights were revealed from the data. For example, we found that larvae of the TL strain at 6 days post-fertilization (dpf) responded to light onset much more strongly than the 9-dpf larvae, whereas previous analysis without normalization showed that their responses were relatively comparable. By removing systematic variations, our model-based normalization can facilitate downstream statistical comparisons and aid in detecting true biological differences in high-throughput studies of neurobehaviour.
