Dataset Information

A Novel Text-Mining Approach for Retrieving Pharmacogenomics Associations From the Literature.

ABSTRACT: Text mining of the biomedical literature is an emerging field with demonstrated applications in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code was implemented in the R programming language, using custom scripts where needed and functions from existing libraries otherwise. Articles (abstracts or full texts) matching a specified query were retrieved from PubMed, and concept annotations were obtained from PubTator Central. Terms denoting a Mutation or a Gene, as well as Chemical terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and hyperparameter tuning, four text classifiers were created and evaluated for their performance in identifying pharmacogenomics associations: FastText, linear-kernel SVMs, XGBoost, and Lasso and Elastic-Net regularized generalized linear models. Although further improvements are essential before this text-mining approach can be applied in clinical practice, our study offers a comprehensive, simplified, and up-to-date approach for identifying and assessing research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining to biomedical literature, whose resolution could substantially contribute to the further development of this field.
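
The sentence-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: their pipeline was written in R and is not reproduced on this page. The Python below assumes a simplified annotation format (character offsets plus an entity type) rather than the actual PubTator Central output, uses a naive sentence splitter, and every name in it (Annotation, split_sentences, candidate_sentences, mention) is hypothetical. It keeps only sentences that mention at least one Gene or Mutation together with at least one Chemical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Simplified, hypothetical stand-in for one concept annotation."""
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    kind: str   # "Gene", "Mutation" or "Chemical"


def split_sentences(text):
    """Naive splitter: return (start, end) offsets of each sentence."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        spans.append((start, match.start() + 1))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def candidate_sentences(text, annotations):
    """Keep sentences with a Gene or Mutation mention AND a Chemical mention."""
    kept = []
    for s, e in split_sentences(text):
        kinds = {a.kind for a in annotations if s <= a.start and a.end <= e}
        if kinds & {"Gene", "Mutation"} and "Chemical" in kinds:
            kept.append(text[s:e])
    return kept


def mention(text, word, kind):
    """Build an annotation for the first occurrence of `word` (demo helper)."""
    start = text.index(word)
    return Annotation(start, start + len(word), kind)


# Made-up example text and annotations, for illustration only.
text = (
    "CYP2C19 variants alter clopidogrel response. "
    "Warfarin dosing was recorded. "
    "The study enrolled 200 patients."
)
annotations = [
    mention(text, "CYP2C19", "Gene"),
    mention(text, "clopidogrel", "Chemical"),
    mention(text, "Warfarin", "Chemical"),
]
kept = candidate_sentences(text, annotations)
```

In this made-up example only the first sentence survives, since it is the only one containing both a Gene and a Chemical mention. Sentences filtered this way would then still need labeling and preprocessing before they could serve as training data for the classifiers named in the abstract.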

SUBMITTER: Pandi MT

PROVIDER: S-EPMC7748107 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

A Novel Text-Mining Approach for Retrieving Pharmacogenomics Associations From the Literature.

Maria-Theodora Pandi, Peter J van der Spek, Maria Koromina, George P Patrinos

Frontiers in Pharmacology, 2020-11-10

PMID: 33343371

Similar Datasets

Project description: Background: Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of it remains 'locked' in the unstructured text of biomedical publications, and there is a substantial lag between publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on sentence-level association extraction, with performance evaluated against gold standard text annotations prepared specifically for text mining systems. Results: We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse-level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation-disease associations in curated database records. The discourse-level analysis component of MutD contributed a gain of more than 10% in F-measure compared with sentence-level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators, and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for these defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Conclusions: Our quantitative analysis reveals that MutD can effectively extract protein mutation-disease associations when benchmarked against curated database records. The analysis also demonstrates that incorporating discourse-level analysis significantly improved the performance of extracting protein mutation-disease associations. Future work includes the extension of MutD to full-text articles.

Project description: Dromedary camels are the preferred livestock species in the arid and semi-arid regions of the world. Most of the world's camel populations are managed under a subsistence/extensive system maintained by migratory pastoralists, but intensification is becoming more common. Although the welfare of camels has recently been receiving more attention, many countries have no regulations to protect their health and welfare. The objectives of this article were to explore the main research topics related to camel welfare and their distribution over time, and to highlight research gaps. A literature search was performed to identify records published in English from January 1980 to March 2023 on Dromedary camel welfare via Scopus®, using "Camel welfare," "Camel behaviour," "She-camel" and "Camel management" as search words. A total of 234 records were retained for analysis after automatic and manual screening procedures. Descriptive statistics, text mining (TM) and topic analysis (TA) were performed. The results show that, despite fluctuations between years, records on camel welfare have increased exponentially over time. Asia was the region where most of the corresponding authors were located. The five most frequent word stems were "milk," "calv," "behaviour," "femal," and "breed"; the least frequent was "stabl." TA yielded five most relevant topics: "Calf management and milk production," "Camel health and management system," "Female and male reproduction," "Camel behaviour and feeding," and "Camel welfare." The topics containing the oldest records were "Female and male reproduction" and "Camel health and management system" (1980 and 1983, respectively), while the first article in the topic "Camel behaviour and feeding" was published in 2000. Overall, even though topics related to camel behaviour and welfare are receiving more attention from academia, research is still needed to fully understand how to safeguard welfare in Dromedary camels.
