Dataset Information

Building a protein name dictionary from full text: a machine learning term extraction approach.

ABSTRACT:

Background

The majority of information in the biological literature resides in full-text articles rather than in abstracts. Yet abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation because such databases have low coverage of the many name variants that are used to refer to biological entities in the literature.

Results

We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text.
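The first step described above, collecting high-frequency terms in an article, can be illustrated with a minimal sketch. It assumes regex tokenization, word n-grams of up to three tokens, and a frequency threshold of 2; none of these settings come from the paper, and the function name `collect_frequent_terms` is hypothetical. The SVM classification of the candidates is not shown.

```python
import re
from collections import Counter


def collect_frequent_terms(text, max_len=3, min_count=2):
    """Return word n-grams (1..max_len tokens) seen at least min_count times."""
    # Assumed tokenization: runs of letters, digits, and hyphens, lowercased.
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only candidates that reach the (assumed) frequency threshold.
    return {term: c for term, c in counts.items() if c >= min_count}


# Toy "article" for illustration only.
article = (
    "Protein kinase C phosphorylates the receptor. "
    "Activation of protein kinase C requires calcium."
)
frequent = collect_frequent_terms(article)
```

In a pipeline of the kind the abstract describes, candidates such as these would then be passed to a trained classifier that decides which ones are protein names.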

Conclusion

This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.
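A dictionary term lookup of the general kind mentioned above might look like the following sketch. It assumes case-insensitive, greedy longest-match lookup over regex tokens; the paper's actual matching rules are not given in this record, and `find_dictionary_terms` and the toy dictionary are hypothetical.

```python
import re


def find_dictionary_terms(text, dictionary, max_len=4):
    """Greedy longest-match lookup of dictionary terms in tokenized text."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    lowered = {term.lower() for term in dictionary}
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so multi-word names win over substrings.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lowered:
                matches.append(candidate)
                i += n
                break
        else:
            # No dictionary term starts at this token; move on.
            i += 1
    return matches


# Toy dictionary and sentence for illustration only.
names = {"protein kinase C", "PKC", "calmodulin"}
sentence = "PKC, also called protein kinase C, binds calmodulin."
found = find_dictionary_terms(sentence, names)
```

Because the lookup tries longer windows first, a name variant such as "protein kinase C" is reported as one match instead of being split into shorter entries.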

SUBMITTER: Shi L

PROVIDER: S-EPMC1090555 | biostudies-literature | 2005 Apr

REPOSITORIES: biostudies-literature

Publications

Building a protein name dictionary from full text: a machine learning term extraction approach.

Shi Lei, Campagne Fabien

BMC Bioinformatics, 2005-04-07

PMID: 15817129

Similar Datasets

Project description:

Background: Physicians are hesitant to forgo the opportunity of entering unstructured clinical notes in favor of structured data entry in electronic health records. Does free text increase informational value in comparison with structured data?

Objective: This study aims to compare information from unstructured text-based chief complaints harvested and processed by a natural language processing (NLP) algorithm with clinician-entered structured diagnoses in terms of their potential utility for automated improvement of patient workflows.

Methods: Electronic health records of 293,298 patient visits at the emergency department of a Swiss university hospital from January 2014 to October 2021 were analyzed. Using emergency department overcrowding as a case in point, we compared supervised NLP-based keyword dictionaries of symptom clusters from unstructured clinical notes and clinician-entered chief complaints from a structured drop-down menu with the following 2 outcomes: hospitalization and high Emergency Severity Index (ESI) score.

Results: Of 12 symptom clusters, the NLP cluster was a significant predictor of hospitalization in 11 (92%) clusters; 8 (67%) clusters remained significant even after controlling for the cluster of clinician-determined chief complaints in the model. All 12 NLP symptom clusters were significant in predicting a low ESI score, of which 9 (75%) remained significant when controlling for clinician-determined chief complaints. The correlation between NLP clusters and chief complaints was low (r=-0.04 to 0.6), indicating complementarity of information.

Conclusions: The NLP-derived features and clinicians' knowledge were complementary in explaining patient outcome heterogeneity. They can provide an efficient approach to patient flow management, for example, in an emergency medicine setting. We further demonstrated that medical experts can create extensive and precise keyword dictionaries with NLP without requiring programming knowledge. Using the dictionary, we could classify short and unstructured clinical texts into diagnostic categories defined by the clinician.

Project description:

Background: The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the 'Layout-Aware PDF Text Extraction' (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results: Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
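For readers unfamiliar with the precision, recall, and F1 measures reported for block classification above, the sketch below shows how they are commonly computed from raw counts. The counts in the example are invented for illustration and are not taken from the LA-PDFText evaluation; the function name is hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw classification counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or expected.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 blocks labeled correctly, 2 wrongly labeled, 2 missed.
p, r, f1 = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=2)
```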

Project description:

Background: Resilience is an accepted strengths-based concept that responds to change, adversity, and crises. This concept underpins both personal and community-based preventive approaches to mental health issues and shapes digital interventions. Online mental health peer-support forums have played a prominent role in enhancing resilience by providing accessible places for sharing lived experiences of mental issues and finding support. However, little research has been conducted on whether and how resilience is realized, hindering service providers' ability to optimize resilience outcomes.

Objective: This study aimed to create a resilience dictionary that reflects the characteristics and realization of resilience within online mental health peer-support forums. The findings can be used to guide further analysis and improve resilience outcomes in mental health forums through targeted moderation and management.

Methods: A semiautomatic approach to creating a resilience dictionary was proposed using topic modeling and qualitative content analysis. We present a systematic 4-phase analysis pipeline that preprocesses raw forum posts, discovers core themes, conceptualizes resilience indicators, and generates a resilience dictionary. Our approach was applied to a mental health forum run by SANE (Schizophrenia: A National Emergency) Australia, with 70,179 forum posts between 2018 and 2020 by 2357 users being analyzed.

Results: The resilience dictionary and taxonomy developed in this study reveal how resilience indicators (ie, "social capital," "belonging," "learning," "adaptive capacity," and "self-efficacy") are characterized by themes commonly discussed in the forums; each theme's top 10 most relevant descriptive terms and their synonyms; and the relatedness of resilience, reflecting a taxonomy of indicators that are more comprehensive (or compound) and more likely to facilitate the realization of others. The study showed that the resilience indicators "learning," "belonging," and "social capital" were more commonly realized, and "belonging" and "learning" served as foundations for "social capital" and "adaptive capacity" across the 2-year study period.

Conclusions: This study presents a resilience dictionary that improves our understanding of how aspects of resilience are realized in web-based mental health forums. The dictionary provides novel guidance on how to improve training to support and enhance automated systems for moderating mental health forum discussions.
