Dataset Information

A graph-search framework for associating gene identifiers with documents.

ABSTRACT: One step in the model organism database curation process is to find, for each article, the identifier of every gene discussed in the article. We consider a relaxation of this problem suitable for semi-automated systems, in which each article is associated with a ranked list of possible gene identifiers, and experimentally compare methods for solving this geneId-ranking problem. In addition to baseline approaches based on combining named entity recognition (NER) systems with a "soft dictionary" of gene synonyms, we evaluate a graph-based method which combines the outputs of multiple NER systems, as well as other sources of information, and a learning method for reranking the output of the graph-based method. We show that NER systems with similar F-measure performance can have significantly different performance when used with a soft dictionary for geneId-ranking. The graph-based approach can outperform any of its component NER systems, even without learning, and learning can further improve the performance of the graph-based ranking approach. The utility of an NER system for geneId-finding may not be accurately predicted by its entity-level F1 performance, the most common performance measure. GeneId-ranking systems are best implemented by combining several NER systems. With appropriate combination methods, usefully accurate geneId-ranking systems can be constructed based on easily available resources, without resorting to problem-specific, engineered components.
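
The baseline idea in the abstract (NER output matched against a "soft dictionary" of gene synonyms, then pooled across several NER systems) can be illustrated with a small sketch. This is not the authors' implementation: the token-overlap (Jaccard) similarity, the additive pooling of scores, and the names `soft_match` and `rank_gene_ids` are all assumptions made for illustration, and the graph-search and learned reranking steps are not modelled here.

```python
from collections import defaultdict


def tokens(text):
    """Lower-case a name and split it into a set of tokens."""
    return set(text.lower().replace("-", " ").split())


def soft_match(mention, synonym):
    """Jaccard overlap between token sets (an assumed stand-in for a soft match)."""
    a, b = tokens(mention), tokens(synonym)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_gene_ids(mentions_by_system, synonyms):
    """Rank gene identifiers for one article.

    mentions_by_system: {ner_system_name: [mention strings found in the article]}
    synonyms: {gene_id: [known synonyms]}  (hypothetical soft dictionary)
    Each mention contributes its best synonym match to a gene's score;
    scores are summed over all mentions and all NER systems.
    """
    scores = defaultdict(float)
    for mentions in mentions_by_system.values():
        for mention in mentions:
            for gene_id, names in synonyms.items():
                best = max((soft_match(mention, n) for n in names), default=0.0)
                scores[gene_id] += best
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene_id, score) for gene_id, score in ranked if score > 0]


if __name__ == "__main__":
    # Toy data, invented for the example.
    synonyms = {"G1": ["white eye", "w"], "G2": ["hedgehog"]}
    mentions = {"nerA": ["white"], "nerB": ["white eye", "hedgehog protein"]}
    for gene_id, score in rank_gene_ids(mentions, synonyms):
        print(gene_id, round(score, 3))
```

Summing scores across systems means a gene supported by two NER systems outranks one supported by a single partial match, which is the simplest way to reflect the abstract's point that combining NER systems helps ranking.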

SUBMITTER: Cohen WW

PROVIDER: S-EPMC1617121 | biostudies-other | 2006 Oct

REPOSITORIES: biostudies-other

Publications

A graph-search framework for associating gene identifiers with documents.

Cohen WW, Minkov E

BMC Bioinformatics, 2006 Oct 10

PMID: 17032441

Similar Datasets

Project description: Biomedical semantic indexing is a very useful support tool for human curators in their efforts to index and catalog the biomedical literature. The aim of this study was to describe a system to automatically assign Medical Subject Headings (MeSH) to biomedical articles from MEDLINE. Our approach relies on the assumption that similar documents should be classified by similar MeSH terms. Although previous work has already exploited document similarity by using a k-nearest neighbors algorithm, we represent documents as document vectors by search engine indexing and then compute the similarity between documents using cosine similarity. Once the most similar documents for a given input document are retrieved, we rank their MeSH terms to choose the most suitable set for the input document. To do this, we define a scoring function that takes into account the frequency of the term in the set of retrieved documents and the similarity between the input document and each retrieved document. In addition, we implement guidelines proposed by human curators to annotate MEDLINE articles; in particular, the heuristic that if 3 MeSH terms are proposed to classify an article and they share the same ancestor, they should be replaced by this ancestor. The representation of the MeSH thesaurus as a graph database allows us to employ graph search algorithms to quickly and easily capture hierarchical relationships such as the lowest common ancestor between terms. Our experiments show promising results with an F1 of 69% on the test dataset. To the best of our knowledge, this is the first work that combines search and graph database technologies for the task of biomedical semantic indexing. Due to its horizontal scalability, ElasticSearch is a practical solution for indexing large collections of documents (such as the bibliographic database MEDLINE).
Moreover, the use of graph search algorithms for accessing MeSH information could provide a support tool for cataloging MEDLINE abstracts in real time.
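
The scoring step described above (retrieve the most similar documents by cosine similarity, then score each MeSH term by how often it occurs among them and how similar those documents are) can be sketched as follows. The description does not give the exact scoring function, so the sum-of-similarities score used here is an assumption; plain in-memory term-count vectors stand in for the ElasticSearch index, and `vectorize` and `rank_mesh_terms` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (a stand-in for search-engine document vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_mesh_terms(query_text, indexed_docs, k=2):
    """Rank MeSH terms for query_text from its k most similar indexed documents.

    indexed_docs: list of (text, [mesh_terms]) pairs that are already annotated.
    Assumed score: for each term, the sum of similarities of the neighbours
    carrying it, so both term frequency and document similarity contribute.
    """
    q = vectorize(query_text)
    sims = [(cosine(q, vectorize(text)), terms) for text, terms in indexed_docs]
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, terms in neighbours:
        for term in set(terms):
            scores[term] += sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy corpus, invented for the example.
    docs = [
        ("gene expression in yeast", ["Yeasts", "Gene Expression"]),
        ("yeast gene regulation", ["Yeasts", "Gene Expression Regulation"]),
        ("patient consent ethics", ["Ethics"]),
    ]
    for term, score in rank_mesh_terms("yeast gene expression", docs, k=2):
        print(term, round(score, 3))
```

A term shared by several close neighbours accumulates a higher score than a term seen once, which matches the stated intent of combining frequency with similarity; the ancestor-replacement heuristic and the MeSH graph lookups are left out of this sketch.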

Project description: BACKGROUND: Data have become an essential factor in driving health research and are key to the development of personalized and precision medicine. Primary and secondary use of personal data holds significant potential for research; however, it also introduces a new set of challenges around consent processes, privacy, and data sharing. Research institutions have issued ethical guidelines to address challenges and ensure responsible data processing and data sharing. However, ethical guidelines directed at researchers and medical professionals are often complex; require readers who are familiar with specific terminology; and can be hard to understand for people without sufficient background knowledge in legislation, research, and data processing practices. OBJECTIVE: This study aimed to visually represent an ethics framework to make its content more accessible to its stakeholders. More generally, we wanted to explore the potential of visualizing policy documents to combat and prevent research misconduct by improving the capacity of actors in health research to handle data responsibly. METHODS: We used a mixed methods approach based on knowledge visualization with 3 sequential steps: qualitative content analysis (open and axial coding, among others); visualizing the knowledge structure, which resulted from the previous step; and adding interactive functionality to access information using rapid prototyping. RESULTS: Through our iterative methodology, we developed a tool that allows users to explore an ethics framework for data sharing through an interactive visualization. Our results represent an approach that can make policy documents easier to understand and, therefore, more applicable in practice. CONCLUSIONS: Meaningful communication and understanding each other remain a challenge in various areas of health care and medicine.
We contribute to advancing communication practices through the introduction of knowledge visualization to bioethics to offer a novel way to tackle this relevant issue.
