Dataset Information

Imitating manual curation of text-mined facts in biomedicine.

ABSTRACT: Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

SUBMITTER: Rodriguez-Esteban R

PROVIDER: S-EPMC1560402 | biostudies-literature | 2006 Sep

REPOSITORIES: biostudies-literature

Publications

Imitating manual curation of text-mined facts in biomedicine.

Rodriguez-Esteban R, Iossifov I, Rzhetsky A

PLoS Computational Biology, 2006-07-27, 9

PMID: 16965176

Similar Datasets

A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions. | S-EPMC3842776 | biostudies-literature

Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD). | S-EPMC2768719 | biostudies-literature

Text-mined fossil biodiversity dynamics using machine learning. | S-EPMC6501925 | biostudies-literature

Text-mined dataset of inorganic materials synthesis recipes. | S-EPMC6794279 | biostudies-literature

PGxMine: Text mining for curation of PharmGKB. | S-EPMC6917032 | biostudies-literature

Integration and publication of heterogeneous text-mined relationships on the Semantic Web. | S-EPMC3102890 | biostudies-other

Manual curation is not sufficient for annotation of genomic databases. | S-EPMC2516305 | biostudies-literature

Looking at cerebellar malformations through text-mined interactomes of mice and humans. | S-EPMC2767227 | biostudies-literature

Prediction of protein-destabilizing polymorphisms by manual curation with protein structure. | S-EPMC3506574 | biostudies-literature

Curation of the CANDID-PTX Dataset with Free-Text Reports. | S-EPMC8637219 | biostudies-literature