Dataset Information

Automatic document classification of biological literature.


ABSTRACT:

Background

Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature that marks up full text according to a shallow ontology of terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results

We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. The method first applies a classifier trained with a support vector machine and then a novel, phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters-21578), the clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49) while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
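
As a rough illustration of what phrase-based clustering with self-describing labels can look like, the following minimal Python sketch groups documents by the word n-grams they share and uses each shared phrase as the cluster label. It is not the algorithm from the paper: the function names (`word_ngrams`, `phrase_clusters`), the parameters `n` and `min_docs`, and the toy documents are all assumptions made for this example, and the support vector machine step is left out entirely.

```python
from collections import defaultdict


def word_ngrams(text, n):
    """Return the set of n-word phrases in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_clusters(docs, n=2, min_docs=2):
    """Group document ids by shared n-word phrases.

    Hypothetical helper: each phrase that occurs in at least `min_docs`
    documents becomes a cluster, and the phrase itself is the label.
    """
    by_phrase = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in word_ngrams(text, n):
            by_phrase[phrase].add(doc_id)
    return {p: ids for p, ids in by_phrase.items() if len(ids) >= min_docs}


# Toy documents, invented for illustration only.
docs = {
    "d1": "vulval induction requires ras signaling",
    "d2": "ras signaling controls vulval induction",
    "d3": "germline apoptosis in the adult gonad",
}
clusters = phrase_clusters(docs)
for label, members in sorted(clusters.items()):
    print(label, sorted(members))
```

In this toy run, the two documents that share phrases end up in clusters labeled by those phrases, while the third document, which shares no two-word phrase with the others, stays unclustered. A real system would also need to merge overlapping clusters and rank candidate labels, which this sketch does not attempt.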

Conclusion

We have demonstrated a simple method to classify biological documents that improves on current methods. While the classification results are currently optimized for Caenorhabditis elegans papers through human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

SUBMITTER: Chen D 

PROVIDER: S-EPMC1559726 | biostudies-literature | 2006 Aug

REPOSITORIES: biostudies-literature

Publications

Automatic document classification of biological literature.

Chen D, Müller HM, Sternberg PW

BMC Bioinformatics, 2006 Aug 7



Similar Datasets

| S-EPMC1965490 | biostudies-literature
| S-EPMC4417416 | biostudies-literature
| S-EPMC4008368 | biostudies-literature
| S-EPMC3314711 | biostudies-literature
| S-EPMC2944781 | biostudies-literature
| S-EPMC7731898 | biostudies-literature
| S-EPMC7307763 | biostudies-literature
| S-EPMC3045796 | biostudies-literature
| S-EPMC1435941 | biostudies-literature
| S-EPMC3420236 | biostudies-literature