Dataset Information

Text-based over-representation analysis of microarray gene lists with annotation bias.


ABSTRACT: A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that their constituents have substantially more annotation associated with them, because they have been studied for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for it. We have therefore developed three approaches to overcome this bias, and demonstrate their usability on a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be obtainable by mining pre-defined ontologies alone.
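
As an illustration of the classical test the abstract refers to, the following is a minimal sketch of a hypergeometric ORA p-value. It is not the authors' code and does not implement any of their three bias corrections; the function name `ora_p_value`, its parameter names, and the numbers in the usage line are all hypothetical. It computes the probability of seeing at least `hits` annotated genes in a list of `list_size` genes drawn without replacement from a background of `population` genes, of which `annotated` carry the term.

```python
from math import comb


def ora_p_value(population: int, annotated: int, list_size: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background (hypothetical name)
    annotated:  background genes associated with the term
    list_size:  number of genes in the experimental gene list
    hits:       genes in the list that are associated with the term
    """
    if not (0 <= annotated <= population and 0 <= list_size <= population):
        raise ValueError("annotated and list_size must lie in [0, population]")

    # The overlap can never exceed either the term size or the list size.
    max_hits = min(annotated, list_size)
    total = comb(population, list_size)

    # Sum the exact counts of all outcomes with at least `hits` annotated genes.
    # math.comb returns 0 when k > n, so impossible outcomes contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(max(hits, 0), max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    p = ora_p_value(population=20000, annotated=400, list_size=200, hits=12)
    print(f"p = {p:.3g}")
```

This test treats every background gene as equally likely to carry the term. That is the assumption the abstract identifies as problematic when genes in experimentally derived lists are more heavily annotated than the background.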

SUBMITTER: Leong HS 

PROVIDER: S-EPMC2699530 | biostudies-literature | 2009 Jun

REPOSITORIES: biostudies-literature

Publications

Text-based over-representation analysis of microarray gene lists with annotation bias.

Leong HS, Kipling D

Nucleic Acids Research, 2009-05-08, issue 11


Similar Datasets

S-EPMC4404576 | biostudies-literature
S-EPMC2896709 | biostudies-literature
S-EPMC5728509 | biostudies-literature
S-EPMC3654871 | biostudies-other
S-EPMC2711946 | biostudies-literature
S-EPMC4707541 | biostudies-literature
S-EPMC3042180 | biostudies-literature
S-EPMC3564935 | biostudies-literature
S-EPMC5778030 | biostudies-literature
S-EPMC9252805 | biostudies-literature