Dataset Information

Text-based over-representation analysis of microarray gene lists with annotation bias.

ABSTRACT: A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach to address this is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free-text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that the constituents have substantially more annotation associated with them, as they have been researched upon for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for such bias. We have therefore developed three approaches to overcome this bias, and demonstrate their usability in a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be possible by mining pre-defined ontologies alone.
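
The classical test the abstract refers to can be sketched as follows. This is a minimal illustration of a plain hypergeometric ORA p-value, not the authors' method: it does not include any of the three bias-correcting approaches the paper describes. The function name `ora_p_value` and the counts in the usage line are hypothetical, chosen only for illustration.

```python
from math import comb


def ora_p_value(population: int, annotated: int, list_size: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background (e.g. all genes on the array)
    annotated:  background genes associated with the term being tested
    list_size:  number of genes in the experimentally derived gene list
    hits:       genes in the list that are associated with the term
    """
    if not (0 <= annotated <= population and 0 <= list_size <= population):
        raise ValueError("annotated and list_size must lie between 0 and population")

    # The overlap can never exceed either the term size or the list size.
    max_hits = min(annotated, list_size)
    start = max(hits, 0)

    # Number of ways to draw list_size genes from the background.
    total = comb(population, list_size)

    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(start, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 20000 background genes, 200 linked to a term,
# a list of 100 genes, 8 of which are linked to the term.
p = ora_p_value(20000, 200, 100, 8)
```

Because this test treats every gene as equally likely to carry an annotation, it has no way to account for the annotation bias described above, which is the limitation the paper sets out to address.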

SUBMITTER: Leong HS

PROVIDER: S-EPMC2699530 | biostudies-literature | 2009 Jun

REPOSITORIES: biostudies-literature

Publications

Text-based over-representation analysis of microarray gene lists with annotation bias.

Leong, Hui Sun; Kipling, David

Nucleic Acids Research, 2009-05-08, issue 11

PMID: 19429895

Similar Datasets

mBISON: Finding miRNA target over-representation in gene lists from ChIP-sequencing data. | S-EPMC4404576 | biostudies-literature

Stability of ranked gene lists in large microarray analysis studies. | S-EPMC2896709 | biostudies-literature

Collaborative representation-based classification of microarray gene expression data. | S-EPMC5728509 | biostudies-literature

Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity. | S-EPMC3654871 | biostudies-other

A-MADMAN: annotation-based microarray data meta-analysis tool. | S-EPMC2711946 | biostudies-literature

LEGO: a novel method for gene set over-representation analysis by incorporating network-based gene weights. | S-EPMC4707541 | biostudies-literature

MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. | S-EPMC3042180 | biostudies-literature

Gene annotation bias impedes biomedical research. | S-EPMC5778030 | biostudies-literature

Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs). | S-EPMC3564935 | biostudies-literature

DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). | S-EPMC9252805 | biostudies-literature