Dataset Information

Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model.

ABSTRACT:

Background

Large-scale genomic studies often produce large gene lists, for example, genes that share the same expression pattern. These lists are generally interpreted by extracting the concepts overrepresented in them. This analysis often depends on manual annotation of genes with controlled vocabularies, in particular Gene Ontology (GO). However, annotating genes is labor-intensive, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.

Results

We propose a statistical method that uses the primary literature, i.e., free text, as the source for overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses methodological flaws in several existing programs. We implemented it within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. In experiments with several datasets, our program effectively summarized the important conceptual themes of large gene sets, even when traditional GO-based analysis did not yield informative results.

Conclusions

We conclude that this work provides biologists with a tool that effectively complements existing ones for overrepresentation analysis of genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
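
As a rough illustration of the model family named in the title, the sketch below fits a generic two-component Poisson mixture to count data by expectation-maximization (EM). It is not the authors' BeeSpace implementation, whose details this record does not give. The function names, the initialization, and the toy counts (imagined as occurrences of one term across documents) are all assumptions made for the example.

```python
import math

def poisson_logpmf(k, lam):
    # Log of the Poisson probability mass function at count k with rate lam.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def fit_two_poisson_mixture(counts, n_iter=200):
    """Fit a two-component Poisson mixture by EM (illustrative sketch only)."""
    mean = sum(counts) / len(counts)
    # Hypothetical initialization: one rate below and one above the overall mean.
    rates = [max(0.5 * mean, 1e-3), max(1.5 * mean, 2e-3)]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count.
        resp = []
        for k in counts:
            logs = [math.log(w) + poisson_logpmf(k, lam)
                    for w, lam in zip(weights, rates)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate mixing weights and rates from the responsibilities.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = max(nj / len(counts), 1e-12)
            weighted_sum = sum(r[j] * k for r, k in zip(resp, counts))
            rates[j] = max(weighted_sum / max(nj, 1e-12), 1e-9)
    return weights, rates, resp

# Toy data (invented): mostly low background counts plus a few high counts.
counts = [0, 1, 0, 2, 1, 0, 1, 1, 0, 2] * 2 + [9, 11, 10, 12, 8]
weights, rates, resp = fit_two_poisson_mixture(counts)
print("mixing weights:", weights)
print("component rates:", rates)
```

In this toy setting the low-rate component plays the role of background usage and the high-rate component the role of elevated usage; how such components map onto "overrepresented concepts" in the published method is not specified here.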

SUBMITTER: He X

PROVIDER: S-EPMC2885378 | biostudies-literature | 2010 May

REPOSITORIES: biostudies-literature

Publications

Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model.

Xin He, Moushumi Sen Sarma, Xu Ling, Brant Chee, Chengxiang Zhai, Bruce Schatz

BMC Bioinformatics, 2010-05-20

PMID: 20487560

Similar Datasets

Project description:

Objective

The purpose of this study was to cluster individuals into groups with different dental health characteristics and make statistical inferences on the effect differences among different groups simultaneously to identify the heterogeneity of risk factors in Chinese adolescents by analysing the data from the 4th Chinese National Oral Health Survey.

Methods

For decayed, missing and filled permanent teeth (DMFT), mean values were statistically analysed for their relationships with different categories of all involved variables. As DMFT scores only have discrete values, Poisson mixture regression was adopted to model the heterogeneity and complex patterns in the association and to detect the subgroup. The Bayesian information criterion (BIC) was used to determine the optimal number of subgroups. A series of Wald tests were used to explore the relationship between risk factors including the interaction effects and the number of DMFT.

Results

A total of 100 986 individuals aged 12-15 years old were analysed. The model clustered different individuals into three subgroups and built three submodels for detailed statistical inference simultaneously. The number of individuals in the three subgroups were 52 576 (52.1%), 41 969 (41.5%) and 6441 (6.4%), respectively. The mean (SD) of DMFT of the three subgroups was 0.50 (1.05), 0.99 (1.21), 5.59 (2.50). The model fitting results indicated that the effects of all risk factors on DMFT appear to be different in three subgroups. Controlling the confounding effects, our analysis implied that the regional inequality was the main contributing factor to dental caries among adolescents in Chinese mainland.

Conclusions

The risk factors of dental caries exhibited heterogeneity in groups with different characteristics. The Poisson mixture regression model could cluster individuals into groups and identify the heterogeneous effects of risk factors among different groups. The findings support the need for different targeted interventions and prevention measures in groups with different dental health characteristics.
