Dataset Information

The identification of informative genes from multiple datasets with increasing complexity.

ABSTRACT:

Background

In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functionally analysing the informative genes.
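Step (2) above can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a plain naive Bayes classifier as the simplest Bayesian classifier, treats "complexity" as the number of predictor genes used (an assumption), and runs on synthetic binary expression data. The function names `make_dataset`, `train_naive_bayes`, `predict` and `accuracy` are hypothetical.

```python
import math
import random
from collections import defaultdict

def make_dataset(n, noise, rng):
    """Synthetic binary data: the target follows gene 0, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        genes = [rng.randint(0, 1) for _ in range(4)]
        target = genes[0] if rng.random() > noise else 1 - genes[0]
        rows.append((genes, target))
    return rows

def train_naive_bayes(rows, k):
    """Fit a naive Bayes model that uses only the first k predictor genes."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (gene index, value, class) -> count
    for genes, target in rows:
        class_counts[target] += 1
        for i in range(k):
            feature_counts[(i, genes[i], target)] += 1
    return class_counts, feature_counts, k

def predict(model, genes):
    """Return the class (0 or 1) with the highest smoothed log-posterior."""
    class_counts, feature_counts, k = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in (0, 1):
        # Laplace smoothing keeps unseen combinations from zeroing the score.
        score = math.log((class_counts[c] + 1) / (total + 2))
        for i in range(k):
            score += math.log(
                (feature_counts[(i, genes[i], c)] + 1) / (class_counts[c] + 2)
            )
        if score > best_score:
            best_class, best_score = c, score
    return best_class

def accuracy(model, rows):
    """Fraction of rows whose target is predicted correctly."""
    return sum(predict(model, g) == t for g, t in rows) / len(rows)

rng = random.Random(0)
simple_data = make_dataset(200, noise=0.05, rng=rng)   # stands in for a controlled system
complex_data = make_dataset(200, noise=0.25, rng=rng)  # stands in for a noisier independent dataset

# Train on the simpler dataset, score each complexity level on the independent one.
scores = {k: accuracy(train_naive_bayes(simple_data, k), complex_data) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The point of the sketch is the selection loop at the end: each candidate complexity is trained on one dataset and judged only on data it has not seen, so an over-complex model gains nothing from fitting noise in the training set.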

Results

In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative genes. We also show that these models explain the myogenesis-related genes (genes of interest) significantly better than other genes (P < 0.004), as the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity, whilst additionally modelling the interactions between genes.

Conclusions

We show that Bayesian networks derived from simpler controlled systems perform better than those trained on datasets from more complex biological systems. Further, we show that genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.

SUBMITTER: Anvar SY

PROVIDER: S-EPMC2822754 | biostudies-literature | 2010 Jan

REPOSITORIES: biostudies-literature


Publications

The identification of informative genes from multiple datasets with increasing complexity.

S Yahya Anvar, Peter A C 't Hoen, Allan Tucker

BMC Bioinformatics, 2010 Jan 15


PMID: 20078860

Similar Datasets

Project description: The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analyses of TCGA datasets have mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 16 cancer subtypes and identified 486 genes that were amplified in two or more datasets. The list was narrowed to 75 cancer-associated genes with potential "druggable" properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 42 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-κB and MAPK signaling pathways. Among the 42 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapters GRB2 and GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the implications for tailoring existing cancer drug targets to patients, and we further discuss potential novel opportunities for drug discovery efforts.

Project description: Background: Multiple myeloma is a cancer which has a high occurrence rate and causes great injury to people worldwide. In recent years, many studies have reported the effects of miRNA on the appearance of multiple myeloma. However, due to differences in samples and sequencing platforms, a large number of inconsistent results have been generated among these studies, which has limited the treatment of multiple myeloma at the miRNA level. Methods: We performed meta-analyses to identify the key miRNA biomarkers which could be applied to the treatment of multiple myeloma. The key miRNAs were determined by overlap comparisons of seven datasets in multiple myeloma. Then, the target genes for key miRNAs were predicted by the software TargetScan. Additionally, functional enrichments and binding TFs were investigated using the DAVID database and the Tfacts database, respectively. Results: Firstly, compared with normal tissues, 13 miRNAs were differentially expressed miRNAs (DEMs) in at least three datasets. They were considered as key miRNAs, with 12 up-regulated (hsa-miR-106b, hsa-miR-125b, hsa-miR-130b, hsa-miR-138, hsa-miR-15b, hsa-miR-181a, hsa-miR-183, hsa-miR-191, hsa-miR-19a, hsa-miR-20a, hsa-miR-221 and hsa-miR-25) and one down-regulated (hsa-miR-223). Secondly, functional enrichment analyses indicated that target genes of the up-regulated miRNAs were mainly transcription factors and enriched in transcription regulation. In addition, these genes were enriched in multiple pathways: the cancer signal pathway, insulin signal metabolic pathway, cell binding molecules, melanin generation, long-term regression and P53 signaling pathway. However, no significant enrichment was found for target genes of the down-regulated miRNA. Due to their distinct regulation function, four miRNAs (hsa-miR-19a, hsa-miR-221, hsa-miR-25 and hsa-miR-223) were ascertained as the potential prognostic and diagnostic markers in MM. Thirdly, transcription factor analysis revealed that 148 TFs and 60 TFs bind target genes of the up-regulated miRNAs and target genes of the down-regulated miRNA, respectively. They respectively generated 652 and 139 reactions of TFs and target genes. Additionally, 50 (31.6%) TFs were shared, while higher specificity was found in TFs of target genes for the up-regulated miRNAs. Discussion: Together, our findings provide the key miRNAs which affect the occurrence of multiple myeloma and the regulation function of these miRNAs. This is valuable for the prognosis and diagnosis of multiple myeloma.
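The overlap criterion used in the multiple myeloma summary above (keeping miRNAs reported in at least three of the seven datasets) can be sketched as a simple count across studies. This is an illustrative assumption about how such an overlap might be computed, not the published pipeline; the function `recurrent_items`, the study names and the miRNA labels are all hypothetical placeholders rather than real data.

```python
from collections import Counter

def recurrent_items(datasets, min_datasets):
    """Return items that appear in at least `min_datasets` of the given datasets."""
    counts = Counter()
    for items in datasets.values():
        # Deduplicate within a dataset so each study votes at most once per item.
        counts.update(set(items))
    return sorted(name for name, n in counts.items() if n >= min_datasets)

# Toy input: placeholder identifiers, not real study results.
toy = {
    "study_A": ["miR-a", "miR-b", "miR-c"],
    "study_B": ["miR-a", "miR-b"],
    "study_C": ["miR-a", "miR-d"],
    "study_D": ["miR-b", "miR-b", "miR-d"],
}

key = recurrent_items(toy, min_datasets=3)
print(key)  # ['miR-a', 'miR-b']
```

The same helper with `min_datasets=2` would correspond to the "amplified in two or more datasets" filter mentioned in the TCGA summary.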
