Dataset Information

Efficiency analysis of competing tests for finding differentially expressed genes in lung adenocarcinoma.

ABSTRACT: In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes between two populations. The datasets were randomly split into at least two subsets, each subset was analyzed for differentially expressed genes between the two sample groups, and the resulting gene lists were compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test, over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the top 100 genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all normalization and transformation methods available through caGEDA. The 'best' test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test), was found to be the most consistent and to exhibit the lowest overall classification error and the highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the 'best' test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
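
The overlap comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the caGEDA implementation: the function names `split_in_half` and `percent_overlap`, the fixed seed, and the toy gene identifiers are all assumptions made for the example, and the ranked lists stand in for the output of whichever test (J5, D1, or another) is being evaluated.

```python
import random


def split_in_half(samples, seed):
    """Randomly split a list of sample IDs into two disjoint subsets.

    Hypothetical helper: the abstract only says the data were
    "randomly split into at least two subsets".
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def percent_overlap(ranked_a, ranked_b, top_n):
    """Percentage of the top_n genes shared by two ranked gene lists."""
    top_a = set(ranked_a[:top_n])
    top_b = set(ranked_b[:top_n])
    return 100.0 * len(top_a & top_b) / top_n


# Toy ranked lists, as if produced by the same test on two data subsets.
ranked_split_1 = ["g1", "g2", "g3", "g4"]
ranked_split_2 = ["g2", "g1", "g5", "g6"]

print(percent_overlap(ranked_split_1, ranked_split_2, top_n=4))  # 50.0
```

Under this sketch, a test whose lists overlap by 50% across splits would be preferred to one whose lists overlap by 5%, matching the example given in the abstract.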

SUBMITTER: Jordan R

PROVIDER: S-EPMC2623303 | biostudies-literature | 2008

REPOSITORIES: biostudies-literature

Publications

Efficiency analysis of competing tests for finding differentially expressed genes in lung adenocarcinoma.

Rick Jordan, Satish Patel, Hai Hu, James Lyons-Weiler

Cancer Informatics, 2008-07-14

PMID: 19259419

Similar Datasets

Project description: OBJECTIVE: To analyze the differentially expressed genes (DEGs) between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) with bioinformatics analysis and search for potential biomarkers for clinical diagnosis of non-small cell lung cancer (NSCLC). METHODS: The gene expression profiling datasets of LUAD and LUSC were acquired. The transcriptome differences between LUAD and LUSC were identified using R language processing and t-test analysis. The differential expressions of the genes were shown by Venn diagram. The DEGs identified by GEO2R were analyzed with DAVID and Ingenuity Pathway Analysis (IPA) to identify the signaling pathways and biomarkers that could be used for differential diagnosis of LUAD and LUSC. The TCGA data and the biomarker expression data from clinical lung cancer samples were used to verify the differential expressions of the Osteoarthritis pathway and LXR/RXR between LUAD and LUSC. We further examined the differential expressions of miR-181 and its two target genes, WNT5A and MBD2, in 23 clinical specimens of lung squamous cell carcinoma and the paired adjacent tissues. RESULTS: GEO data analysis identified 851 DEGs (including 276 up-regulated and 575 down-regulated genes) in LUAD and 885 DEGs (including 406 up-regulated and 479 down-regulated genes) in LUSC. DAVID and IPA analysis revealed that leukocyte migration and inflammatory responses were more abundant in LUAD than in LUSC. Osteoarthritis pathway was inhibited in LUAD and activated in LUSC. IPA analysis showed that transcription factors (GATA4, RELA, YBX1, TP63 and MBD2), cytokines (WNT5A and IL1A) and microRNAs (miR-34a, miR-181b and miR-15a) differed significantly between LUAD and LUSC. miR-34a with IL-1A, miR-15a with YBX1, and miR-181b with WNT5A and MBD2 could serve as the paired microRNA and mRNA targets for differential diagnosis of NSCLC subtypes. Analysis of the clinical samples showed an increased expression of miR-181b-5p and the down-regulation of WNT5A, which could be used as molecular markers for the diagnosis of LUSC. CONCLUSIONS: Through transcriptome analysis, we identified candidate genes, paired microRNAs and pathways for differentiating LUAD and LUSC, and they can provide novel differential diagnosis and therapeutic strategies for LUAD and LUSC.

Project description: Background: Lung adenocarcinoma is the main pathological type of non-small cell lung cancer (NSCLC). In this study, we analyzed the gene expression profile of lung adenocarcinoma tumor and paracancerous tissues by bioinformatics to assess the genes and signal pathways related to lung adenocarcinoma. Methods: The expression data of GSE7670, GSE27262, and GSE32863 were downloaded from the Gene Expression Omnibus (GEO) database. The three microarray data sets were integrated to obtain common differential expression genes of lung adenocarcinoma tumor and adjacent tissues. The STRING database was used to construct the protein-protein interaction (PPI) network of lung adenocarcinoma and mine the gene modules and core genes in the network, and the online tools GEPIA and Kaplan-Meier plotter were used to further verify and analyze the core genes. Results: There were 109 pairs of lung adenocarcinoma tissues and matched paracancerous normal lung tissues in the three data sets. Eighty-three differentially expressed genes were identified, including 16 up-regulated and 67 down-regulated genes, and 60 differentially expressed genes were successfully incorporated into the PPI network complex. Eleven core genes were identified in the PPI network complex, including three up-regulated (COMP, SPP1, COL1A1) and eight down-regulated genes (CDH5, CAV1, CLDN5, LYVE1, IL6, VWF, TEK, PECAM1). These core genes were verified by the GEPIA tumor database. Survival analysis showed that expression of the core genes was significantly related to the prognosis of lung adenocarcinoma. KEGG pathway analysis of core genes showed six genes (COMP, SPP1, COL1A1, IL6, VWF, TEK) were significantly enriched in the PI3K-Akt signaling pathway (P=1.62E-06). Conclusions: By analyzing the differential expression genes of lung adenocarcinoma and paracancerous normal tissues with bioinformatics, 11 genes with significant differential expression and significant influence on prognosis were identified. The findings may provide new concepts for developing diagnosis and treatment targets and prognosis markers for lung adenocarcinoma.

Project description: BACKGROUND: This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely, (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, the clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework. RESULTS: The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets, each following two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. First, a three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis on 3 sets of Parkinson's data is performed to demonstrate the scalability of the proposed method to multi-sample microarray datasets. CONCLUSION: This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of p-values and FDR analysis aid in the identification of significant genes which cannot be judged based on gene ranking alone. The three-fold validation, namely, robustness in selection of genes using FDR analysis, clustering, and visualization, demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the proposed approach on microarray datasets with two classes of samples. The scalability of the proposed unified approach to multi-sample (more than two sample classes) microarray datasets is addressed using three sets of Parkinson's data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
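
The p-value fusion step mentioned in this description can be illustrated with a short Python sketch. It assumes that "Fisher's omnibus criterion" refers to the standard Fisher combined probability test, which treats the p-values as independent; the function name `fisher_combined_p` is hypothetical, and this is not the code used in the cited framework.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even
    degrees of freedom the upper-tail probability has a closed form,
    so no statistics library is needed. All p-values must be > 0.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic / 2
    # Upper tail for df = 2k: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two p-values for the same gene, e.g. one from each ranking method.
combined = fisher_combined_p([0.05, 0.05])
print(round(combined, 4))  # 0.0175
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed-form tail probability.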

Project description: Background: To understand the carcinogenesis caused by accumulated genetic and epigenetic alterations and seek novel biomarkers for various cancers, studying differentially expressed genes between cancerous and normal tissues is crucial. In this study, two cDNA libraries of lung cancer were constructed and screened for identification of differentially expressed genes. Methods: Two cDNA libraries of differentially expressed genes were constructed using lung adenocarcinoma tissue and adjacent nonmalignant lung tissue by suppression subtractive hybridization. The data of the cDNA libraries were then analyzed and compared using bioinformatics analysis. Levels of mRNA and protein were measured by quantitative real-time polymerase chain reaction (q-RT-PCR) and western blot respectively, and expression and localization of proteins were determined by immunostaining. Gene functions were investigated using proliferation and migration assays after gene silencing and gene over-expression. Results: Two libraries of differentially expressed genes were obtained. The forward-subtracted library (FSL) and the reverse-subtracted library (RSL) contained 177 and 59 genes, respectively. Bioinformatic analysis demonstrated that these genes were involved in a wide range of cellular functions. The vast majority of these genes were newly identified to be abnormally expressed in lung cancer. In the first stage of the screening for 16 genes, we compared lung cancer tissues with their adjacent non-malignant tissues at the mRNA level, and found six genes (ERGIC3, DDR1, HSP90B1, SDC1, RPSA, and LPCAT1) from the FSL were significantly up-regulated while two genes (GPX3 and TIMP3) from the RSL were significantly down-regulated (P < 0.05). The ERGIC3 protein was also over-expressed in lung cancer tissues and cultured cells, and expression of ERGIC3 was correlated with the degree of differentiation and histological type of lung cancer. The up-regulation of ERGIC3 could promote cellular migration and proliferation in vitro. Conclusions: The two libraries of differentially expressed genes may provide the basis for new insights or clues for finding novel lung cancer-related genes; several genes were newly found in lung cancer, with ERGIC3 appearing to be a novel lung cancer-related gene. ERGIC3 may play an active role in the development and progression of lung cancer.
