Dataset Information

A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression.

ABSTRACT:

Background

Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among genes, which is key to understanding the molecular perturbations in cancer, has rarely been considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification.

Results

We have developed a new clustering algorithm for cancer subtype identification, called "network-assisted co-clustering for the identification of cancer subtypes" (NCIS). NCIS incorporates gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. A new weighted co-clustering algorithm based on semi-nonnegative matrix tri-factorization is then applied. We evaluated the effectiveness of NCIS on simulated datasets as well as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. Compared with consensus hierarchical clustering, NCIS better separated the patient samples into clinically distinct subtypes and, on the simulated datasets, achieved higher accuracy and tolerated noise better.
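
The gene-weighting step can be illustrated with a minimal sketch. The abstract says only that weights reflect each gene's impact in the network; the PageRank-style power iteration below is an assumed scoring rule chosen for illustration, not necessarily the one NCIS uses. The function name `network_gene_weights`, the damping value, and the toy matrices are all hypothetical, and the row-wise scaling at the end is just one simple way to apply the weights, not the weighted tri-factorization itself.

```python
import numpy as np

def network_gene_weights(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Score each gene by a PageRank-style power iteration (an assumed scoring rule)."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    degree = A.sum(axis=0)
    safe_degree = np.where(degree > 0, degree, 1.0)
    # Column-normalise the adjacency matrix; an isolated gene spreads its mass uniformly.
    P = np.where(degree > 0, A / safe_degree, 1.0 / n)
    w = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        w_next = damping * (P @ w) + (1.0 - damping) / n
        if np.abs(w_next - w).sum() < tol:
            return w_next
        w = w_next
    return w

# Toy network (hypothetical): gene 0 interacts with genes 1, 2 and 3.
toy_network = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
gene_weights = network_gene_weights(toy_network)

# Toy expression matrix (genes x samples), scaled row-wise by the gene weights.
toy_expression = np.array([
    [2.0, -1.0, 0.5],
    [0.3, 0.8, -0.2],
    [-1.1, 0.4, 0.9],
    [0.7, -0.6, 0.1],
])
weighted_expression = gene_weights[:, None] * toy_expression
```

In this toy network the highly connected gene receives the largest weight, so its expression values contribute most to any distance or reconstruction error computed on `weighted_expression`.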

Conclusions

The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data.

SUBMITTER: Liu Y

PROVIDER: S-EPMC3916445 | biostudies-literature | 2014 Feb

REPOSITORIES: biostudies-literature

Publications

A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression.

Liu Y, Gu Q, Hou JP, Han J, Ma J

BMC Bioinformatics, 2014-02-04

PMID: 24491042

Similar Datasets

Project description: Background: Diabetic nephropathy (DN) is the major complication of diabetes mellitus and the leading cause of end-stage renal disease. The underlying molecular mechanism of DN is not yet completely clear. The aim of this study was to analyze a DN microarray dataset using the weighted gene co-expression network analysis (WGCNA) algorithm to better understand DN pathogenesis and to explore key genes in disease progression. Methods: The differentially expressed genes (DEGs) identified in DN dataset GSE47183 were passed to the WGCNA algorithm to construct co-expression modules. The STRING database was used to construct protein-protein interaction (PPI) networks of the genes in all modules, and the hub genes were identified considering both the degree centrality in the PPI networks and the ranked lists of weighted networks. Gene ontology and Reactome pathway enrichment analyses were performed on each module to understand their involvement in biological processes and pathways. Following validation of the hub genes in another DN dataset (GSE96804), their upstream regulators, including microRNAs and transcription factors, were predicted, and a regulatory network comprising all these molecules was constructed. Results: After normalization and analysis of the dataset, 2475 significant DEGs were identified and clustered into six different co-expression modules by the WGCNA algorithm. The DEGs of each module were then subjected to functional enrichment analyses and PPI network construction. Metabolic processes, cell cycle control, and apoptosis were among the top enriched terms. In the next step, 23 hub genes were identified among the modules, and five of them (FN1, SLC2A2, FABP1, EHHADH and PIPOX) were validated in another DN dataset. In the regulatory network, FN1 was the most affected hub gene, and mir-27a and REAL were recognized as the two main upstream regulators of the hub genes. Conclusions: The hub genes identified at the hearts of the co-expression modules could widen our understanding of DN development and might be targets of future investigations exploring their therapeutic potential for treatment of this complicated disease.
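
Degree centrality, one of the two criteria the description gives for picking hub genes, can be sketched as follows. This is a minimal illustration on a made-up edge list with placeholder gene names (`G1` to `G4`); the helper names `degree_centrality` and `top_hubs` are hypothetical, and the sketch ignores the second criterion (the ranked lists of the weighted networks).

```python
def degree_centrality(edges):
    """Fraction of the other nodes each node is linked to, from an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {gene: 0.0 for gene in neighbours}
    return {gene: len(partners) / (n - 1) for gene, partners in neighbours.items()}

def top_hubs(edges, k):
    """Return the k genes with the highest degree centrality (ties broken by name)."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda gene: (-scores[gene], gene))[:k]

# Toy PPI edge list with placeholder gene names (not real interactions).
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3")]
centrality = degree_centrality(toy_edges)
hubs = top_hubs(toy_edges, 2)
```

Here `G1` is linked to every other gene, so it ranks first; a real analysis would combine such a ranking with the module-level evidence before calling a gene a hub.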

Project description: One of the most common smoking-related diseases, chronic obstructive pulmonary disease (COPD), results from a dysregulated, multi-tissue inflammatory response to cigarette smoke. We hypothesized that systemic inflammatory signals in genome-wide blood gene expression can identify clinically important COPD-related disease subtypes, and we leveraged pre-existing gene interaction networks to guide unsupervised clustering of blood microarray expression data. Using network-informed non-negative matrix factorization, we analyzed genome-wide blood gene expression from 229 former smokers in the ECLIPSE Study, and we identified novel, clinically relevant molecular subtypes of COPD. These network-informed clusters were more stable and more strongly associated with measures of lung structure and function than clusters derived from a network-naïve approach, and they were associated with subtype-specific enrichment for inflammatory and protein catabolic pathways. These clusters were successfully reproduced in an independent sample of 135 smokers from the COPDGene Study. Briefly, gene expression was derived from whole blood samples in ECLIPSE subjects and peripheral blood mononuclear cells (PBMCs) for the COPDGene subjects. Gene expression profiling was performed using the Affymetrix Human U133 Plus2 array. Gene expression data were log-transformed, and background correction and normalization were performed for the merged ECLIPSE and COPDGene samples using robust multi-array averaging and quantile normalization as implemented in the affy Bioconductor package[27]. Of the 136 COPDGene subjects reported in a previous publication[13], one self-reported African-American subject was removed from analysis, which was conducted on the remaining 135 non-Hispanic white subjects. To identify a set of genes associated with COPD, we performed differential expression analysis for 38,519 probesets in ECLIPSE that passed quality control measures. Normalized probeset intensities were related to measures indicative of two primary dimensions of pulmonary impairment in COPD: airway obstruction, as indicated by two measures of spirometric lung function (FEV1 (% of predicted) and FEV1/FVC), and lung parenchymal destruction, i.e., emphysema (as quantified by the percentage of low attenuation area less than -950 Hounsfield units on lung computed tomography, %LAA-950). The analysis was conducted using the limma Bioconductor package, and the false discovery rate was controlled at 5%. The following covariates were included in the differential expression analysis: age, pack-years of cigarette smoke exposure, and gender. After standardizing gene expression data from 229 ECLIPSE subjects by the variance of each probe set, we applied NMF[29] and NBS[6] to identify meta-patients (i.e. subtypes or subject clusters) and meta-genes (i.e. representative subtype expression profiles). Cross-sectional study of smokers. 229 subjects from the ECLIPSE study were analyzed in the model discovery phase. 135 subjects from the COPDGene Study (GSE42057) were used for replication. Please note that the entire data set for all 364 samples, including the re-analyzed samples, is provided in the *364samples.txt files.
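
The meta-gene and meta-patient decomposition mentioned above can be sketched with plain NMF. This is a minimal sketch assuming the standard multiplicative updates for the Frobenius objective; it is the network-naïve baseline, not the network-informed NBS variant, and the function name `nmf`, the toy matrix, and the rank are hypothetical choices for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative genes x samples matrix as X ~ W @ H (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, rank)) + eps    # columns act as meta-genes
    H = rng.random((rank, n_samples)) + eps  # columns describe samples (meta-patients)
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Rescale so each meta-gene sums to 1; the product W @ H is unchanged.
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]

# Toy non-negative expression matrix: two groups of genes, two groups of samples.
X = np.array([
    [5.0, 5.0, 5.0, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
])
W, H = nmf(X, rank=2)
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# One common convention: assign each sample to its dominant component.
sample_clusters = H.argmax(axis=0)
```

The factorization is only defined up to scaling and permutation of the components, which is why the sketch rescales `W` before reading cluster labels off `H`.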

Project description: A Vehicular Ad hoc Network (VANET) suffers from the loss of perilous data packets and disruption of links due to the fast movement of vehicles and the dynamic network topology. Moreover, the reliability of the vehicular network is also threatened by malicious vehicles and messages. A malicious vehicle can promulgate fake messages to a node to misguide it, which may result in the loss of precious lives. In this situation, maintaining efficient, reliable, and secure communication among automobiles is of extreme importance, especially for a densely populated network. One of the remedies is vehicular clustering, which can perform effectively in a high-density network. However, secure cluster formation and cluster optimization are important factors to consider during the clustering process, because non-optimal clusters may incur high end-to-end communication delays and produce overhead on the network. In addition, malicious nodes and packets reduce passenger and driver safety, increase road accidents, and waste passenger and driver time. To this end, we employ the Arithmetic Optimization Algorithm (AOA) to design a secure intelligent clustering scheme named AOACNET. AOA is used to achieve optimality of vehicular clusters. During cluster formation, the algorithm prevents unauthentic nodes from becoming cluster members by taking into consideration the performance value of each automobile. The vehicle's performance value is based on its record of data transmission. If a vehicle transmits a fake message, it receives a penalty of (-1), and if it transmits a legitimate message, a reward of (+1) is assigned to the vehicle. Initially, all the vehicles have an equal performance value, which then increases or decreases based on communication with their peers. A vehicle becomes a cluster member only if its performance value is greater than the threshold value (0). AOACNET is tested in MATLAB using various evaluation metrics (i.e., number of clusters, load balancing, computational time, network overhead and delay). The simulation results show that the proposed algorithm performs up to 25% better than similar contenders in terms of the designated optimization objectives.
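
The reward and penalty rule described above can be written down in a few lines. This is a minimal sketch of the bookkeeping only, not of AOACNET or the AOA optimizer; the class and function names are hypothetical, and the starting value of 1 is an assumption, since the description says only that all vehicles start with an equal value.

```python
from dataclasses import dataclass

THRESHOLD = 0  # membership threshold stated in the description

@dataclass
class Vehicle:
    vehicle_id: str
    performance: int = 1  # assumed starting value; the source only says "equal"

def record_message(vehicle: Vehicle, legitimate: bool) -> None:
    """Apply +1 for a legitimate message and -1 for a fake one."""
    vehicle.performance += 1 if legitimate else -1

def can_join_cluster(vehicle: Vehicle) -> bool:
    """A vehicle is eligible only while its performance value exceeds the threshold."""
    return vehicle.performance > THRESHOLD

honest = Vehicle("car-A")
malicious = Vehicle("car-B")
record_message(honest, legitimate=True)      # performance rises to 2
record_message(malicious, legitimate=False)  # performance falls to 0
eligibility = {v.vehicle_id: can_join_cluster(v) for v in (honest, malicious)}
```

With the assumed starting value, a single fake message is enough to drop a vehicle to the threshold and exclude it, because the rule requires a value strictly greater than 0.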
