Dataset Information

Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health.

ABSTRACT: Data can often be represented as a matrix, e.g., with observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering focuses on the large values. If the data is normalized by subtracting the row/column means, it contains values of mixed sign and the original NMF cannot be used. Our idea is to split the matrix into its positive and negative parts, take the absolute value of the negative elements, and then concatenate the two parts. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Except in situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.
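The split-and-concatenate step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes column-mean centering (the abstract mentions row/column means without fixing one), uses a generic multiplicative-update NMF solver as a stand-in, and the names `split_pos_neg` and `nmf_multiplicative` as well as the toy data are hypothetical.

```python
import numpy as np

def split_pos_neg(X):
    """Center the columns (an assumption), then place the positive part and
    the absolute value of the negative part side by side."""
    centered = X - X.mean(axis=0, keepdims=True)
    pos = np.maximum(centered, 0.0)    # positive entries, zeros elsewhere
    neg = np.maximum(-centered, 0.0)   # |negative entries|, zeros elsewhere
    return np.hstack([pos, neg])       # non-negative, twice as many columns

def nmf_multiplicative(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Generic NMF with multiplicative updates for the Frobenius loss,
    used here only as a stand-in solver."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: two groups of five observations, four variables.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[5, 5, 1, 1], scale=0.1, size=(5, 4))
group_b = rng.normal(loc=[1, 1, 5, 5], scale=0.1, size=(5, 4))
X = np.vstack([group_a, group_b])

V = split_pos_neg(X)                  # shape (10, 8), all entries >= 0
W, H = nmf_multiplicative(V, rank=2)
labels = W.argmax(axis=1)             # assign each row to its dominant component
```

Because the negative part is kept as its own block of columns, values below the mean contribute to the factorization on the same footing as values above it, which is the property the abstract attributes to PosNegNMF.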

SUBMITTER: Fogel P

PROVIDER: S-EPMC4881134 | biostudies-literature | 2016 May

REPOSITORIES: biostudies-literature

Publications

Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health.

Fogel P, Gaston-Mathé Y, Hawkins D, Fogel F, Luta G, Young SS

International Journal of Environmental Research and Public Health, 2016-05-18, issue 5

PMID: 27213413

Similar Datasets

Project description:
Background: Matrix factorization is a well-established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space, enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization that data reside in one latent space. Matrix tri-factorization removes this limitation by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining.
Results: We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100 times faster than its single-processor counterpart.
Conclusions: A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.
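As a rough illustration of why disjoint submatrices can be handled independently, the sketch below checks that a tri-factorization reconstruction computed block by block matches the one computed in a single pass. It covers only the reconstruction step, not the learning procedure from the paper; the factor names `U`, `S`, `V`, the matrix sizes, and the two-by-two partition are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((6, 2))   # row-side latent factors (hypothetical sizes)
S = rng.random((2, 3))   # latent mapping between the two latent spaces
V = rng.random((8, 3))   # column-side latent factors

# Serial reconstruction of the whole matrix.
full = U @ S @ V.T

# Partition row and column indices into disjoint blocks.
row_blocks = np.array_split(np.arange(6), 2)
col_blocks = np.array_split(np.arange(8), 2)

# Each block needs only its own rows of U and V plus the shared S,
# so the blocks could be computed by separate workers.
blockwise = np.zeros_like(full)
for r in row_blocks:
    for c in col_blocks:
        blockwise[np.ix_(r, c)] = U[r] @ S @ V[c].T

same = np.allclose(full, blockwise)
```

Here `same` confirms that the block-wise result agrees with the serial one for this reconstruction step.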

Project description:
Motivation: Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of the available patient cohort and focus solely on the analysis of mutation count data without exploiting metadata.
Results: Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. Unlike existing unsupervised signature extraction methods, this design reduces the need for elaborate post-processing strategies to recover most of the known signatures. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive a corresponding mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures and helps derive more interpretable representations.
Availability and implementation: https://github.com/ratschlab/SNBNMF-mutsig-public.
Supplementary information: Supplementary data are available at Bioinformatics online.

Project description:
Background: Although surgical methods are the most effective treatments for colon adenocarcinoma (COAD), the cure rates remain low, and recurrence rates remain high. Furthermore, platelet adhesion-related genes are gaining attention as potential regulators of tumorigenesis. Therefore, identifying the mechanisms responsible for the regulation of these genes in patients with COAD has become important. The present study aims to investigate the underlying mechanisms of platelet adhesion-related genes in COAD patients.
Methods: The present study was an experimental study. Initially, the effects of platelet number and related genomic alteration on survival were explored using real-world data and the cBioPortal database, respectively. Then, the differentially expressed platelet adhesion-related genes of COAD were analyzed using the TCGA database, and patients were further classified using non-negative matrix factorization (NMF) analysis. Afterward, some of the clinical and expression characteristics were compared between clusters. Finally, least absolute shrinkage and selection operator regression analysis was used to establish the prognostic nomogram. All data analyses were performed in R.
Results: High platelet counts are associated with worse survival in real-world patients, and alterations to platelet adhesion-related genes resulted in poorer prognoses, based on online data. Based on platelet adhesion-related genes, patients with COAD were classified into two clusters by NMF-based clustering analysis. Cluster2 had better overall survival than Cluster1. The gene copy number and enrichment analysis results revealed that two pathways were differentially enriched. In addition, the differentially expressed genes between these two clusters were enriched for POU6F1 in the transcription factor signaling pathway, and for MATN3 in the ceRNA network. Finally, a prognostic nomogram, which included the ALOX12 and ACTG1 genes, was established based on the platelet adhesion-related genes, with a concordance (C) index of 0.879 (0.848-0.910).
Conclusion: The mRNA expression-based NMF was used to reveal the potential role of platelet adhesion-related genes in COAD. The series of experiments revealed the feasibility of targeting platelet adhesion-associated gene therapy.
