De novo Gene Signature Identification from Single-Cell RNA-Seq with Hierarchical Poisson Factorization
ABSTRACT: Common approaches to gene signature discovery in single cell RNA-sequencing depend upon predefined structures like clustering or pseudo-temporal orderings, do not account for the sparsity of single cell data, or require prior normalization. We present single cell Hierarchical Poisson Factorization (scHPF), a Bayesian factorization method that adapts Hierarchical Poisson Factorization for de novo discovery of both continuous and discrete expression patterns in complex tissues. scHPF does not require prior normalization and outperforms other methods in benchmark datasets. Applied to single cell RNA-sequencing of the core and margin of a high-grade glioma, scHPF uncovers subtle regional expression biases within glioma subpopulations and an expression signature associated with inferior survival in glioblastoma.
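To make the factorization idea concrete, the following is a minimal sketch of the generic Hierarchical Poisson Factorization generative model that the abstract says scHPF adapts. It is not the scHPF implementation and performs no inference; it only samples a synthetic count matrix. The function name `sample_hpf` and all hyperparameter values are illustrative assumptions, not values taken from the source.

```python
import numpy as np

def sample_hpf(n_cells=50, n_genes=200, n_factors=5,
               a=0.3, a_prime=1.0, b_prime=1.0,
               c=0.3, c_prime=1.0, d_prime=1.0, seed=0):
    """Sample counts from a generic HPF model (hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    # NumPy's gamma takes (shape, scale), so scale = 1 / rate throughout.
    # Per-cell and per-gene "capacity" variables with rates a'/b' and c'/d'.
    xi = rng.gamma(a_prime, b_prime / a_prime, size=n_cells)
    eta = rng.gamma(c_prime, d_prime / c_prime, size=n_genes)
    # Cell-by-factor and gene-by-factor weights; each row's rate is its capacity.
    theta = rng.gamma(a, 1.0 / xi[:, None], size=(n_cells, n_factors))
    beta = rng.gamma(c, 1.0 / eta[:, None], size=(n_genes, n_factors))
    # Raw counts are Poisson with rate theta @ beta.T, so no normalization is needed.
    counts = rng.poisson(theta @ beta.T)
    return counts, theta, beta

counts, theta, beta = sample_hpf()
print(counts.shape, float((counts == 0).mean()))
```

With small shape parameters such as `a` and `c` below 1, most weights are close to zero, which is one way a model of this form can accommodate sparse count data.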
Project description:Microarray data analysis. The intensity of 21,939 gene features per array was extracted from scanned microarray images using Feature Extraction 5.1.1 software (Agilent Technologies), which performs background subtraction and dye normalization. This normalization method is targeted at detecting changes in the relative expression of individual genes rather than in global expression. Detecting global expression changes would require external normalization controls (van de Peppel et al., 2003). Text output was processed using an application developed in-house to perform ANOVA (http://lgsun.grc.nia.nih.gov/ANOVA/). Intensities of features measured with >50% error were replaced with missing values, except for features with very low intensity. Surrogate values equal to the mean error were inserted for values that were negative or less than the probe error. Data were analyzed using ANOVA with embryonic stage as a factor. The small number of biological replicates typical of expression profiling experiments results in a highly variable error variance. This problem is usually addressed by log-ratio thresholds (Schena et al., 1995), which require subjective decisions about biological significance, or by Bayesian adjustment of the error variance (Baldi and Long, 2001), which may still underestimate the error variance and produce false positives. To reduce false positives, we opted for a very conservative error model in which the error variance used for estimating F-statistics is the maximum of the actual error variance for the gene and the average error variance of 500 genes with similar average intensity. Statistical significance was determined using the False Discovery Rate (FDR = 10%) method (Benjamini and Hochberg, 1995). Pairwise mean comparisons were done with t-statistics and FDR = 10%. Further data processing, including scatter plots, hierarchical clustering, and principal component analysis (PCA), was also performed with the NIA microarray analysis tool (http://lgsun.grc.nia.nih.gov/ANOVA/). The input file for the NIA microarray analysis tool is available at http://lgsun.grc.nia.nih.gov/microarray/data.html
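The conservative error model and the FDR step described above can be sketched as follows. This is an illustrative reading, not the NIA tool's code: the function names are hypothetical, and "500 genes with similar average intensity" is interpreted here as a window of 500 neighbours in intensity rank (including the gene itself), which is an assumption.

```python
import numpy as np

def conservative_error_variance(mean_intensity, error_var, window=500):
    """Max of each gene's own error variance and the mean error variance
    of `window` genes with neighbouring average intensity (assumed reading)."""
    mean_intensity = np.asarray(mean_intensity, dtype=float)
    error_var = np.asarray(error_var, dtype=float)
    n = error_var.size
    w = min(window, n)
    order = np.argsort(mean_intensity, kind="stable")
    sorted_var = error_var[order]
    # Prefix sums give every sliding-window mean in O(n).
    csum = np.concatenate(([0.0], np.cumsum(sorted_var)))
    # Centre the window on each gene, clipped to stay inside the array.
    starts = np.clip(np.arange(n) - w // 2, 0, n - w)
    local_mean = (csum[starts + w] - csum[starts]) / w
    adjusted = np.empty(n)
    adjusted[order] = np.maximum(sorted_var, local_mean)
    return adjusted

def benjamini_hochberg(pvalues, fdr=0.10):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= fdr * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest rank that passes.
        k = np.nonzero(passed)[0].max()
        reject[order[:k + 1]] = True
    return reject
```

Because the adjusted variance is never smaller than the gene's own variance, the resulting F-statistics can only shrink, which is what makes the model conservative.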
Project description:Feature Extraction 5.1.1 software (Agilent Technologies) performs background subtraction and dye normalization. This normalization method is targeted at detecting changes in the relative expression of individual genes rather than in global expression. Detecting global expression changes would require external normalization controls (van de Peppel et al., 2003). Text output was processed using an application developed in-house to perform ANOVA (http://lgsun.grc.nia.nih.gov/ANOVA/). Intensities of features measured with >50% error were replaced with missing values, except for features with very low intensity. Surrogate values equal to the mean error were inserted for values that were negative or less than the probe error. Data were analyzed using ANOVA with embryonic stage as a factor. The small number of biological replicates typical of expression profiling experiments results in a highly variable error variance. This problem is usually addressed by log-ratio thresholds (Schena et al., 1995), which require subjective decisions about biological significance, or by Bayesian adjustment of the error variance (Baldi and Long, 2001), which may still underestimate the error variance and produce false positives. To reduce false positives, we opted for a very conservative error model in which the error variance used for estimating F-statistics is the maximum of the actual error variance for the gene and the average error variance of 500 genes with similar average intensity. Statistical significance was determined using the False Discovery Rate (FDR = 10%) method (Benjamini and Hochberg, 1995). Pairwise mean comparisons were done with t-statistics and FDR = 10%. Further data processing, including scatter plots, hierarchical clustering, and principal component analysis (PCA), was also performed with the NIA microarray analysis tool (http://lgsun.grc.nia.nih.gov/ANOVA/). Details are described in the publication (Hamatani T. et al., Developmental Cell. Published online Dec. 18, 2003).
Project description:Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and validated by non-negative matrix factorization. BACKGROUND: Current clinical and histopathological criteria used to define lung squamous cell carcinomas (SCCs) are insufficient to predict clinical outcome. We attempted to make a clinically useful classification based on gene expression profiling. METHODS: We used cDNA microarrays with 40,386 elements to analyze the gene expression profiles of 48 surgically resected samples of lung SCC. Nine samples of lung adenocarcinoma and 30 of normal lung were also included, giving a total of 87 samples analyzed. After gene filtering, the data were subjected to hierarchical clustering and to consensus clustering with the non-negative matrix factorization (NMF) approach. FINDINGS: Initial analysis by hierarchical clustering divided the SCCs into two distinct subclasses. An additional independent round of hierarchical clustering and consensus clustering with the NMF approach validated the classification. Kaplan-Meier analysis with the log-rank test showed a non-significant difference in survival (p=0.071), but the likelihood of survival to 6 years differed significantly between the two groups (40.5% vs 81.8%, p=0.014, Z-test). Biological process categories characteristic of each subclass were identified statistically, and up-regulation of cell-proliferation-related genes was evident in the subclass with a poor prognosis. In the subclass with better survival, genes involved in differentiated intracellular functions, such as the MAPKKK cascade, ceramide metabolism, or regulation of transcription, were up-regulated. Keywords: repeat sample
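As a rough illustration of consensus clustering with NMF, the sketch below factorizes a non-negative genes-by-samples matrix from several random starts and records how often each pair of samples lands in the same cluster. It is a minimal sketch under stated assumptions, not the study's pipeline: the multiplicative-update solver, the function names, the iteration count and the number of runs are all choices made here for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (genes x samples, non-negative) as W @ H by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = V.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Updates for the Frobenius-norm objective; they keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def consensus_matrix(V, k, n_runs=20):
    """Fraction of runs in which each pair of samples shares a cluster."""
    n_samples = V.shape[1]
    consensus = np.zeros((n_samples, n_samples))
    for run in range(n_runs):
        _, H = nmf(V, k, seed=run)
        # Assign each sample to the factor with the largest coefficient.
        labels = H.argmax(axis=0)
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs
```

Entries near 0 or 1 indicate sample pairs that are separated or grouped consistently across random starts, which is the sense in which a stable consensus matrix supports a proposed classification.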
Project description:We aim to identify molecular subtypes of prostate cancer using consensus non-negative matrix factorization and correlate these with existing biomarkers to inform future immunotherapeutic strategies.
Project description:Multiple myeloma is a plasma cell malignancy almost always preceded by precursor conditions, but low tumor burden of these early stages has hindered the study of their molecular programs through bulk sequencing technologies. Here, we generated and analyzed single cell RNA-sequencing of plasma cells from 26 patients at varying disease stages and 9 healthy donors. In silico dissection and comparison of normal and transformed plasma cells from the same bone marrow biopsy enabled discovery of novel, patient-specific transcriptional changes. Using Bayesian Non-Negative Matrix Factorization, we discovered 15 gene expression signatures which represent transcriptional modules relevant to myeloma biology, and identified a signature that is uniformly lost in neoplastic cells across disease stages. Finally, we demonstrated that tumors contain heterogeneous subpopulations expressing distinct transcriptional patterns. Our findings characterize transcriptomic alterations present at the earliest stages of myeloma, paving the way for exploration of personalized treatment approaches prior to malignant disease.
Project description:Objective. Microarray analysis was used to determine whether children with recent onset polyarticular juvenile idiopathic arthritis (JIA) exhibit biologically or clinically informative gene expression signatures in peripheral blood mononuclear cells (PBMC). Methods. Peripheral blood samples were obtained from 59 healthy children and 61 children with polyarticular JIA prior to treatment with second-line medications, such as methotrexate or biological agents. RNA was purified from Ficoll-isolated mononuclear cells, fluorescently labeled and then hybridized to Affymetrix U133 Plus 2.0 GeneChips. Data were analyzed using ANOVA at a 5% false discovery rate threshold after Robust Multi-Array Average pre-processing and Distance Weighted Discrimination normalization. Results. Initial analysis revealed 873 probe sets for genes that were differentially expressed between polyarticular JIA and controls. Hierarchical clustering of these probe sets distinguished three subgroups within polyarticular JIA. Prototypical subjects within each subgroup were identified and used to define subgroup-specific gene expression signatures. One of these signatures was associated with monocyte markers, another with transforming growth factor-beta-inducible genes, and a third with immediate-early genes. Correlation of these gene expression signatures with clinical and biological features of JIA subgroups suggests direct relevance to aspects of disease activity and supports the division of polyarticular JIA into distinct subsets. Conclusions. PBMC gene expression signatures in recent onset polyarticular JIA reflect discrete disease processes and offer a molecular classification of disease. Keywords: Patient vs. control, reassessment of phenotype
Project description:We used whole-genome microarray expression profiling as a discovery platform to identify genes differentially expressed in high-grade diffuse glioma compared with low-grade diffuse glioma.
Project description:Prior studies have described the complex interplay that exists between glioma cells and neurons; however, the electrophysiological properties endogenous to tumor cells remain obscure. To address this, we employed Patch-sequencing on human glioma specimens and found that one third of patched cells in IDH mutant (IDHmut) tumors demonstrate properties of both neurons and glia by firing single, short action potentials. To define these hybrid cells (HCs) and discern whether they are of tumor origin, we developed a computational tool, Single Cell Rule Association Mining (SCRAM), to annotate each cell individually. SCRAM revealed that HCs represent tumor and non-tumor cells that have select features of GABAergic neurons and oligodendrocyte precursor cells. These studies are the first to characterize the combined electrophysiological and molecular properties of human glioma cells and describe a new cell type in human glioma with unique electrophysiological and transcriptomic properties that may also exist in the non-tumor brain.