Dataset Information

PCAN: Probabilistic correlation analysis of two non-normal data sets.

ABSTRACT: Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and microRNA, in samples collected from the same individuals. The main interest with these genomic data sets lies in identifying a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negatively correlated pairs compared with the standard approaches.
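
The attenuation described in the abstract can be reproduced with a small simulation. The sketch below is not PCAN and is not taken from the paper. It assumes a Poisson log-normal setup in which the natural parameters (log-rates) of two features are bivariate normal with a chosen correlation. The function names, the value of `sigma`, and the mean log-rate are illustrative assumptions.

```python
import numpy as np


def simulate_counts(rho, mean_log_rate, n_samples=5000, sigma=0.5, seed=0):
    """Draw Poisson counts for two features whose log-rates are bivariate
    normal with correlation rho (an assumed toy model, not the PCAN model)."""
    rng = np.random.default_rng(seed)
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    log_rates = rng.multivariate_normal(
        [mean_log_rate, mean_log_rate], cov, size=n_samples
    )
    # The natural parameter of a Poisson distribution is the log of its rate.
    counts = rng.poisson(np.exp(log_rates))
    return log_rates, counts


def pearson(x, y):
    """Sample Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Low expression: mean rate below one count per sample.
log_rates, counts = simulate_counts(rho=0.9, mean_log_rate=-1.0)
latent_corr = pearson(log_rates[:, 0], log_rates[:, 1])
count_corr = pearson(counts[:, 0], counts[:, 1])

print(f"correlation of natural parameters: {latent_corr:.2f}")
print(f"Pearson correlation of raw counts: {count_corr:.2f}")
```

Under these assumptions the correlation computed on the raw counts comes out far below the correlation of the underlying log-rates, because Poisson sampling noise dominates the variance when counts are low. Raising `mean_log_rate` narrows the gap.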

SUBMITTER: Zoh RS

PROVIDER: S-EPMC5045754 | biostudies-literature | 2016 Dec

REPOSITORIES: biostudies-literature

Publications

PCAN: Probabilistic correlation analysis of two non-normal data sets.

Zoh RS, Mallick B, Ivanov I, Baladandayuthapani V, Manyam G, Chapkin RS, Lampe JW, Carroll RJ

Biometrics, 2016-04-01, issue 4

PMID: 27037601

Similar Datasets

Project description:

Motivation: We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by either microarray or RNA-seq platforms. The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets.

Results: In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. Then, the parameter space is linear with the number of datasets. In our previous study, we have applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer.

Availability and implementation: Additional results are included in a supplemental file. Computer program R-functions are freely available at http://home.gwu.edu/?ylai/research/Concordance.

Contact: ylai@gwu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
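
The exponential growth mentioned in the project description follows from simple counting: with three states per dataset, K datasets give 3^K joint component combinations, of which only three are fully concordant. The sketch below is a counting illustration only, not the authors' mixture model or EM algorithm. The state labels and function names are hypothetical.

```python
from itertools import product

# Three per-dataset states, as named in the description above.
STATES = ("down", "up", "null")


def joint_configurations(n_datasets):
    """Enumerate every combination of per-dataset states."""
    return list(product(STATES, repeat=n_datasets))


def concordant(configs):
    """Keep only configurations where all datasets share the same state."""
    return [c for c in configs if len(set(c)) == 1]


for k in (2, 3, 5):
    configs = joint_configurations(k)
    print(f"datasets={k} joint={len(configs)} concordant={len(concordant(configs))}")
```

The number of joint combinations grows as 3^K while the concordant ones stay at three, which is why a model that focuses on the concordant components needs some structural assumption for the remaining combinations.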
