Dataset Information

An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets.

ABSTRACT:

Motivation

We previously proposed a mixture-model-based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Because the mixture model is built on z-scores obtained by transforming the P-values of differential expression tests, it is generally applicable to expression data generated by either microarray or RNA-seq platforms. The model is simple: for each dataset, three normal distribution components represent down-regulation, up-regulation and no differential expression. However, as the number of datasets increases, the model parameter space grows exponentially, because every combination of components across the datasets must be considered.
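The two ideas in this paragraph, the P-value to z-score transformation and the exponential growth in component combinations, can be illustrated with a minimal Python sketch. The function names are hypothetical, and the transformation z = Phi^{-1}(1 - p) for a one-sided upper-tail P-value is an assumption; the abstract does not state which convention the authors use.

```python
from itertools import product
from statistics import NormalDist

# The three per-dataset states named in the abstract.
STATES = ("down", "null", "up")

def p_to_z(p_upper):
    """Map a one-sided (upper-tail) P-value to a z-score.

    Assumed convention: z = Phi^{-1}(1 - p). The P-value must lie strictly
    between 0 and 1, otherwise inv_cdf raises StatisticsError.
    """
    return NormalDist().inv_cdf(1.0 - p_upper)

def component_combinations(n_datasets):
    """All joint state assignments across datasets: 3 ** n_datasets of them."""
    return list(product(STATES, repeat=n_datasets))

def is_concordant(combo):
    """A combination is concordant when every dataset is in the same state."""
    return len(set(combo)) == 1

for k in (2, 3, 4, 5):
    combos = component_combinations(k)
    n_concordant = sum(is_concordant(c) for c in combos)
    print(f"{k} datasets: {len(combos)} combinations, {n_concordant} concordant")
```

Whatever the number of datasets, only three combinations are concordant (all down, all null, all up), while the total number of combinations is 3 raised to the number of datasets. This is why an unrestricted model becomes impractical quickly.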

Results

In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, together with their related expectation-maximization (EM) algorithms. Under these structures, the size of the parameter space grows only linearly with the number of datasets. In our previous study, we applied the general mixture model to three microarray datasets from lung cancer studies. Here we show that the reduced mixture model with the exchangeable structure detects more gene sets (or pathways) and also more genes. With the growing availability of data from The Cancer Genome Atlas (TCGA), we further demonstrate the advantage of incorporating the concordance feature using TCGA RNA sequencing data for two closely related types of cancer.
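For readers unfamiliar with EM for normal mixtures, the following Python sketch fits a three-component mixture to the z-scores of a single dataset. It is not the authors' reduced multi-dataset algorithm: the exchangeable, multiset coefficient and autoregressive structures are not implemented. Fixing the null component at N(0, 1), the starting values, the simulated data and all function names are assumptions made for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_three_component(z, n_iter=100):
    """EM for a down / null / up normal mixture on one dataset's z-scores.

    Assumption: the null component is held fixed at N(0, 1); only its
    proportion is estimated. The down and up components have free mean and sd.
    """
    pi = [0.1, 0.8, 0.1]      # proportions: down, null, up (starting values)
    mu = [-2.0, 0.0, 2.0]
    sigma = [1.0, 1.0, 1.0]
    n = len(z)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each z-score.
        resp = []
        for x in z:
            w = [pi[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(3)]
            total = sum(w) or 1e-300  # guard against numerical underflow
            resp.append([wj / total for wj in w])
        # M-step: update proportions, then means and sds of non-null components.
        for j in range(3):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / n
            if j == 1 or nj < 1e-12:
                continue  # null stays N(0, 1); skip empty components
            mu[j] = sum(r[j] * x for r, x in zip(resp, z)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, z)) / nj
            sigma[j] = max(math.sqrt(var), 1e-3)
    return pi, mu, sigma

# Simulated z-scores (hypothetical data): mostly null, some down, some up.
random.seed(0)
z_scores = (
    [random.gauss(0.0, 1.0) for _ in range(2000)]
    + [random.gauss(-3.0, 1.0) for _ in range(250)]
    + [random.gauss(3.0, 1.0) for _ in range(250)]
)
pi_hat, mu_hat, sigma_hat = em_three_component(z_scores)
print("proportions (down, null, up):", [round(p, 3) for p in pi_hat])
print("means (down, null, up):", [round(m, 3) for m in mu_hat])
```

In a joint model over several datasets, the proportions would instead be attached to combinations of per-dataset states, which is where the structural assumptions described above reduce the number of free parameters.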

Availability and implementation

Additional results are included in a supplemental file. R functions are freely available at http://home.gwu.edu/~ylai/research/Concordance.

Contact

ylai@gwu.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Lai Y

PROVIDER: S-EPMC5860313 | biostudies-literature | 2017 Dec

REPOSITORIES: biostudies-literature

Publications

An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets.

Yinglei Lai, Fanni Zhang, Tapan K Nayak, Reza Modarres, Norman H Lee, Timothy A McCaffrey

Bioinformatics (Oxford, England), 2017-12-01, issue 23

PMID: 28174897

Similar Datasets

Efficient algorithms for fast integration on large data sets from multiple sources.
| S-EPMC3439324 | biostudies-literature

MacroSEQUEST: efficient candidate-centric searching and high-resolution correlation analysis for large-scale proteomics data sets.
| S-EPMC2925463 | biostudies-literature

Parallel clustering algorithm for large-scale biological data sets.
| S-EPMC3976248 | biostudies-literature

STEME: efficient EM to find motifs in large data sets.
| S-EPMC3185442 | biostudies-literature

Efficient genotype compression and analysis of large genetic-variation data sets.
| S-EPMC4697868 | biostudies-literature

KEGG for integration and interpretation of large-scale molecular data sets.
| S-EPMC3245020 | biostudies-literature

Machine Learning Adaptive Basis Sets for Efficient Large Scale Density Functional Theory Simulation.
| S-EPMC6096449 | biostudies-literature

Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.
| S-EPMC6084588 | biostudies-literature

Detecting Sources of Transcriptional Heterogeneity in Large-Scale RNA-Seq Data Sets.
| S-EPMC5161273 | biostudies-literature

MOGSA: Integrative Single Sample Gene-set Analysis of Multiple Omics Data.
| S-EPMC6692785 | biostudies-literature