Dataset Information

Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm.

ABSTRACT:

Background

Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful ones that can account for complex data and detect the most informative signatures are still lacking.

Methods

In this article, we present a new framework for identifying gene signatures from RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, "Multi-objective optimization for collecting cluster alternatives" (MOCCA, an R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data using that cluster size. The best cluster was obtained by computing, for each cluster (module), the average Spearman's correlation score over all pairs of genes belonging to it. The best cluster is treated as the signature for the respective disease or cellular condition.
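The cluster-scoring step can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used R packages); it assumes cluster labels are already available (for example from k-means at the chosen cluster size), that the signed correlation is averaged, and that the cluster with the highest average is taken as best. The function names, gene names, and toy expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def best_cluster(expression, labels):
    """Score each cluster by its mean pairwise Spearman correlation.

    expression: dict mapping gene -> list of expression values across samples
    labels:     dict mapping gene -> cluster id
    Assumption: the highest-scoring cluster is the "best" one.
    """
    clusters = {}
    for gene, cid in labels.items():
        clusters.setdefault(cid, []).append(gene)
    scores = {}
    for cid, genes in clusters.items():
        pairs = list(combinations(genes, 2))
        if not pairs:
            continue  # a single-gene cluster has no gene pairs to score
        scores[cid] = mean(
            spearman(expression[a], expression[b]) for a, b in pairs
        )
    return max(scores, key=scores.get), scores


# Hypothetical toy data: five genes measured in five samples.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 5, 8, 9],
    "g3": [1, 3, 2, 5, 4],
    "g4": [5, 4, 3, 2, 1],
    "g5": [1, 5, 2, 4, 3],
}
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}

best, scores = best_cluster(expression, labels)
print(best, scores)
```

In this toy input the genes in cluster 0 rise and fall together across samples, so that cluster receives the higher average score.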

Results

We applied our framework to a cervical cancer RNA-seq dataset that included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified a total of 582 DEGs, of which 260 are up-regulated and 322 are down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contains 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high classification performance.
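For reference, the four evaluation measures named above follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: positive samples predicted correctly / incorrectly
    tn/fp: negative samples predicted correctly / incorrectly
    """
    def ratio(numerator, denominator):
        # Avoid division by zero when a class or prediction is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=8, fp=1, tn=9, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With class sizes as unbalanced as in the dataset described here (253 versus 22 samples), accuracy alone can look high even when the smaller class is poorly classified, which is one reason to report sensitivity and specificity alongside it.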

Conclusions

Our framework identified a cluster, derived by multi-objective optimization, that serves as a signature and classifies the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. The method can be used to find signatures in RNA-seq or microarray data.

SUBMITTER: Mallik S

PROVIDER: S-EPMC6302366 | biostudies-literature | 2018 Dec

REPOSITORIES: biostudies-literature

Publications

Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm.

Saurav Mallik, Zhongming Zhao

BMC Systems Biology, 2018 Dec 21, Suppl 8

PMID: 30577846

Similar Datasets

deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data.
S-EPMC3098195 | biostudies-literature

Detecting Multivariate Gene Interactions in RNA-Seq Data Using Optimal Bayesian Classification.
S-EPMC4818202 | biostudies-other

Differential expression analysis using a model-based gene clustering algorithm for RNA-seq data.
S-EPMC8527798 | biostudies-literature

A graph-based algorithm for RNA-seq data normalization.
S-EPMC6980396 | biostudies-literature

An Efficient Algorithm for Sensitively Detecting Circular RNA from RNA-seq Data.
S-EPMC6213952 | biostudies-literature

Identification of Pathogen Signatures in Prostate Cancer Using RNA-seq.
S-EPMC4460021 | biostudies-literature

ARH-seq: identification of differential splicing in RNA-seq data.
S-EPMC4132698 | biostudies-literature

Neural arbors are Pareto optimal.
S-EPMC6532510 | biostudies-literature

CIARA: a cluster independent algorithm for the identification of rare cell types from single cell RNA seq data
2023-05-21 | E-MTAB-11610 | biostudies-arrayexpress

Multi-class clustering of cancer subtypes through SVM based ensemble of pareto-optimal solutions for gene marker identification.
S-EPMC2980474 | biostudies-literature