Dataset Information

Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes.

ABSTRACT: Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods are still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
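
As an illustration of the evaluation measure named in the abstract, the following is a minimal sketch of the adjusted Rand index together with the "highest SD" gene selection that the abstract describes as commonly used. It is not the authors' code: the function names, the dictionary layout of the expression data, and the use of the population SD are assumptions made for this example.

```python
from collections import Counter
from math import comb
from statistics import pstdev


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between a gold-standard labelling and a clustering result."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical partitions

    # Contingency counts: cells, row margins (truth), column margins (prediction).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. one cluster each)
    return (sum_cells - expected) / (max_index - expected)


def top_sd_genes(expression, k):
    """Return the k gene names with the largest SD across samples.

    `expression` is a hypothetical layout: {gene_name: [value per sample]}.
    """
    return sorted(expression, key=lambda g: pstdev(expression[g]), reverse=True)[:k]
```

An ARI of 1 means the clustering reproduces the gold standard up to a relabelling of clusters, while values near 0 correspond to chance-level agreement, which is why the negative-control values quoted in the abstract are close to zero for one data set.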

SUBMITTER: Kallberg D

PROVIDER: S-EPMC7943624 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Clustering high-dimensional data via feature selection.

Project description: High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called spectral clustering with feature selection (SC-FS), where we first obtain an initial estimate of labels via spectral clustering, then select a small fraction of features with the largest R-squared with these labels, that is, the proportion of variation explained by group labels, and conduct clustering again using selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world datasets demonstrate its usefulness in clustering high-dimensional data.
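
The feature-ranking step described above (R-squared of each feature with a set of group labels) can be sketched as follows. This is only an illustration of that one step, not the SC-FS procedure itself: the initial spectral clustering and the final re-clustering are omitted, the labels are assumed to be given, and the function names and row-per-sample data layout are hypothetical.

```python
def r_squared(values, labels):
    """Proportion of a feature's variation explained by group labels
    (between-group sum of squares divided by total sum of squares)."""
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # constant feature carries no information

    # Collect the feature values belonging to each group.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


def select_features(samples, labels, k):
    """Return the indices of the k features with the largest R-squared.

    `samples` is a list of rows, one row of feature values per sample.
    """
    n_features = len(samples[0])
    scores = [
        r_squared([row[j] for row in samples], labels) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]
```

In a full pipeline the labels passed to `select_features` would come from an initial clustering, and the selected columns would then be clustered again.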

| S-EPMC10119907 | biostudies-literature

Accurate feature selection improves single-cell RNA-seq cell clustering.

Project description: Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as 'features'), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set could have a significant impact on the clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step relying on some simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method on 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves the clustering accuracy.

| S-EPMC8644062 | biostudies-literature

Sparse feature selection methods identify unexpected global cellular response to strontium-containing materials

Project description: Despite the increasing sophistication of biomaterials design and functional characterization studies, little is known regarding cells' global response to biomaterials. Here, we combined nontargeted holistic biological and physical science techniques to evaluate how simple strontium ion incorporation within the well-described biomaterial 45S5 bioactive glass (BG) influences the global response of human mesenchymal stem cells. Our objective analyses of whole gene-expression profiles, confirmed by standard molecular biology techniques, revealed that strontium-substituted BG up-regulated the isoprenoid pathway, suggesting an influence on both sterol metabolite synthesis and protein prenylation processes. This up-regulation was accompanied by increases in cellular and membrane cholesterol and lipid raft contents as determined by Raman spectroscopy mapping and total internal reflection fluorescence microscopy analyses and by an increase in cellular content of phosphorylated myosin II light chain. Our unexpected findings of this strong metabolic pathway regulation as a response to biomaterial composition highlight the benefits of discovery-driven nonreductionist approaches to gain a deeper understanding of global cell-material interactions and suggest alternative research routes for evaluating biomaterials to improve their design.

2018-11-25 | E-MTAB-7384 | biostudies-arrayexpress

A comparison of marker gene selection methods for single-cell RNA sequencing data.

Project description: Background: The development of single-cell RNA sequencing (scRNA-seq) has enabled scientists to catalog and probe the transcriptional heterogeneity of individual cells in unprecedented detail. A common step in the analysis of scRNA-seq data is the selection of so-called marker genes, most commonly to enable annotation of the biological cell types present in the sample. In this paper, we benchmark 59 computational methods for selecting marker genes in scRNA-seq data. Results: We compare the performance of the methods using 14 real scRNA-seq datasets and over 170 additional simulated datasets. Methods are compared on their ability to recover simulated and expert-annotated marker genes, the predictive performance and characteristics of the gene sets they select, their memory usage and speed, and their implementation quality. In addition, various case studies are used to scrutinize the most commonly used methods, highlighting issues and inconsistencies. Conclusions: Overall, we present a comprehensive evaluation of methods for selecting marker genes in scRNA-seq data. Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student's t-test, and logistic regression.

| S-EPMC10895860 | biostudies-literature

scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data.

Project description: Single-cell RNA sequencing (scRNA-seq) enables researchers to reveal previously unknown cell heterogeneity and functional diversity, which is impossible with bulk RNA sequencing. Clustering approaches are widely used for analyzing scRNA-seq data and identifying cell types and states. In the past few years, various advanced computational strategies have emerged. However, low generalization and high computational cost are the main bottlenecks of existing methods. In this study, we established a novel computational framework, scFseCluster, for scRNA-seq clustering analysis. scFseCluster incorporates a metaheuristic algorithm (Feature Selection based on Quantum Squirrel Search Algorithm) to extract the optimal gene set, which largely guarantees the performance of cell clustering. We conducted simulation experiments in several aspects to verify the performance of the proposed approach. scFseCluster performed very well on eight benchmark scRNA-seq datasets because of the optimal gene sets obtained using the Feature Selection based on Quantum Squirrel Search Algorithm. The comparative study demonstrated the significant advantages of scFseCluster over seven state-of-the-art algorithms. In addition, our analysis shows that feature selection on highly variable genes can significantly improve clustering performance. In conclusion, our study demonstrates that scFseCluster is a highly versatile tool for enhancing scRNA-seq data clustering analysis.

| S-EPMC10547911 | biostudies-literature

Bayesian network-driven clustering analysis with feature selection for high-dimensional multi-modal molecular data.

Project description: Multi-modal molecular profiling data in bulk tumors or single cells are accumulating at a fast pace. There is a great need for developing statistical and computational methods to reveal molecular structures in complex data types toward biological discoveries. Here, we introduce Nebula, a novel Bayesian integrative clustering analysis for high-dimensional multi-modal molecular data to identify directly interpretable clusters and associated biomarkers in a unified and biologically plausible framework. To facilitate computational efficiency, a variational Bayes approach is developed to approximate the joint posterior distribution to achieve model inference in high-dimensional settings. We describe a pan-cancer data analysis of genomic, epigenomic, and transcriptomic alterations in close to 9000 tumor samples across canonical oncogenic signaling pathways, immune and stemness phenotype, with comparisons to state-of-the-art clustering methods. We demonstrate that Nebula has the unique advantage of revealing patterns on the basis of shared pathway alterations, offering biological and clinical insights beyond tumor type and histology in the pan-cancer analysis setting. We also illustrate the utility of Nebula in single cell data for immune cell decomposition in peripheral blood samples.

| S-EPMC7933297 | biostudies-literature

Sparse feature selection methods identify unexpected global cellular response to strontium-containing materials.

| S-EPMC4394289 | biostudies-literature

Clustering and classification methods for single-cell RNA-sequencing data.

Project description: Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay particular attention to clustering and classification methods but also discuss methods that have emerged recently as powerful alternatives, including nonlinear and linear methods and descending dimension methods. Finally, we focus on clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive description of scRNA-seq data and download URLs.

| S-EPMC7444317 | biostudies-literature

Long Non-Coding RNA Landscape in Prostate Cancer Molecular Subtypes: A Feature Selection Approach.

Project description: Prostate cancer is one of the most common malignancies in men. It is characterized by a high molecular genomic heterogeneity and, thus, molecular subtypes, that, to date, have not been used in clinical practice. In the present paper, we aimed to better stratify prostate cancer patients through the selection of robust long non-coding RNAs. To fulfill the purpose of the study, a bioinformatic approach focused on feature selection applied to a TCGA dataset was used. In such a way, LINC00668 and long non-coding (lnc)-SAYSD1-1, able to discriminate ERG/not-ERG subtypes, were demonstrated to be positive prognostic biomarkers in ERG-positive patients. Furthermore, we performed a comparison between mutated prostate cancer, identified as "classified", and a group of patients with no peculiar genomic alteration, named "not-classified". Moreover, LINC00920 lncRNA overexpression has been linked to a better outcome of the hormone regimen. Through the feature selection approach, it was found that the overexpression of lnc-ZMAT3-3 is related to low-grade patients, and three lncRNAs: lnc-SNX10-87, lnc-AP1S2-2, and ADPGK-AS1 showed, through a co-expression analysis, significant correlation values with potentially druggable pathways. In conclusion, the data mining of publicly available data and robust bioinformatic analyses are able to explore the unknown biology of malignancies.

| S-EPMC7926489 | biostudies-literature

Feature selection methods for identifying genetic determinants of host species in RNA viruses.

Project description: Despite environmental, social and ecological dependencies, emergence of zoonotic viruses in human populations is clearly also affected by genetic factors which determine cross-species transmission potential. RNA viruses pose an interesting case study given that their mutation rates are orders of magnitude higher than those of any other pathogen, as reflected by the recent emergence of SARS and influenza, for example. Here, we show how feature selection techniques can be used to reliably classify viral sequences by host species, and to identify the crucial minority of host-specific sites in pathogen genomic data. The variability in alleles at those sites can be translated into prediction probabilities that a particular pathogen isolate is adapted to a given host. We illustrate the power of these methods by: 1) identifying the sites explaining SARS coronavirus differences between human, bat and palm civet samples; 2) showing how cross-species jumps of rabies virus among bat populations can be readily identified; and 3) de novo identification of likely functional influenza host discriminant markers.

| S-EPMC3794897 | biostudies-literature
