Statistically invalid classification of high throughput gene expression data.

ABSTRACT: Classification analysis based on high throughput data is a common feature in neuroscience and other fields of science, with a rapidly increasing impact on both basic biology and disease-related studies. The outcome of such classifications often serves to delineate novel biochemical mechanisms in health and disease states, identify new targets for therapeutic interference, and develop innovative diagnostic approaches. Given the importance of such studies, we screened 111 recently published high-impact manuscripts involving classification analysis of gene expression, and found that 58 of them (53%) based their conclusions on a statistically invalid method that can lead to bias in the statistical sense: the true classification accuracy is lower than the reported classification accuracy. In this report we characterize the potential methodological error and its scope, investigate how it is influenced by different experimental parameters, and describe statistically valid methods for avoiding such classification mistakes.
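The abstract does not name the invalid method. One well-known way to obtain a reported accuracy that exceeds the true accuracy is to select discriminating genes using all samples and only afterwards run cross-validation. The sketch below assumes that scenario purely for illustration; the data are pure noise, and the function names, sample sizes, and nearest-centroid classifier are hypothetical choices, not taken from the paper.

```python
import random

def make_noise_data(n_samples, n_genes, rng):
    """Pure-noise expression matrix with balanced, arbitrary class labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, idx, genes):
    """Per-class mean of each gene in `genes`, using only the samples in idx."""
    means = {}
    for label in (0, 1):
        rows = [X[i] for i in idx if y[i] == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def top_genes(X, y, idx, k):
    """Pick the k genes with the largest absolute difference of class means."""
    all_genes = range(len(X[0]))
    m = class_means(X, y, idx, all_genes)
    ranked = sorted(all_genes, key=lambda g: abs(m[0][g] - m[1][g]), reverse=True)
    return ranked[:k]

def predict(X, y, train_idx, genes, test_i):
    """Nearest-centroid prediction for one held-out sample."""
    m = class_means(X, y, train_idx, genes)
    dist = {
        label: sum((X[test_i][g] - c) ** 2 for g, c in zip(genes, m[label]))
        for label in (0, 1)
    }
    return min(dist, key=dist.get)

def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy with gene selection inside or outside the folds."""
    all_idx = list(range(len(X)))
    # Assumed invalid variant: genes chosen once, using every sample,
    # including the one that is later held out for testing.
    leaked_genes = top_genes(X, y, all_idx, k)
    correct = 0
    for test_i in all_idx:
        train_idx = [i for i in all_idx if i != test_i]
        if select_inside_fold:
            genes = top_genes(X, y, train_idx, k)  # held-out sample never seen
        else:
            genes = leaked_genes
        correct += predict(X, y, train_idx, genes, test_i) == y[test_i]
    return correct / len(X)

def compare(n_repeats=5, n_samples=20, n_genes=1000, k=10, seed=0):
    """Average both accuracy estimates over several noise data sets."""
    rng = random.Random(seed)
    biased, valid = [], []
    for _ in range(n_repeats):
        X, y = make_noise_data(n_samples, n_genes, rng)
        biased.append(loo_accuracy(X, y, k, select_inside_fold=False))
        valid.append(loo_accuracy(X, y, k, select_inside_fold=True))
    return sum(biased) / n_repeats, sum(valid) / n_repeats

if __name__ == "__main__":
    biased_acc, valid_acc = compare()
    print(f"selection on all samples:   {biased_acc:.2f}")
    print(f"selection inside each fold: {valid_acc:.2f}")
```

Because the labels carry no information, an honest estimate should sit near chance. Repeating the selection step inside every fold keeps the held-out sample out of every decision, which is the property a valid estimate needs under this assumed scenario.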

SUBMITTER: Barbash S

PROVIDER: S-EPMC3551228 | biostudies-other | 2013

REPOSITORIES: biostudies-other

Similar Datasets

ScanGEO: parallel mining of high-throughput gene expression data.

Project description: Summary: Current options to mine publicly available gene expression data deposited in NCBI's gene expression omnibus (GEO), such as the GEO web portal and related applications, are optimized to reanalyze a single study, or search for a single gene, and therefore require manual intervention to reanalyze multiple studies for user-specified gene sets. ScanGEO is a simple, user-friendly Shiny web application designed to identify differentially expressed genes across all GEO studies matching user-specified criteria, for a flexible set of genes, visualize results and provide summary statistics and other reports using a single command. Availability and implementation: The ScanGEO source code is written in R and implemented as a Shiny app that can be freely accessed at http://scangeo.dartmouth.edu/ScanGEO/. For users who would like to run a local instantiation of the app, the R source code is available under a GNU GPLv3 license at https://github.com/StantonLabDartmouth/AppScanGEO. Contact: katja.koeppen@dartmouth.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC5860173 | biostudies-literature

Detection call algorithms for high-throughput gene expression microarray data.

Project description: Extensive methodological research has been conducted to improve gene expression summary methods. However, in addition to quantitative gene expression summaries, most platforms, including all those examined in the MicroArray Quality Control project, provide a qualitative detection call result for each gene on the platform. These detection call algorithms are intended to render an assessment of whether or not each transcript is reliably measured. In this paper, we review uses of these qualitative detection call results in the analysis of microarray data. We also review the detection call algorithms for two widely used gene expression microarray platforms, Affymetrix GeneChips and Illumina BeadArrays, and more clearly formalize the mathematical notation for the Illumina BeadArray detection call algorithm. Both algorithms result in a P-value which is then used for determining the qualitative detection calls. We examined the performance of these detection call algorithms and default parameters by applying the methods to two spike-in datasets. We show that the default parameters for qualitative detection calls yield few absent calls for high spike-in concentrations. When genes of interest are expected to be present at very low concentrations, spike-in datasets can be useful for appropriately adjusting the tuning parameters for qualitative detection calls.

| S-EPMC4110453 | biostudies-literature

nEASE: a method for gene ontology subclassification of high-throughput gene expression data.

Project description: High-throughput technologies can identify genes whose expression profiles correlate with specific phenotypes; however, placing these genes into a biological context remains challenging. To help address this issue, we developed nested Expression Analysis Systematic Explorer (nEASE). nEASE complements traditional gene ontology enrichment approaches by determining statistically enriched gene ontology subterms within a list of genes based on co-annotation. Here, we overview an open-source software version of the nEASE algorithm. nEASE can be used either stand-alone or as part of a pathway discovery pipeline. Availability: nEASE is implemented within the Multiple Experiment Viewer software package available at http://www.tm4.org/mev. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC6903781 | biostudies-literature

Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data.

Project description: Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are either based on relevancy or redundancy measure, which are usually adjudged through post selection classification accuracy. Through these methods the ranking of genes was conducted on a single high-dimensional expression data, which led to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach through combining a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values and computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach with nine existing competitive methods was carried on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biological relevant criteria based on quantitative trait loci and gene ontology. Our analytical results showed that the proposed approach selects genes which are more biologically relevant as compared to the existing methods. Moreover, the proposed approach was also found to be better with respect to the competitive existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.

| S-EPMC7712650 | biostudies-literature

GeneCloudOmics: A Data Analytic Cloud Platform for High-Throughput Gene Expression Analysis.

Project description: Gene expression profiling techniques, such as DNA microarray and RNA-Sequencing, have provided significant impact on our understanding of biological systems. They contribute to almost all aspects of biomedical research, including studying developmental biology, host-parasite relationships, disease progression and drug effects. However, the high-throughput data generations present challenges for many wet experimentalists to analyze and take full advantage of such rich and complex data. Here we present GeneCloudOmics, an easy-to-use web server for high-throughput gene expression analysis that extends the functionality of our previous ABioTrans with several new tools, including protein datasets analysis, and a web interface. GeneCloudOmics allows both microarray and RNA-Seq data analysis with a comprehensive range of data analytics tools in one package that no other current standalone software or web-based tool can do. In total, GeneCloudOmics provides the user access to 23 different data analytical and bioinformatics tasks including reads normalization, scatter plots, linear/non-linear correlations, PCA, clustering (hierarchical, k-means, t-SNE, SOM), differential expression analyses, pathway enrichments, evolutionary analyses, pathological analyses, and protein-protein interaction (PPI) identifications. Furthermore, GeneCloudOmics allows the direct import of gene expression data from the NCBI Gene Expression Omnibus database. The user can perform all tasks rapidly through an intuitive graphical user interface that overcomes the hassle of coding, installing tools/packages/libraries and dealing with operating systems compatibility and version issues, complications that make data analysis tasks challenging for biologists. Thus, GeneCloudOmics is a one-stop open-source tool for gene expression data analysis and visualization. It is freely available at http://combio-sifbi.org/GeneCloudOmics.

| S-EPMC9581002 | biostudies-literature

Automation of gene assignments to metabolic pathways using high-throughput expression data.

Project description: Background: Accurate assignment of genes to pathways is essential in order to understand the functional role of genes and to map the existing pathways in a given genome. Existing algorithms predict pathways by extrapolating experimental data in one organism to other organisms for which this data is not available. However, current systems classify all genes that belong to a specific EC family to all the pathways that contain the corresponding enzymatic reaction, and thus introduce ambiguity. Results: Here we describe an algorithm for assignment of genes to cellular pathways that addresses this problem by selectively assigning specific genes to pathways. Our algorithm uses the set of experimentally elucidated metabolic pathways from MetaCyc, together with statistical models of enzyme families and expression data to assign genes to enzyme families and pathways by optimizing correlated co-expression, while minimizing conflicts due to shared assignments among pathways. Our algorithm also identifies alternative ("backup") genes and addresses the multi-domain nature of proteins. We apply our model to assign genes to pathways in the Yeast genome and compare the results for genes that were assigned experimentally. Our assignments are consistent with the experimentally verified assignments and reflect characteristic properties of cellular pathways. Conclusion: We present an algorithm for automatic assignment of genes to metabolic pathways. The algorithm utilizes expression data and reduces the ambiguity that characterizes assignments that are based only on EC numbers.

| S-EPMC1239907 | biostudies-literature

SAMQA: error classification and validation of high-throughput sequenced read data.

Project description: Background: The advances in high-throughput sequencing technologies and growth in data sizes has highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. Results: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. Conclusions: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds gigabytes of samples regardless of coverage or sample type.

| S-EPMC3170309 | biostudies-literature

Deep learning approach for cancer subtype classification using high-dimensional gene expression data.

Project description: Motivation: Studies have shown that classifying cancer subtypes can provide valuable information for a range of cancer research, from aetiology and tumour biology to prognosis and personalized treatment. Current methods usually adopt gene expression data to perform cancer subtype classification. However, cancer samples are scarce, and the high-dimensional features of their gene expression data are too sparse to allow most methods to achieve desirable classification results. Results: In this paper, we propose a deep learning approach by combining a convolutional neural network (CNN) and bidirectional gated recurrent unit (BiGRU): our approach, DCGN, aims to achieve nonlinear dimensionality reduction and learn features to eliminate irrelevant factors in gene expression data. Specifically, DCGN first uses the synthetic minority oversampling technique algorithm to equalize data. The CNN can handle high-dimensional data without stress and extract important local features, and the BiGRU can analyse deep features and retain their important information; the DCGN captures key features by combining both neural networks to overcome the challenges of small sample sizes and sparse, high-dimensional features. In the experiments, we compared the DCGN to seven other cancer subtype classification methods using breast and bladder cancer gene expression datasets. The experimental results show that the DCGN performs better than the other seven methods and can provide more satisfactory classification results.

| S-EPMC9575247 | biostudies-literature

Integrative analysis of BSG expression in NPC through immunohistochemistry and public high-throughput gene expression data.

Project description: Background: Though basigin (BSG) was reported to be overexpressed in nasopharyngeal carcinoma (NPC) and correlate with the development of NPC, the molecular basis of BSG in NPC remained elusive. The aim of the research was to investigate BSG expression in NPC and the potential molecular mechanism underlying it. Materials and methods: BSG expression in NPC tissues was detected with immunohistochemistry. Chi-square test, Kruskal-Wallis test and Spearman correlation test were performed to examine the relationship between BSG expression and the clinico-pathological features as well as EGFR and P-53 expression in NPC. In addition, data from the Human Protein Atlas (HPA) database and oncomine were collected to validate BSG expression in NPC. Meta-analysis was conducted to investigate the association between BSG expression and the clinico-pathological variables of NPC. The prognostic value and the alteration of BSG gene status were also analyzed with data from The Cancer Genome Atlas (TCGA). Results: BSG presented notably higher expression in NPC tissues than in non-cancer tissues. Moreover, IHC results showed that BSG expression was significantly correlated with tumor progression. A positive correlation was also found between BSG expression and EGFR, P53 expression. Meta-analysis confirmed that BSG was indicative of lymph node metastasis and TNM stage in NPC. Additionally, data from cBioPortal indicated that alteration of BSG gene existed in 5% of NPC cases and BSG correlative genes were obtained from the Co-expression Analysis in TCGA. Conclusion: BSG was overexpressed in NPC and might have an oncogenic effect on the tumorigenesis and progression of NPC.

| S-EPMC5666066 | biostudies-literature

High-throughput screening and classification of chemicals and their effects on neuronal gene expression using RASL-seq.

Project description: We previously used RNA-seq to identify chemicals whose effects on neuronal gene expression mimicked transcriptional signatures of autism, aging, and neurodegeneration. However, this approach was costly and time consuming, which limited our study to testing a single chemical concentration on mixed sex cortical neuron cultures. Here, we adapted a targeted transcriptomic method (RASL-seq, similar to TempO-seq) to interrogate changes in expression of a set of 56 signature genes in response to a library of 350 chemicals and chemical mixtures at four concentrations in male and female mouse neuronal cultures. This enabled us to replicate and expand our previous classifications, and show that transcriptional responses were largely equivalent between sexes. Overall, we found that RASL-seq can be used to accelerate the pace at which chemicals and mixtures that transcriptionally mimic autism and other neuropsychiatric diseases can be identified, and provides a cost-effective way to quantify gene expression with a panel of marker genes.

| S-EPMC6418307 | biostudies-literature
