Dataset Information

A comprehensive re-analysis of the Golden Spike data: towards a benchmark for differential expression methods.

ABSTRACT:

Background

The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws.

Results

We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison.

Conclusion

We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.

SUBMITTER: Pearson RD

PROVIDER: S-EPMC2324099 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Comprehensive evaluation of methods for differential expression analysis of metatranscriptomics data.

Project description:Understanding the function of the human microbiome is important, but the development of statistical methods specifically for microbial gene expression data (i.e. metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this gap, we undertook a comprehensive evaluation and benchmarking of 10 differential analysis methods for metatranscriptomics data. We used a combination of real and simulated data to evaluate performance (i.e. type I error, false discovery rate and sensitivity) of the following methods: log-normal (LN), logistic-beta (LB), MAST, DESeq2, metagenomeSeq, ANCOM-BC, LEfSe, ALDEx2, Kruskal-Wallis and two-part Kruskal-Wallis. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of childhood dental disease (early childhood caries, ECC), whereas validations were sought in two additional datasets from the ECC study and an inflammatory bowel disease study. The LB test showed the highest sensitivity in both small and large samples and reasonably controlled type I error. In contrast, MAST was hampered by inflated type I error. Upon application of the LN and LB tests in the ECC study, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with childhood dental disease. This comprehensive model evaluation offers practical guidance for selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics. Selection of an optimal method increases the possibility of detecting true signals while minimizing the chance of claiming false ones.

| S-EPMC10516371 | biostudies-literature

Benchmarking RNA-seq differential expression analysis methods using spike-in and simulation data.

Project description:Benchmarking RNA-seq differential expression analysis methods using spike-in and simulated RNA-seq data has often yielded inconsistent results. The spike-in data, which were generated from the same bulk RNA sample, only represent technical variability, making the test results less reliable. We compared the performance of 12 differential expression analysis methods for RNA-seq data, including recent variants in widely used software packages, using both RNA spike-in data and data simulated under a negative binomial (NB) model. The performance of edgeR, DESeq2, and ROTS differed markedly between the two benchmark tests. Each method was then tested under extensive simulation conditions, which demonstrated in particular the large impact of the proportion, dispersion, and balance of differentially expressed (DE) genes. DESeq2, a robust version of edgeR (edgeR.rb), voom with TMM normalization (voom.tmm) and voom with sample weights (voom.sw) showed good overall performance regardless of the presence of outliers and the proportion of DE genes. The performance of RNA-seq DE gene analysis methods substantially depended on the benchmark used. Based on the simulation results, suitable methods were suggested under various test conditions.

| S-EPMC7192453 | biostudies-literature

Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.

Project description:A large number of computational methods have been developed for analyzing differential gene expression in RNA-seq data. We describe a comprehensive evaluation of common methods using the SEQC benchmark dataset and ENCODE data. We consider a number of key features, including normalization, accuracy of differential expression detection and differential expression analysis when one condition has no detectable expression. We find significant differences among the methods, but note that array-based methods adapted to RNA-seq data perform comparably to methods designed for RNA-seq. Our results demonstrate that increasing the number of replicate samples significantly improves detection power over increased sequencing depth.

| S-EPMC4054597 | biostudies-literature

A comprehensive assessment of cell type-specific differential expression methods in bulk data.

Project description:Accounting for cell type compositions has been very successful in analyzing high-throughput data from heterogeneous tissues. Differential gene expression analysis at the cell type level is becoming increasingly popular, enabling biomarker discovery at a finer granularity within a particular cell type. Although several computational methods have been developed to identify cell type-specific differentially expressed genes (csDEG) from RNA-seq data, a systematic evaluation is yet to be performed. Here, we thoroughly benchmark six recently published methods: CellDMC, CARseq, TOAST, LRCDE, CeDAR and TCA, together with two classical methods, csSAM and DESeq2, for a comprehensive comparison. We aim to systematically evaluate the performance of popular csDEG detection methods and provide guidance to researchers. In simulation studies, we benchmark available methods under various scenarios of baseline expression levels, sample sizes, cell type compositions, expression level alterations, technical noise and biological dispersions. Real data analyses of three large datasets on inflammatory bowel disease, lung cancer and autism provide evaluation at both the gene level and the pathway level. We find that csDEG calling is strongly affected by effect size, baseline expression level and cell type compositions. The results imply that csDEG discovery is itself a challenging task, with room for improvement in handling low signal-to-noise ratios and lowly expressed genes.

| S-EPMC9851321 | biostudies-literature

A Benchmark for Data Imputation Methods.

Project description:With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications as well, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test data or both train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help guide researchers and engineers in selecting data preprocessing methods for automated data quality improvement.

| S-EPMC8297389 | biostudies-literature

Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data

2013-08-20 | GSE49712 | GEO

Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data

Project description:A large number of computational methods have been recently developed for analyzing differential gene expression (DE) in RNA-seq data. We report on a comprehensive evaluation of the commonly used DE methods using the SEQC benchmark data set and data from the ENCODE project. We evaluated a number of key features including: normalization, accuracy of DE detection and DE analysis when one condition has no detectable expression. We found significant differences among the methods. Furthermore, computational methods designed for DE detection from expression array data perform comparably to methods customized for RNA-seq. Most importantly, our results demonstrate that increasing the number of replicate samples significantly improves detection power over increased sequencing depth. The Sequencing Quality Control Consortium generated two datasets from two reference RNA samples in order to evaluate transcriptome profiling by next-generation sequencing technology. Each sample contains one of the reference RNA sources and a set of synthetic RNAs from the External RNA Control Consortium (ERCC) at known concentrations. Group A contains 5 replicates of the Stratagene Universal Human Reference RNA (UHRR), which is composed of total RNA from 10 human cell lines, with 2% by volume of ERCC mix 1. Group B includes 5 replicate samples of the Ambion Human Brain Reference RNA (HBRR) with 2% by volume of ERCC mix 2. The ERCC spike-in control is a mixture of 92 synthetic polyadenylated oligonucleotides, 250-2000 nucleotides long, that are meant to resemble human transcripts.

2013-08-20 | E-GEOD-49712 | biostudies-arrayexpress

Comprehensive Evaluation of Differential Methylation Analysis Methods for Bisulfite Sequencing Data.

Project description:Background: With advances in next-generation sequencing technologies, the bisulfite conversion of genomic DNA followed by sequencing has become the predominant technique for quantifying genome-wide DNA methylation at single-base resolution. A large number of computational approaches are available in literature for identifying differentially methylated regions in bisulfite sequencing data, and more are being developed continuously. Results: Here, we focused on a comprehensive evaluation of commonly used differential methylation analysis methods and describe the potential strengths and limitations of each method. We found that there are large differences among methods, and no single method consistently ranked first in all benchmarking. Moreover, smoothing seemed not to improve the performance greatly, and a small number of replicates created more difficulties in the computational analysis of BS-seq data than low sequencing depth. Conclusions: Data analysis and interpretation should be performed with great care, especially when the number of replicates or sequencing depth is limited.

| S-EPMC8345583 | biostudies-literature

Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization.

Project description:Dimension reduction (DR) algorithms project data from high dimensions to lower dimensions to enable visualization of interesting high-dimensional structure. DR algorithms are widely used for analysis of single-cell transcriptomic data. Despite widespread use of DR algorithms such as t-SNE and UMAP, these algorithms have characteristics that lead to lack of trust: they do not preserve important aspects of high-dimensional structure and are sensitive to arbitrary user choices. Given the importance of gaining insights from DR, DR methods should be evaluated carefully before trusting their results. In this paper, we introduce and perform a systematic evaluation of popular DR methods, including t-SNE, art-SNE, UMAP, PaCMAP, TriMap and ForceAtlas2. Our evaluation considers five components: preservation of local structure, preservation of global structure, sensitivity to parameter choices, sensitivity to preprocessing choices, and computational efficiency. This evaluation can help us to choose DR tools that align with the scientific goals of the user.

| S-EPMC9296444 | biostudies-literature

A comprehensive comparison of differential accessibility analysis methods for ATAC-seq data

Project description:Background: ATAC-seq is widely used to measure chromatin accessibility and identify open chromatin regions (OCRs). OCRs usually indicate the active regulatory elements in the genome and are directly associated with gene regulatory networks. Identification of differential accessibility regions (DARs) between different biological conditions is critical to measure the differential activity of regulatory elements. Differential analysis of ATAC-seq data shares many similarities with differential expression analysis of RNA-seq data. However, the distribution of ATAC-seq signal is different from that of RNA-seq data, and higher sensitivity is desired for DAR identification. Many different tools can be used to perform differential analysis of ATAC-seq data, but a comprehensive comparison and benchmarking of these methods is still missing. Methods: Here, we used simulated datasets to systematically measure the sensitivity and specificity of 6 different methods. We further discussed the statistical and signal density cutoffs in the differential analysis of ATAC-seq by applying them to real data. Batch effects are very common in high-throughput sequencing experiments. Results: We illustrated that batch-effect correction can dramatically improve the sensitivity in differential analysis of ATAC-seq data. Finally, we developed an easily usable package, BeCorrect, to perform batch-effect correction for visualizing corrected ATAC-seq signals on a genome browser. Conclusions: It is important to use PCA to check the sample distribution, and the Remove Unwanted Variation strategy can be used to correct the data to improve the sensitivity when strong batch effects are found in the data. Finally, BeCorrect can be used to correct the batch effect of ATAC-seq data signal based on DAR analysis, and generate a proper visualization on a genome browser.

2020-06-29 | GSE131144 | GEO
