Project description:It is well known, but frequently overlooked, that low- and high-throughput molecular data may contain batch effects, i.e., systematic technical variation. Confounding of experimental batches with the variable(s) of interest is especially concerning, as a batch effect may then be interpreted as a biologically significant finding. An integral step towards reducing false discovery in molecular data analysis is inspection for batch effects and application of computational tools to reduce this signal if present. In a 30-sample pilot Illumina Infinium HumanMethylation450 (450k array) experiment, we identified two sources of batch effects: array row and chip. Here, we demonstrate two approaches taken to process the 450k data in which an R function, ComBat, was applied to adjust for this non-biological signal. In the “initial analysis”, the application of ComBat to an unbalanced study design resulted in 9,683 and 19,192 significant (FDR<0.05) DNA methylation differences, despite none being present prior to correction. Suspicious of this dramatic change, we carried out a “revised processing” that included changes to our analysis as well as a greater number of samples, and that successfully reduced batch effects without introducing false signal. Our work supports conclusions made by an article previously published in this journal: though the ultimate antidote to batch effects is thoughtful study design, every DNA methylation microarray analysis should inspect, assess and, if necessary, adjust for batch effects. The analysis experience presented here can serve as a reminder to the broader community to establish research questions a priori, ensure that they match the study design, and encourage communication between technicians and analysts.
Project description:It is well known, but frequently overlooked, that low- and high-throughput molecular data may contain batch effects, i.e., systematic technical variation. Confounding of experimental batches with the variable(s) of interest is especially concerning, as a batch effect may then be interpreted as a biologically significant finding. An integral step toward reducing false discovery in molecular data analysis is inspection for batch effects and accounting for this signal if present. In a 30-sample pilot Illumina Infinium HumanMethylation450 (450k array) experiment, we identified two sources of batch effects: row and chip. Here, we demonstrate two approaches taken to process the 450k data in which an R function, ComBat, was applied to adjust for the non-biological signal. In the "initial analysis," the application of ComBat to an unbalanced study design resulted in 9,612 and 19,214 significant (FDR < 0.05) DNA methylation differences, despite none being present prior to correction. Suspicious of this dramatic change, we carried out a "revised processing" that included changes to our analysis as well as a greater number of samples, and that successfully reduced batch effects without introducing false signal. Our work supports conclusions made by an article previously published in this journal: though the ultimate antidote to batch effects is thoughtful study design, every DNA methylation microarray analysis should inspect, assess and, if necessary, account for batch effects. The analysis experience presented here can serve as a reminder to the broader community to establish research questions a priori, ensure that they match the study design, and encourage communication between technicians and analysts.
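The inspection step described above can start with a simple check of whether batch and the variable of interest are confounded. The following is a minimal sketch, not the authors' procedure: the function names, chip labels and group labels are hypothetical, and it only flags batch-by-group combinations that contain no samples, which is one visible symptom of an unbalanced design.

```python
from collections import Counter


def batch_by_group_counts(batches, groups):
    """Count samples for each (batch, group) pair."""
    return Counter(zip(batches, groups))


def empty_cells(batches, groups):
    """Return the (batch, group) combinations that contain no samples."""
    counts = batch_by_group_counts(batches, groups)
    return [
        (b, g)
        for b in sorted(set(batches))
        for g in sorted(set(groups))
        if counts[(b, g)] == 0
    ]


# Hypothetical toy design: 8 samples spread over two chips.
chips = ["chip1"] * 4 + ["chip2"] * 4

# Unbalanced: chip1 holds only cases, so chip and group are partly confounded.
unbalanced_groups = ["case", "case", "case", "case",
                     "control", "control", "control", "case"]

# Balanced: every chip holds two cases and two controls.
balanced_groups = ["case", "case", "control", "control"] * 2

unbalanced_gaps = empty_cells(chips, unbalanced_groups)
balanced_gaps = empty_cells(chips, balanced_groups)
```

An empty cell means that, within that batch, the group contrast cannot be separated from the batch itself. In that situation any batch adjustment deserves extra scrutiny before its output is interpreted.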
Project description:Microarray analysis is a powerful technique that has been used extensively for genome-wide gene expression analysis. Several different microarray technologies are available, but lack of standardization makes it challenging to compare and integrate data from different platforms. Furthermore, batch-related biases within datasets are common, but are often not tackled prior to the data analysis, potentially affecting the end results. In the current study, a set of 234 breast cancer samples was analyzed on two different microarray platforms. The aim was to compare and evaluate the reproducibility and accuracy of gene expression measurements obtained from our in-house 29K array platform with data from the Agilent SurePrint G3 microarray platform. The 29K dataset contained known batch effects associated with the fabrication procedure. Here we demonstrate how the ComBat batch adjustment method can unmask true biological signals by successfully overcoming systematic technical variation caused by differences between fabrication batches and microarray platforms. Paired correlation analysis revealed a high level of consistency between data obtained from the 29K gene expression platform and the Agilent SurePrint G3 platform, which could be further improved by ComBat batch adjustment. High-variance genes in particular were found to be highly reproducibly expressed across platforms. Furthermore, high concordance rates were observed both for prediction of estrogen receptor status and for intrinsic molecular breast cancer subtype classification, two clinically important parameters. In conclusion, the current study emphasizes the importance of utilizing proper batch adjustment methods to reduce systematic technical bias when comparing and integrating data from different fabrication batches and microarray platforms.
Project description:The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.
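The probe-effect point above can be illustrated with simulated data. This is a minimal sketch under stated assumptions, not the evaluation used in the study: the data are synthetic, the function names are hypothetical, and "standardize at the probe level" is taken here to mean centering and scaling each probe across samples before correlating samples.

```python
import numpy as np


def standardize_probes(expr):
    """Center and scale each probe (row) across samples (columns)."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant probes simply become all zeros
    return (expr - mean) / sd


def off_diagonal_mean(corr):
    """Average of the between-sample correlations (diagonal excluded)."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return corr[mask].mean()


# Synthetic data (assumption): 200 probes x 12 unrelated samples.
rng = np.random.default_rng(0)
probe_effect = rng.normal(8.0, 2.0, size=(200, 1))  # probe-specific baseline
noise = rng.normal(0.0, 0.5, size=(200, 12))        # independent per-sample noise
expr = probe_effect + noise

# Correlate samples (columns) before and after probe-level standardization.
raw_corr = np.corrcoef(expr, rowvar=False)
std_corr = np.corrcoef(standardize_probes(expr), rowvar=False)

raw_mean = off_diagonal_mean(raw_corr)
std_mean = off_diagonal_mean(std_corr)
```

In this toy setup the samples share nothing but the probe baseline, yet the raw between-sample correlations are high. After probe-level standardization that shared baseline is removed and the apparent similarity largely disappears.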
Project description:A key challenge in single-cell RNA-sequencing (scRNA-seq) data analysis is the presence of dataset- and batch-specific differences that can obscure the biological signal of interest. While there are various tools and methods to perform data integration and correct for batch effects, their performance can vary between datasets and according to the nature of the bias. Therefore, it is important to understand how batch effects manifest in order to adjust for them in a reliable way. Here, we systematically explore batch effects in scRNA-seq data from a variety of datasets according to magnitude, cell type specificity and complexity. We developed a cell-specific mixing score (cms) that quantifies how well cells from multiple batches are mixed. By considering distance distributions (in a lower dimensional space), the score is able to detect local batch bias and differentiate between unbalanced batches (i.e., when one cell type is more abundant in a batch) and systematic differences between cells of the same cell type. We implemented the cms, as well as related metrics to detect batch effects or measure structure preservation, in the CellMixS R/Bioconductor package. We systematically compare different metrics that have been proposed to quantify batch effects or bias in scRNA-seq data using real datasets with known batch effects and synthetic data that mimic various real data scenarios. While these metrics target the same question and are used interchangeably, we find differences in inter- and intra-dataset scalability, in sensitivity, and in a metric's ability to handle batch effects with differentially abundant cell types. We find that cell-specific metrics outperform cell type-specific and global metrics and recommend them for both method benchmarks and batch exploration.
Project description:We generated two comprehensive large-scale proteomics datasets with deliberate batch effects using the latest parallel accumulation-serial fragmentation in both Data-Dependent Acquisition (DDA) and Data-Independent Acquisition (DIA) modes. This dataset contains a balanced two-class design (cell lines: A549 vs K562), allowing investigation of mixed effects from class, batch and acquisition method. Investigators can also compare and integrate DDA and DIA platforms, delve into the various patterns and mechanisms of missing values, benchmark batch-effect correction algorithms and assess confounding between different technical issues.
Project description:We generated two comprehensive large-scale proteomics datasets with deliberate batch effects using the latest parallel accumulation-serial fragmentation in both Data-Dependent Acquisition (DDA) and Data-Independent Acquisition (DIA) modes. This dataset contains a balanced two-class design (cell lines: HCC1806 vs HS578T), allowing investigation of mixed effects from class, batch and acquisition method. Investigators can also compare and integrate DDA and DIA platforms, delve into the various patterns and mechanisms of missing values, benchmark batch-effect correction algorithms and assess confounding between different technical issues.
Project description:The great utility of microarrays for genome-scale expression analysis is challenged by the widespread presence of batch effects, which bias expression measurements, particularly within large data sets. These unwanted technical artifacts can obscure biological variation and thus significantly reduce the reliability of the analysis results. The predominant technical sources of batch effects are largely unknown. Here we quantitatively assess the prevalence and impact of several known technical effects on microarray expression results. In particular, we focus on important factors such as RNA degradation, RNA quantity, and sequence biases including multiple guanine effects. We find that common variation in RNA quality and RNA quantity not only can yield low-quality expression results, but also correlates with batch effects and with biological characteristics of the samples.
Project description:The goal of the study was to identify the hippocampal microenvironment associated with stress, in particular stress-induced learned helplessness. By comparing chronic and acute induction of learned helplessness, we identified cell populations and genes associated with resilience.