Project description:Interpreting the potentially vast number of hypotheses generated by a shotgun proteomics experiment requires a valid and accurate procedure for assigning statistical confidence estimates to identified tandem mass spectra. Despite the crucial role such procedures play in most high-throughput proteomics experiments, the scientific literature has not reached a consensus about the best confidence estimation methodology. In this work, we evaluate, using theoretical and empirical analysis, four previously proposed protocols for estimating the false discovery rate (FDR) associated with a set of identified tandem mass spectra: two variants of the target-decoy competition protocol (TDC) of Elias and Gygi and two variants of the separate target-decoy search protocol of Käll et al. Our analysis reveals significant biases in the two separate target-decoy search protocols. Moreover, the one TDC protocol that provides an unbiased FDR estimate among the target PSMs does so at the cost of forfeiting a random subset of high-scoring spectrum identifications. We therefore propose the mix-max procedure to provide unbiased, accurate FDR estimates in the presence of well-calibrated scores. The method avoids biases associated with the two separate target-decoy search protocols and also avoids the propensity for target-decoy competition to discard a random subset of high-scoring target identifications.
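As a rough illustration of the target-decoy competition idea mentioned above, the following sketch keeps the better of the target and decoy match for each spectrum and then estimates the FDR among the accepted target PSMs as the ratio of decoy wins to target wins above a score threshold. The function names, the tie-breaking rule, and the toy scores are assumptions made for this example; the sketch does not implement the mix-max procedure, and it is not taken from the authors' code.

```python
from typing import List, Tuple


def compete(target_scores: List[float], decoy_scores: List[float]) -> List[Tuple[float, bool]]:
    """For each spectrum keep only the better of its target and decoy match.

    Returns (score, is_decoy) pairs. Ties go to the decoy, which is a
    conservative assumption made for this sketch.
    """
    winners = []
    for t, d in zip(target_scores, decoy_scores):
        if t > d:
            winners.append((t, False))
        else:
            winners.append((d, True))
    return winners


def tdc_fdr(winners: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR among target PSMs scoring at or above the threshold."""
    n_target = sum(1 for s, is_decoy in winners if s >= threshold and not is_decoy)
    n_decoy = sum(1 for s, is_decoy in winners if s >= threshold and is_decoy)
    if n_target == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    return min(1.0, n_decoy / n_target)


# Toy scores (hypothetical): one target and one decoy score per spectrum.
target_scores = [9.1, 7.4, 3.2, 8.8, 2.5, 6.9]
decoy_scores = [2.0, 3.1, 4.0, 1.7, 5.5, 2.2]

winners = compete(target_scores, decoy_scores)
print(tdc_fdr(winners, threshold=3.5))
```

In this toy input, two spectra lose their target match to a higher-scoring decoy, so those target identifications are dropped during the competition step before any threshold is applied.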
Project description:Motivation: Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms. Availability and implementation: https://github.com/shawn-peng/FDR-estimation. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:False discovery rate (FDR) estimation is a cornerstone of proteomics that has recently been adapted to cross-linking/mass spectrometry. Here we demonstrate that heterobifunctional cross-linkers, while theoretically different from homobifunctional cross-linkers, need not be considered separately in practice. We develop and then evaluate the impact of applying a correct FDR formula for use of heterobifunctional cross-linkers and conclude that there are minimal practical advantages. Hence a single formula can be applied to data generated from the many different non-cleavable cross-linkers.
Project description:Background: One of the most important steps in peptide identification is estimating the false discovery rate (FDR). The most commonly used method for estimating FDR is the target-decoy search strategy (TDS). While this method is simple and effective, it is inefficient in both time and space because it searches a database twice as large as the original protein database. This inefficiency becomes more pronounced as protein databases grow. We propose a target-small decoy search strategy and present a rigorous verification that it reduces the database size and search time while retaining the accuracy of TDS. Results: We show that the peptide-spectrum matches (PSMs) obtained at 1% FDR with TDS overlap ~99% with those obtained with our method. (Given that a 1% FDR threshold is used, a 99% overlap indicates that our method is very accurate.) Moreover, our method is more time- and space-efficient than TDS: the search time is reduced to only 1/4 of that of TDS when UniProt and its 1/8 decoy database are used. Conclusions: We demonstrate that our method is almost as accurate as TDS while being more time- and space-efficient. Since the efficiency gain of our method grows with database size, it is expected to be useful for identifying peptides in proteogenomics databases constructed from inflated databases using genomic data.
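The abstract above does not state how decoy hits from a reduced decoy database are converted into an FDR estimate. One plausible reading, shown here purely as an assumption, is that decoy hits are scaled up by the inverse of the decoy-to-target size ratio before being divided by the number of target hits. The function name, parameter names, and counts below are hypothetical.

```python
def small_decoy_fdr(n_target_hits: int, n_decoy_hits: int, decoy_fraction: float) -> float:
    """Estimate FDR when the decoy database is a fraction of the target size.

    decoy_fraction is the decoy database size relative to the target
    database (for example 1/8). Scaling decoy hits by 1 / decoy_fraction
    is an assumption of this sketch, not a formula quoted from the source.
    """
    if not 0 < decoy_fraction <= 1:
        raise ValueError("decoy_fraction must be in (0, 1]")
    if n_target_hits == 0:
        return 0.0
    expected_false_targets = n_decoy_hits / decoy_fraction
    return min(1.0, expected_false_targets / n_target_hits)


# Hypothetical counts: 4000 target PSMs and 5 decoy PSMs above the threshold,
# searched against a decoy database one eighth the size of the target database.
print(small_decoy_fdr(4000, 5, 1 / 8))
```

With decoy_fraction set to 1, the same function reduces to the usual decoy-to-target ratio for an equally sized decoy database.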
Project description:Background: In the context of genomic association studies, in which a large number of statistical tests are performed simultaneously, the local False Discovery Rate (lFDR), which quantifies the evidence of a specific gene's association with a clinical or biological variable of interest, is a relevant criterion for addressing the multiple testing problem. The lFDR not only allows an inference to be made for each gene through its specific value, but also provides an estimate of Benjamini-Hochberg's False Discovery Rate (FDR) for subsets of genes. Results: In the framework of estimating procedures without any distributional assumption under the alternative hypothesis, a new and efficient procedure for estimating the lFDR is described. The results of a simulation study indicated good performance of the proposed estimator in comparison with four published ones. The five procedures were applied to real datasets. Conclusion: A novel and efficient procedure for estimating lFDR was developed and evaluated.
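The link between per-gene lFDR values and the FDR of a subset of genes can be illustrated with a short sketch: under the usual mixture interpretation, the FDR of a selected subset is estimated by averaging the lFDR values of the genes in it. The gene names, lFDR values, and cutoff below are made up for illustration, and the sketch does not reproduce the estimator proposed in the abstract; it only assumes the lFDR values have already been estimated.

```python
from typing import Dict


def select_by_lfdr(lfdr_by_gene: Dict[str, float], cutoff: float) -> Dict[str, float]:
    """Keep genes whose estimated lFDR is at or below the cutoff."""
    return {gene: value for gene, value in lfdr_by_gene.items() if value <= cutoff}


def subset_fdr(lfdr_by_gene: Dict[str, float]) -> float:
    """Estimate the FDR of a gene subset as the mean of its lFDR values."""
    if not lfdr_by_gene:
        raise ValueError("cannot estimate FDR for an empty subset")
    return sum(lfdr_by_gene.values()) / len(lfdr_by_gene)


# Hypothetical, pre-computed lFDR estimates.
lfdr = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.10, "geneD": 0.65}

selected = select_by_lfdr(lfdr, cutoff=0.2)
print(sorted(selected), round(subset_fdr(selected), 4))
```

Because the subset FDR is an average, a subset can have a low FDR overall even though an individual gene in it (here the hypothetical geneC) carries a noticeably higher lFDR than the others.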
Project description:We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
Project description:We introduce a new decoy-free framework for false discovery rate (FDR) estimation that generalizes present decoy-free approaches (DFAs) while exploiting more search data in a manner similar to target-decoy approaches (TDAs).
Project description:Metabolite identification is a crucial step in mass spectrometry (MS)-based metabolomics. However, it is still challenging to assess the confidence of assigned metabolites. We report a novel method for estimating the false discovery rate (FDR) of metabolite assignment with a target-decoy strategy, in which the decoys are generated through violating the octet rule of chemistry by adding small odd numbers of hydrogen atoms. The target-decoy strategy was integrated into JUMPm, an automated metabolite identification pipeline for large-scale MS analysis and was also evaluated with two other metabolomics tools, mzMatch and MZmine 2. The reliability of FDR calculation was examined by false data sets, which were simulated by altering MS1 or MS2 spectra. Finally, we used the JUMPm pipeline coupled to the target-decoy strategy to process unlabeled and stable-isotope-labeled metabolomic data sets. The results demonstrate that the target-decoy strategy is a simple and effective method for evaluating the confidence of high-throughput metabolite identification.
Project description:We combine two important ideas in the analysis of large-scale genomics experiments (e.g. experiments that aim to identify genes that are differentially expressed between two conditions). The first is use of Empirical Bayes (EB) methods to handle the large number of potentially-sparse effects, and estimate false discovery rates and related quantities. The second is use of factor analysis methods to deal with sources of unwanted variation such as batch effects and unmeasured confounders. We describe a simple modular fitting procedure that combines key ideas from both these lines of research. This yields new, powerful EB methods for analyzing genomics experiments that account for both sparse effects and unwanted variation. In realistic simulations, these new methods provide significant gains in power and calibration over competing methods. In real data analysis, we find that different methods, while often conceptually similar, can vary widely in their assessments of statistical significance. This highlights the need for care in both choice of methods and interpretation of results.