Dataset Information

Benchmarking Peptide Spectral Library Search Dataset

ABSTRACT: Spectral library search (SLS) is a major approach for peptide identification from tandem mass spectrometry data, complementing conventional database search. Moreover, with the emergence of spectrum prediction models, proteomics database search is progressively becoming more like spectral library search of predicted peptide spectra. The performance of peptide identification algorithms thus frequently depends on how well the underlying Spectrum-Spectrum Matching (SSM) scoring functions distinguish true and false positive matches. However, detailed comparative studies evaluating the performance of SSM scoring functions remain limited by the absence of comprehensive benchmark datasets. We propose new methods to build benchmarks that assess the effectiveness and robustness of SSM scoring functions. The resulting benchmark dataset is composed of (i) a set of 476,063 precursors used to construct 8 query spectrum sets with different levels of noise added to "ideal" and real experimental spectra, and (ii) three spectral libraries with different spectra for the same 3,065,819 precursors: experimental spectra, annotated/de-noised spectra and predicted spectra. The benchmark set was then used to evaluate 9 common spectrum preprocessing scenarios, followed by the evaluation of 3 standard SSM scoring functions (Cosine, Projected-Cosine, which is commonly used for the analysis of chimeric/mixture spectra, and Jensen-Shannon divergence) and 2 additional scoring functions used in state-of-the-art SLS tools: SpectraST and EntropyScore. The results revealed that scoring spectrum-spectrum matches is still an important open problem, with the best recall for typical SLS searches reaching only ~70% at the typical 1% error rate. Overall, SpectraST performed best for spectra with little-to-no noise, but Jensen-Shannon divergence performed better in some cases, as it was found to be the most resistant to noise. Conversely, the performance of Cosine and EntropyScore was found to be generally lower than previously reported, with Projected-Cosine performing especially poorly in most cases. However, the performance of the SSM scoring functions was also found to depend strongly on the minimum number of matching peaks required for each SSM, with benchmark results showing that both the performance and the relative ranking of the scoring functions can change substantially depending on how this parameter is set. The resulting benchmark dataset can be used to test and support the development of SSM scoring functions, and the proposed benchmark construction approach provides a foundation that can be extended to additional types of spectrum-spectrum matching.
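As an illustration of the three standard scoring functions named in the abstract, the following Python sketch scores one query spectrum against one library spectrum. It is not the implementation used to produce the benchmark. The peak-matching rule, the 0.05 m/z tolerance, the function names, and the reading of Projected-Cosine as normalizing the query over its matched peaks only are all assumptions made for this example. SpectraST and EntropyScore are not covered. The `min_matched` argument stands in for the minimum number of matching peaks that the abstract identifies as an influential parameter.

```python
import math

def match_peaks(query, library, tol=0.05):
    """Pair peaks whose m/z differ by at most tol (assumed matching rule).

    Spectra are lists of (mz, intensity) tuples. A two-pointer sweep over
    the m/z-sorted peaks pairs each peak at most once.
    """
    q, l = sorted(query), sorted(library)
    pairs, i, j = [], 0, 0
    while i < len(q) and j < len(l):
        diff = q[i][0] - l[j][0]
        if abs(diff) <= tol:
            pairs.append((i, j))
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return q, l, pairs

def cosine(query, library, tol=0.05, min_matched=1, projected=False):
    """Cosine similarity; returns None when too few peaks match.

    With projected=True the query norm uses matched peaks only, so
    unmatched query peaks (e.g. from a co-fragmented peptide) are ignored.
    """
    q, l, pairs = match_peaks(query, library, tol)
    if len(pairs) < min_matched:
        return None
    dot = sum(q[i][1] * l[j][1] for i, j in pairs)
    if projected:
        q_norm = math.sqrt(sum(q[i][1] ** 2 for i, _ in pairs))
    else:
        q_norm = math.sqrt(sum(x ** 2 for _, x in q))
    l_norm = math.sqrt(sum(x ** 2 for _, x in l))
    if q_norm == 0 or l_norm == 0:
        return 0.0
    return dot / (q_norm * l_norm)

def js_divergence(query, library, tol=0.05):
    """Jensen-Shannon divergence (base 2, so 0 = identical, 1 = disjoint)."""
    q, l, pairs = match_peaks(query, library, tol)
    q_total = sum(x for _, x in q)
    l_total = sum(x for _, x in l)
    if q_total <= 0 or l_total <= 0:
        raise ValueError("spectra need positive total intensity")
    matched_q = {i for i, _ in pairs}
    matched_l = {j for _, j in pairs}
    # Build two aligned probability vectors over the union of peaks.
    p = [q[i][1] / q_total for i, _ in pairs]
    r = [l[j][1] / l_total for _, j in pairs]
    for i, (_, x) in enumerate(q):
        if i not in matched_q:
            p.append(x / q_total)
            r.append(0.0)
    for j, (_, x) in enumerate(l):
        if j not in matched_l:
            p.append(0.0)
            r.append(x / l_total)
    m = [(a + b) / 2 for a, b in zip(p, r)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(r, m)

# Toy spectra: the query has one extra peak the library lacks.
library = [(100.0, 3.0), (200.0, 4.0)]
query = [(100.0, 3.0), (200.0, 4.0), (300.0, 12.0)]
plain_score = cosine(query, library)
projected_score = cosine(query, library, projected=True)
```

In this toy case the extra query peak lowers the plain cosine while leaving the projected variant unaffected, and raising `min_matched` above 2 rejects the match outright.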

INSTRUMENT(S): Q Exactive

ORGANISM(S): Homo sapiens (ncbitaxon:9606)

SUBMITTER: Nuno Bandeira