Dataset Information

mzDB: a file format using multiple indexing strategies for the efficient analysis of large LC-MS/MS and SWATH-MS data sets.

ABSTRACT: The analysis and management of MS data, especially those generated by data-independent MS acquisition, exemplified by SWATH-MS, pose significant challenges for proteomics bioinformatics. The large size and vast amount of information inherent to these data sets need to be properly structured to enable efficient and straightforward extraction of the signals used to identify specific target peptides. Standard XML-based formats are not well suited to large MS data files, for example, those generated by SWATH-MS, and compromise high-throughput data processing and storage. We developed mzDB, an efficient file format for large MS data sets. It relies on the SQLite software library and consists of a standardized and portable server-less single-file database. An optimized 3D indexing approach is adopted, where the LC-MS coordinates (retention time and m/z), along with the precursor m/z for SWATH-MS data, are used to query the database for data extraction. In comparison with XML formats, mzDB saves ∼25% of storage space and improves access times by a factor of 2 up to 2000, depending on the type of data access. Similarly, mzDB also shows slightly to significantly lower access times in comparison with other formats such as mz5. Both C++ and Java implementations, converting raw or XML formats to mzDB and providing access methods, will be released under a permissive license. mzDB can be easily accessed with the SQLite C library and its drivers for all major languages, and browsed with existing dedicated GUIs. The mzDB format described here can boost existing mass spectrometry data analysis pipelines, offering unprecedented performance in terms of efficiency, portability, compactness, and flexibility.
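The querying idea in the abstract (select stored signal by retention time, m/z, and precursor m/z ranges in a single-file SQLite database) can be illustrated with a minimal sketch. This is not the actual mzDB schema or API: the table `signal_box`, its columns, the sample rows, and the helper `query_boxes` are all hypothetical, and an ordinary B-tree index stands in for whatever indexing structures mzDB really uses.

```python
import sqlite3

# Hypothetical schema: NOT the real mzDB layout. It only illustrates
# querying a server-less SQLite database by three LC-MS coordinates.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE signal_box (
        id INTEGER PRIMARY KEY,
        min_rt REAL, max_rt REAL,                -- retention time range (s)
        min_mz REAL, max_mz REAL,                -- m/z range
        min_parent_mz REAL, max_parent_mz REAL,  -- precursor isolation window
        payload BLOB                             -- placeholder for encoded peaks
    )
    """
)
# Plain composite index as a stand-in for a dedicated multi-dimensional index.
conn.execute(
    "CREATE INDEX idx_box ON signal_box (min_parent_mz, min_mz, min_rt)"
)

# Made-up example rows, one per stored block of signal.
boxes = [
    (1, 0.0, 60.0, 400.0, 405.0, 400.0, 425.0, b"a"),
    (2, 0.0, 60.0, 405.0, 410.0, 400.0, 425.0, b"b"),
    (3, 60.0, 120.0, 400.0, 405.0, 400.0, 425.0, b"c"),
    (4, 0.0, 60.0, 400.0, 405.0, 425.0, 450.0, b"d"),
]
conn.executemany("INSERT INTO signal_box VALUES (?,?,?,?,?,?,?,?)", boxes)


def query_boxes(conn, rt, mz, parent_mz):
    """Return ids of boxes whose three ranges all overlap the query ranges."""
    (rt_lo, rt_hi), (mz_lo, mz_hi), (p_lo, p_hi) = rt, mz, parent_mz
    rows = conn.execute(
        """
        SELECT id FROM signal_box
        WHERE max_rt >= ? AND min_rt <= ?
          AND max_mz >= ? AND min_mz <= ?
          AND max_parent_mz >= ? AND min_parent_mz <= ?
        ORDER BY id
        """,
        (rt_lo, rt_hi, mz_lo, mz_hi, p_lo, p_hi),
    ).fetchall()
    return [row[0] for row in rows]


# Signal around m/z 401-402, eluting between 10 and 50 s, acquired in the
# window that isolates precursors near m/z 410-412.
hits = query_boxes(conn, rt=(10.0, 50.0), mz=(401.0, 402.0), parent_mz=(410.0, 412.0))
print(hits)
```

Each condition in the `WHERE` clause is a standard interval-overlap test, so a block is returned only when it overlaps the query on all three axes at once. A format built for this access pattern would presumably replace the plain index with a structure designed for multi-dimensional range queries.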

SUBMITTER: Bouyssie D

PROVIDER: S-EPMC4349994 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description:Despite immense interest in the proteome as a source of biomarkers in cancer, mass spectrometry has yet to yield a clinically useful protein biomarker for tumor classification. To explore the potential of a particular class of mass spectrometry-based quantitation approaches, label-free alignment of liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) data sets, for the identification of biomarkers for acute leukemias, we asked whether a label-free alignment algorithm could distinguish known classes of leukemias on the basis of their proteomes. This approach to quantitation involves (1) computational alignment of MS1 peptide peaks across large numbers of samples; (2) measurement of the relative abundance of peptides across samples by integrating the area under the curve of the MS1 peaks; and (3) assignment of peptide IDs to those quantified peptide peaks on the basis of the corresponding MS2 spectra. We extracted proteins from blasts derived from four patients with acute myeloid leukemia (AML, acute leukemia of myeloid lineage) and five patients with acute lymphoid leukemia (ALL, acute leukemia of lymphoid lineage). Mobilized CD34+ cells purified from peripheral blood of six healthy donors and mononuclear cells (MNC) from the peripheral blood of two healthy donors were used as healthy controls. Proteins were analyzed by LC-MS/MS and quantified with a label-free alignment-based algorithm developed in our laboratory. Unsupervised hierarchical clustering of blinded samples separated the samples according to their known biological characteristics, with each sample group forming a discrete cluster. The four proteins best able to distinguish CD34+, AML, and ALL were all either known biomarkers or proteins whose biological functions are consistent with their ability to distinguish these classes. 
We conclude that alignment-based label-free quantitation of LC-MS/MS data sets can, at least in some cases, robustly distinguish known classes of leukemias, thus opening the possibility that large scale studies using such algorithms can lead to the identification of clinically useful biomarkers.
