Dataset Information

A transformer model for de novo sequencing of data-independent acquisition mass spectrometry data

ABSTRACT: A core computational challenge in the analysis of mass spectrometry data is the de novo sequencing problem, in which the generating amino acid sequence is inferred directly from an observed fragmentation spectrum without the use of a sequence database. Recently, deep learning models have made significant advances in de novo sequencing by learning from massive datasets of high confidence labeled mass spectra. However, these methods are primarily designed for data-dependent acquisition (DDA) experiments. Over the past decade, the field of mass spectrometry has been moving toward using data-independent acquisition (DIA) protocols for the analysis of complex proteomic samples due to their superior specificity and reproducibility. Hence, we present a new de novo sequencing model called Cascadia, which uses a transformer architecture to handle the more complex data generated by DIA protocols. In comparisons with existing approaches for de novo sequencing of DIA data, Cascadia achieves improved performance across a range of instruments and experimental protocols. Additionally, we demonstrate Cascadia’s ability to accurately discover de novo coding variants and peptides from the variable region of antibodies.

ORGANISM(S): Homo sapiens, Mus musculus

SUBMITTER: Michael MacCoss

PROVIDER: PXD053291 | panorama | Fri Jun 21 00:00:00 BST 2024

REPOSITORIES: PanoramaPublic

Similar Datasets

De novo sequencing of DIA data

Project description: Testing datasets and pre-trained model for DeepNovo, a deep learning-based tool for de novo sequencing of DIA data.

2018-05-16 | MSV000082368 | MassIVE

Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics

Project description: Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performances using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set's size by adding peptides from the multienzymatic digestion with five proteases of samples from several species led to a 2-3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10000 matching and overlapping peptides across mAb samples from six different proteases, achieving 100% sequence coverage for eight of the ten polypeptide chains. We anticipate that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature, and the peptidomics field.

2023-01-16 | PXD037803 | Pride

Evaluating sequence alignment for metaproteomics applications

Project description: This dataset was utilized to assess the performance of a novel de novo metaproteomics pipeline, which performs sequence alignment of de novo sequences from complete metaproteomics experiments. Traditionally, metaproteomics data annotation relies on database searching that requires sample-specific databases derived from whole metagenome sequencing experiments. Creating these databases, however, is a complex, time-consuming, and error-prone process, which can introduce biases affecting the outcomes and conclusions, highlighting the need for alternative methods. The evaluated approach offers rapid and orthogonal insights into metaproteomics data.

2024-10-10 | PXD050548 | Pride

Soil metaproteomics data analysis based on de novo sequencing with the deep learning-based Kaiko model

Project description: We generated a protein database directly from soil metaproteomic data by identifying the microbial composition using the Kaiko model's de novo sequencing methods. We first analyzed the mass spectra de novo (without a database), identifying species from the observed peptides. We next gathered full proteomic databases for the identified species and searched the mass spec data using MS-GF+ and this custom-assembled protein sequence database.

2020-10-20 | MSV000086336 | MassIVE

Application of de novo sequencing to large-scale complex proteomics datasets

Project description: Dependent on concise, pre-defined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large-scale proteomics datasets, and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) which leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.

2016-01-12 | PXD003317 | Pride

Prochlorococcus marinus MED4 shotgun proteome dataset from “Protein cycling in the eastern tropical North Pacific oxygen deficient zone: a de novo-discovery peptidomic approach”

Project description: For this manuscript, the Prochlorococcus MED4 strain shotgun proteome dataset was used for benchmarking a de novo-directed sequencing approach. In de novo peptide sequencing, the sequence of amino acids is determined directly from mass spectra rather than by comparison (or peptide spectrum matching) to a selected database. We perform a benchmarking experiment using Prochlorococcus culture data, demonstrating that de novo peptides are sufficiently accurate and taxonomically specific to be useful in environmental studies. The MED4 dataset herein represents the output from peptide spectrum matching using COMET within the Trans-Proteomic Pipeline (TPP). Additional MED4 data outside this manuscript are included for both trypsin and Glu-C protease digestions as well as TPP output for post-translational modification searches. De novo output data derived from Peaks Studio can be found by referencing the manuscript publication.

2022-01-06 | PXD027589 | Pride

De novo nine-species benchmark peptide identifications for Casanovo and other methods

Project description: Predicted peptides for the 9-species de novo sequencing benchmark MSV000090982 as described in Yilmaz et al. [Yilmaz2023]. The FTP directory contains outputs of 5 de novo peptide sequencing methods on the 9-species benchmark: Casanovo, Casanovo_bm (benchmark), PointNovo, DeepNovo and Novor. [Yilmaz2023] M. Yilmaz*, W. Fondrie*, W. Bittremieux*, R. Nelson, V. Ananth, S. Oh, and W. Noble, "Sequence-to-sequence translation from mass spectra to peptides with a transformer model", bioRxiv, 2023.

2024-02-01 | MSV000093979 | MassIVE

Upload for 2012 MCP manuscript - "Shotgun protein sequencing with meta-contig assembly."

Project description: Shotgun protein sequencing with meta-contig assembly. Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal-to-noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low- and high-resolution CID and high-resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.

2014-01-16 | MSV000078530 | MassIVE

SRA of transcriptome of Hevea brasiliensis

Project description: We report, for the first time, the use of next-generation massively parallel sequencing technologies and de novo transcriptome assembly to gain insight into the wide range of the Hevea brasiliensis transcriptome. Sequencing generated more than 12 million reads with an average length of 90 nt. In total, 48,768 unigenes (mean size = 488 bp) were assembled through de novo transcriptome assembly, more than three times the number of Hevea brasiliensis sequences deposited in GenBank. Assembled sequences were annotated with gene descriptions, gene ontology and clusters of orthologous group terms. A total of 37,373 unigenes were successfully annotated and more than 10% of unigenes were aligned to known proteins of Euphorbiaceae. The unigenes contain a nearly complete collection of known rubber-synthesis-related genes. Our data provide the most comprehensive sequence resource available for studying the rubber tree and demonstrate the feasibility of Illumina sequencing and de novo transcriptome assembly in a species lacking genome information. The dataset covers the transcriptome of latex and leaf in Hevea brasiliensis.

2011-09-01 | E-GEOD-26514 | biostudies-arrayexpress

RNA-seq of 36 individuals with autism spectrum disorder

Project description: To assess the clinical impact of splice-altering noncoding mutations in autism spectrum disorder (ASD), we used a deep learning framework (SpliceAI) to predict the splice-altering potential of de novo mutations in 3,953 individuals with ASD from the Simons Simplex Collection. To validate these predictions, we selected 36 individuals who harbored predicted de novo cryptic splice mutations; each individual represented the only case of autism within their immediate family. We obtained peripheral blood-derived lymphoblastoid cell lines (LCLs) and performed high-depth mRNA sequencing (approximately 350 million 150 bp single-end reads per sample). We used OLego to align the reads against a reference created from hg19 by substituting de novo variants of each individual with the corresponding alternate allele.

2018-11-25 | E-MTAB-7351 | biostudies-arrayexpress
