Project description:Proteogenomics approaches often struggle to distinguish correct from incorrect peptide-to-spectrum matches as the database size grows. However, features extracted from tandem mass spectrometry intensity predictors can enhance the peptide identification rate and provide extra confidence for spectral matching in a proteogenomic context. To that end, features from the spectral intensity pattern predictors MS2PIP and Prosit were combined with the canonical scores from MaxQuant in the Percolator post-processing tool for protein databases constructed from RNA-seq and ribosome profiling analyses. The presented results provide evidence that this approach enhances peptide identification power in a proteogenomic setting and, at the same time, leads to the validation of new proteoforms with elevated stringency. In this online repository, we submitted the conventional proteomic search results with MaxQuant against the custom nanopore RNA-seq-based search space. All other results can be found in the supplemental materials of the manuscript, in SRA (sequencing data), or under ProteomeXchange project PXD011353 (as this is original data from a previous paper).
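As an illustration of what an intensity-based feature can look like, the sketch below compares observed fragment intensities with predicted ones using two common similarity measures, the Pearson correlation and the normalized spectral angle. It is a minimal sketch under stated assumptions: the function names and the input layout (two equal-length lists of matched fragment intensities) are hypothetical, it does not call MS2PIP, Prosit, MaxQuant, or Percolator, and it is not claimed to reproduce the feature set used in the project.

```python
import math


def pearson(observed, predicted):
    """Pearson correlation between two equal-length intensity lists."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # correlation is undefined for constant vectors; use 0 here
    return cov / math.sqrt(var_o * var_p)


def spectral_angle(observed, predicted):
    """Normalized spectral angle: 1 = identical direction, 0 = orthogonal."""
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0  # no signal in one of the vectors
    cosine = sum(o * p for o, p in zip(observed, predicted)) / (norm_o * norm_p)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def intensity_features(observed, predicted):
    """Hypothetical per-match feature dictionary for a rescoring tool."""
    return {
        "pearson": pearson(observed, predicted),
        "spectral_angle": spectral_angle(observed, predicted),
    }


if __name__ == "__main__":
    # Made-up intensities for matched fragment ions, for illustration only.
    print(intensity_features([0.1, 0.5, 1.0, 0.0], [0.2, 0.4, 1.0, 0.1]))
```

In a rescoring setup, values like these would be added as extra columns next to the search engine scores for each peptide-to-spectrum match, so that the post-processor can weigh them together.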
Project description:Predictors built from gene expression data accurately predict ER, PR, and HER2 status, and divide tumor grade into high-grade and low-grade clusters; intermediate-grade tumors are not a unique group. In contrast, gene expression data cannot be used to predict tumor size or lymphatic-vascular invasion. Experiment Overall Design: Microarray data from the tumors of 129 patients were analyzed for the ability to predict biomarkers (ER, PR, HER2), histologic features (grade and lymphatic-vascular invasion), and stage-related information (tumor size and lymph node metastasis). Multiple statistical predictors were used and the prediction accuracy determined by error rates of prediction and by dimensional scaling and visualization of the states under study. Models to predict lymph node metastasis were built by combinations of molecular, histologic and anatomic features.
Project description:Proteogenomics methods have identified many non-annotated protein-coding genes in the human genome. Many of the newly discovered protein-coding genes encode peptides and small proteins, referred to collectively as microproteins. Microproteins are produced through ribosome translation of small open reading frames (smORFs). The discovery of many smORFs reveals a blind spot in traditional gene-finding algorithms for these genes. Biological studies have found roles for microproteins in cell biology and physiology, and the possibility that additional bioactive microproteins exist drives interest in the detection and discovery of these molecules. A key step in any proteogenomics workflow is the assembly of RNA-Seq data into likely mRNA transcripts that are then used to create a searchable protein database. Here we demonstrate that specific features of the assembled transcriptome impact microprotein detection by shotgun proteomics. By tailoring transcript assembly for downstream mass spectrometry searching, we show that we can detect more than double the number of high-quality microprotein candidates and introduce a novel open-source mRNA assembler for proteogenomics (MAPS) that incorporates all of these features. By integrating our specialized assembler, MAPS, and a popular generalized assembler into our proteogenomics pipeline, we detect 45 novel human microproteins from a high-quality proteogenomics dataset of a human cell line. We then characterize the features of the novel microproteins, identifying two classes of microproteins. Our work highlights the importance of specialized transcriptome assembly upstream of proteomics validation when searching for short and potentially rare and poorly conserved proteins.
Project description:Proteogenomics is an emerging research field that still lacks a standard method of analysis. In this article, we demonstrate the strength of proteogenomic analysis specific for N-terminal data that aims at the discovery of novel translational start sites. In summary, unidentified spectra were matched to a specific N-terminal peptide library encompassing all theoretical protein N-termini encoded in the genome. Gene prediction suggested 81 protein-coding models, several of which are alternative proteoforms with unannotated protein starts. In addition to the proteomic data, complementary ribosome footprinting data were generated from Arabidopsis thaliana cell cultures. Translation initiation site mapping by the ribosome footprinting data provided orthogonal evidence for 14 novel peptides identified by our proteogenomics pipeline.
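To make the idea of an N-terminal peptide library concrete, the sketch below derives candidate N-terminal peptides from a single translated sequence. It is only an illustration under explicit assumptions that are not stated in the description above: cleavage after arginine, optional removal of the initiator methionine, and arbitrary length limits. The function name and parameters are hypothetical, and the rule ignores refinements such as proline effects or missed cleavages.

```python
def n_terminal_peptides(protein, cleave_after="R", min_len=7, max_len=30):
    """Return candidate N-terminal peptides for one protein sequence.

    Assumptions (hypothetical, for illustration): the protease cleaves after
    any residue in `cleave_after`, and the initiator methionine may or may
    not have been removed.
    """
    variants = [protein]
    if protein.startswith("M") and len(protein) > 1:
        variants.append(protein[1:])  # variant without initiator methionine

    peptides = []
    for seq in variants:
        # Position just after the first cleavage residue, or the full length
        # when the sequence contains no cleavage residue.
        cut = next(
            (i + 1 for i, aa in enumerate(seq) if aa in cleave_after),
            len(seq),
        )
        peptide = seq[:cut]
        if min_len <= len(peptide) <= max_len:
            peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(n_terminal_peptides("MASTEQLKRGGG"))
```

Applying such a function to every theoretical start position in a translated genome would yield a library of the kind described, against which unidentified spectra could then be searched.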
Project description:Here, we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, which incorporates the genomic features of experimental shotgun proteomics data. This version of the database was created from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for analyzing tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the protein, cDNA, transcript and genome databases from Michigan State University, and were mapped to the rice genome. Approximately 3200 genes were covered by these peptides and 40 of them contained novel genomic features. Users can search, download or navigate the database per chromosome, gene, protein, cDNA or transcript and download the updated annotations in standard GFF3 format, with visualization in PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the rice proteome, providing peptide-based expression profiles, together with the corresponding genomic origin, including the annotation of novelty for each peptide.
Project description:We describe the use of the Galaxy framework to facilitate complete proteogenomic analysis for a representative salivary dataset. We demonstrate how Galaxy’s many features make it a unique and ideal solution for proteogenomic analysis. We highlight Galaxy’s flexibility by creating a modular workflow incorporating both established and customized software and processing steps that improve the depth and quality of proteogenomic results. We demonstrate Galaxy’s accessibility via the easy sharing of complete, and even complex (approximately 140 steps), proteogenomic workflows, which can be used and customized by others via a public instance of the framework (usegalaxyp.org). Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
Project description:Detection of species-specific proteotypic peptides for accurate and easy characterization of infectious non-tuberculous mycobacteria such as Mycobacterium kansasii is essential. Therefore, we carried out an in-depth global proteomic experiment using the M. kansasii ATCC 12478 strain, followed by proteome database search and spectral library generation. The lysate was subjected to in-solution proteomic sample preparation and fractionated using an offline C18 StageTip. Each fraction was acquired in technical triplicates using a 180 min data-dependent acquisition (DDA) method on an Orbitrap Fusion Tribrid (Thermo Scientific) mass spectrometer. The resulting raw DDA data were searched against the M. kansasii proteome database using Proteome Discoverer and FragPipe. The resulting peptide spectrum matches were converted into a spectral library using BiblioSpec.
Project description:We analyzed the gene expression profile during the first 90 minutes of embryonic wound healing and compared it with embryos with a chronic problem with the production of nitric oxide (NO) (embryos originating from oocytes that were injected with morpholinos against nos1 and nos3) and embryos with an acute problem with the production of NO (embryos incubated in a solution containing TRIM).
Project description:Detection of species-specific proteotypic peptides for accurate and easy characterization of infectious non-tuberculous mycobacteria such as Mycobacterium intracellulare is essential. Therefore, we carried out an in-depth global proteomic experiment using the M. intracellulare ATCC 13950 strain, followed by proteome database search and spectral library generation. The lysate was subjected to in-solution proteomic sample preparation and fractionated using an offline C18 StageTip. Each fraction was acquired in technical triplicates using a 180 min data-dependent acquisition (DDA) method on an Orbitrap Fusion Tribrid (Thermo Scientific) mass spectrometer. The resulting raw DDA data were searched against the M. intracellulare proteome database using Proteome Discoverer and FragPipe. The resulting peptide spectrum matches were converted into a spectral library using BiblioSpec.
Project description:Detection of species-specific proteotypic peptides for accurate and easy characterization of infectious non-tuberculous mycobacteria such as Mycobacterium fortuitum is essential. Therefore, we carried out an in-depth global proteomic experiment using the M. fortuitum ATCC 6841 strain, followed by proteome database search and spectral library generation. The lysate was subjected to in-solution proteomic sample preparation and fractionated using an offline C18 StageTip. Each fraction was acquired in technical triplicates using a 180 min data-dependent acquisition (DDA) method on an Orbitrap Fusion Tribrid (Thermo Scientific) mass spectrometer. The resulting raw DDA data were searched against the M. fortuitum proteome database using Proteome Discoverer and FragPipe. The resulting peptide spectrum matches were converted into a spectral library using BiblioSpec.