Dataset Information

Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data.

ABSTRACT: Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular subtyping of cancers, understanding cancer progression, and the discovery of novel biomarkers. Advances in genomics technologies (whole-genome, exome, and transcript sequencing, collectively referred to as NGS (next-generation sequencing)) have fueled recent studies on somatic mutation discovery. However, the vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome translated portion of aberrant genes using only genomic approaches. Combinations of proteomic and genomic technologies are increasingly being employed. Various strategies have been employed to allow the use of large-scale NGS data for conventional MS/MS searches. This paper discusses different strategies for large database search and FDR (false discovery rate)-based error control, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any MS sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database that contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) was searched against the database. By applying the most conservative FDR measure, we identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and nonsample-recruited mutations, which emphasize the strength of our approach.

SUBMITTER: Woo S

PROVIDER: S-EPMC4256132 | biostudies-literature | 2014 Dec

REPOSITORIES: biostudies-literature

Publications

Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data.

Sunghee Woo, Seong Won Cha, Seungjin Na, Clark Guest, Tao Liu, Richard D. Smith, Karin D. Rodland, Samuel Payne, Vineet Bafna

Proteomics, 2014-11-17, issue 23-24

PMID: 25263569

Similar Datasets

Identification of pharmacogenetic variants from large scale next generation sequencing data in the Saudi population.
| S-EPMC8797234 | biostudies-literature

Challenges in the Setup of Large-scale Next-Generation Sequencing Analysis Workflows.
| S-EPMC5683667 | biostudies-literature

In vivo mRNA display enables large-scale proteomics by next generation sequencing.
| S-EPMC7604504 | biostudies-literature

GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets.
| S-EPMC7820848 | biostudies-literature

Enabling large-scale next-generation sequence assembly with Blacklight.
| S-EPMC4185199 | biostudies-other

Next Generation Protein Sequencing
2017-04-03 | PXD003804 | Pride

Next-generation sequencing and recombinant expression characterized aberrant splicing mechanisms and provided correction strategies in factor VII deficiency.
| S-EPMC7049351 | biostudies-literature

PAPNC, a novel method to calculate nucleotide diversity from large scale next generation sequencing data.
| S-EPMC4104926 | biostudies-literature

Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies.
| S-EPMC4580299 | biostudies-literature

Large-scale MHC class II genotyping of a wild lemur population by next generation sequencing.
| S-EPMC3496554 | biostudies-literature