
Dataset Information


Proteogenomic database construction driven from large scale RNA-seq data.


ABSTRACT: The advent of inexpensive RNA-seq technologies and other deep sequencing technologies for RNA has the promise to radically improve genomic annotation, providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, the integration of large amounts of redundant RNA-seq data and mass spectrometry data poses a challenging problem. Our paper addresses this by construction of a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to 410 MB of splice graph database written in FASTA format. This corresponds to 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study using the custom data set, using a completely automated pipeline, and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse strands, and 42 translated UTRs. Our results highlight the usefulness of transcript + proteomic integration for improved genome annotations.
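The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not part of the published pipeline; the variable names are illustrative, and it assumes decimal units (1 GB = 1000 MB), which the abstract does not specify.

```python
# Figures quoted in the abstract; names here are illustrative only.
sam_size_gb = 496.2   # aligned RNA-seq SAM files
db_size_mb = 410.0    # splice graph database in FASTA format

# Assumption: decimal units (1 GB = 1000 MB). The abstract does not say
# which convention was used; binary units would give a slightly larger ratio.
compression_ratio = sam_size_gb * 1000 / db_size_mb

# Novel event categories as listed in the abstract.
novel_events = {
    "novel genes": 215,
    "novel exons": 808,
    "alternative splicings": 12,
    "gene-boundary corrections": 618,
    "exon-boundary changes": 245,
    "frame shifts": 938,
    "reverse strands": 1166,
    "translated UTRs": 42,
}
total_events = sum(novel_events.values())

print(f"compression ratio: about {compression_ratio:.0f}x")
print(f"total novel events: {total_events}")
```

Under that assumption the ratio comes out somewhat above the rounded 1000× figure, and the eight categories sum to the stated total of 4044 events.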

SUBMITTER: Woo S 

PROVIDER: S-EPMC4034692 | biostudies-literature | 2014 Jan

REPOSITORIES: biostudies-literature


Publications

Proteogenomic database construction driven from large scale RNA-seq data.

Sunghee Woo, Seong Won Cha, Gennifer Merrihew, Yupeng He, Natalie Castellana, Clark Guest, Michael MacCoss, Vineet Bafna

Journal of Proteome Research, 2013-07-17, issue 1



Similar Datasets

| S-EPMC5161273 | biostudies-literature
| S-EPMC6048772 | biostudies-literature
| S-EPMC8015854 | biostudies-literature
| S-EPMC6902034 | biostudies-literature
| S-EPMC3931556 | biostudies-literature
| S-EPMC7437817 | biostudies-literature
| S-EPMC7671411 | biostudies-literature
| S-EPMC4256132 | biostudies-literature
| S-EPMC4706714 | biostudies-literature
| S-EPMC6317475 | biostudies-literature