Dataset Information

Protein identification using customized protein sequence databases derived from RNA-Seq data.

ABSTRACT: The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In this study, we have developed a two-step strategy for building sample-specific protein databases from RNA-Seq data. First, the database size is reduced by eliminating unexpressed or lowly expressed genes according to transcript quantification. Second, high-quality nonsynonymous coding single nucleotide variations (SNVs) are identified based on RNA-Seq data, and corresponding protein variants are added to the database. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines SW480 and RKO, we demonstrated that customized protein sequence databases could significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants. Thus, sample-specific databases from RNA-Seq data can enable more sensitive and comprehensive protein discovery in shotgun proteomics studies.

SUBMITTER: Wang X

PROVIDER: S-EPMC3727138 | biostudies-literature | 2012 Feb

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Protein identification using customized protein sequence databases derived from RNA-Seq data.

Wang Xiaojing X Slebos Robbert J C RJ Wang Dong D Halvey Patrick J PJ Tabb David L DL Liebler Daniel C DC Zhang Bing B

Journal of proteome research 20111214 2

The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In ...[more]

PMID: 22103967

Similar Datasets

Project description:BackgroundCell lines form the cornerstone of cell-based experimentation studies into understanding the underlying mechanisms of normal and disease biology including cancer. However, it is commonly acknowledged that contamination of cell lines is a prevalent problem affecting biomedical science and available methods for cell line authentication suffer from limited access as well as being too daunting and time-consuming for many researchers. Therefore, a new and cost effective approach for authentication and quality control of cell lines is needed.ResultsWe have developed a new RNA-seq based approach named CeL-ID for cell line authentication. CeL-ID uses RNA-seq data to identify variants and compare with variant profiles of other cell lines. RNA-seq data for 934 CCLE cell lines downloaded from NCI GDC were used to generate cell line specific variant profiles and pair-wise correlations were calculated using frequencies and depth of coverage values of all the variants. Comparative analysis of variant profiles revealed that variant profiles differ significantly from cell line to cell line whereas identical, synonymous and derivative cell lines share high variant identity and are highly correlated (ρ > 0.9). Our benchmarking studies revealed that CeL-ID method can identify a cell line with high accuracy and can be a valuable tool of cell line authentication in biomedical science. Finally, CeL-ID estimates the possible cross contamination using linear mixture model if no perfect match was detected.ConclusionsIn this study, we show the utility of an RNA-seq based approach for cell line authentication. Our comparative analysis of variant profiles derived from RNA-seq data revealed that variant profiles of each cell line are distinct and overall share low variant identity with other cell lines whereas identical or synonymous cell lines show significantly high variant identity and hence variant profiles can be used as a discriminatory/identifying feature in cell authentication model.

Dataset Information

Protein identification using customized protein sequence databases derived from RNA-Seq data.

Publications

Protein identification using customized protein sequence databases derived from RNA-Seq data.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets