Dataset Information

Template proteogenomics: sequencing whole proteins using an imperfect database.

ABSTRACT: Database search algorithms are the primary workhorses for the identification of tandem mass spectra. However, these methods are limited to the identification of spectra for which peptides are present in the database, preventing the identification of peptides from mutated or alternatively spliced sequences. A variety of methods has been developed to search a spectrum against a sequence allowing for variations. Some tools determine the sequence of the homologous protein in the related species but do not report the peptide in the target organism. Other tools consider variations, including modifications and mutations, in reconstructing the target sequence. However, these tools will not work if the template (homologous peptide) is missing in the database, and they do not attempt to reconstruct the entire protein target sequence. De novo identification of peptide sequences is another possibility, because it does not require a protein database. However, the lack of a database reduces accuracy. We present a novel proteogenomic approach, GenoMS, that draws on the strengths of database and de novo peptide identification methods. Protein sequence templates (i.e. proteins or genomic sequences that are similar to the target protein) are identified using the database search tool InsPecT. The templates are then used to recruit, align, and de novo sequence regions of the target protein that have diverged from the database or are missing. We used GenoMS to reconstruct the full sequence of an antibody by using spectra acquired from multiple digests using different proteases. Antibodies are a prime example of proteins that confound standard database identification techniques. The mature antibody genes result from large-scale genome rearrangements with flexible fusion boundaries and somatic hypermutation. Using GenoMS, we automatically reconstruct the complete sequences of two immunoglobulin chains with accuracy greater than 98% using a diverged protein database. Using the genome as the template, we achieve accuracy exceeding 97%.

SUBMITTER: Castellana NE

PROVIDER: S-EPMC2877985 | biostudies-literature | 2010 Jun

REPOSITORIES: biostudies-literature

Publications

Template proteogenomics: sequencing whole proteins using an imperfect database.

Natalie E. Castellana, Victoria Pham, David Arnott, Jennie R. Lill, Vineet Bafna

Molecular & Cellular Proteomics (MCP), 2010-02-17, issue 6

PMID: 20164058

Similar Datasets

Project description: The three-dimensional structures of macromolecules and their complexes are mainly elucidated by X-ray protein crystallography. A major limitation of this method is access to high-quality crystals, which is necessary to ensure X-ray diffraction extends to sufficiently large scattering angles and hence yields information of sufficiently high resolution with which to solve the crystal structure. The observation that crystals with reduced unit-cell volumes and tighter macromolecular packing often produce higher-resolution Bragg peaks suggests that crystallographic resolution for some macromolecules may be limited not by their heterogeneity, but by a deviation of strict positional ordering of the crystalline lattice. Such displacements of molecules from the ideal lattice give rise to a continuous diffraction pattern that is equal to the incoherent sum of diffraction from rigid individual molecular complexes aligned along several discrete crystallographic orientations and that, consequently, contains more information than Bragg peaks alone. Although such continuous diffraction patterns have long been observed (and are of interest as a source of information about the dynamics of proteins), they have not been used for structure determination. Here we show for crystals of the integral membrane protein complex photosystem II that lattice disorder increases the information content and the resolution of the diffraction pattern well beyond the 4.5-ångström limit of measurable Bragg peaks, which allows us to phase the pattern directly. Using the molecular envelope conventionally determined at 4.5 ångströms as a constraint, we obtain a static image of the photosystem II dimer at a resolution of 3.5 ångströms. This result shows that continuous diffraction can be used to overcome what have long been supposed to be the resolution limits of macromolecular crystallography, using a method that exploits commonly encountered imperfect crystals and enables model-free phasing.
