Dataset Information

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics.

ABSTRACT: Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
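The in silico ORF step mentioned in the abstract (a six-frame translation that considers alternative start codons) can be illustrated with a minimal Python sketch. This is not the authors' released software. The start codon set (ATG, GTG, TTG), the `min_aa` length cutoff, the rule of keeping only the longest ORF per stop codon, and the function names are all assumptions made for illustration.

```python
# Illustrative sketch only; not the iPtgxDB software described in the abstract.
# Standard genetic code, built from the conventional TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_orf(dna):
    """Translate an ORF; the initiator codon is always read as methionine."""
    codons = [dna[i:i + 3] for i in range(3, len(dna) - 2, 3)]
    return "M" + "".join(CODON_TABLE.get(c, "X") for c in codons)


def find_orfs(seq, starts=("ATG", "GTG", "TTG"), min_aa=10):
    """List (strand, frame, start, end, protein) for ORFs in all six frames.

    Assumptions: `starts` is a hypothetical set of alternative start codons,
    only the first in-frame start before each stop is kept (longest ORF),
    and start/end are 0-based offsets on the scanned strand, stop excluded.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if CODON_TABLE.get(codon, "X") == "*":
                    # A stop codon closes any ORF that is currently open.
                    if orf_start is not None:
                        protein = translate_orf(s[orf_start:i])
                        if len(protein) >= min_aa:
                            results.append((strand, frame, orf_start, i, protein))
                    orf_start = None
                elif orf_start is None and codon in starts:
                    orf_start = i
    return results


print(find_orfs("ATGGCTGCTGCTTAA", min_aa=2))  # [('+', 0, 0, 12, 'MAAA')]
```

ORFs still open at the end of the sequence are dropped in this sketch, and circular genomes are not handled; a real pipeline would need to address both.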

SUBMITTER: Omasits U

PROVIDER: S-EPMC5741054 | biostudies-literature | 2017 Dec

REPOSITORIES: biostudies-literature


Publications

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics.

Omasits U, Varadarajan AR, Schmid M, Goetze S, Melidis D, Bourqui M, Nikolayeva O, Québatte M, Patrignani A, Dehio C, Frey JE, Robinson MD, Wollscheid B, Ahrens CH

Genome Research, 2017 Nov 15, issue 12


PMID: 29141959

Similar Datasets

Project description: Fusobacterium spp. are Gram-negative, anaerobic, opportunistic pathogens involved in multiple diseases, including a link between the oral pathogen Fusobacterium nucleatum and the progression and severity of colorectal cancer. The identification and characterization of virulence factors in the genus Fusobacterium has been greatly hindered by a lack of properly assembled and annotated genomes. Using newly completed genomes from nine strains and seven species of Fusobacterium, we report the identification and corrected annotation of verified and potential virulence factors from the type 5 secreted autotransporter, FadA, and MORN2 protein families, with a focus on the genetically tractable strain F. nucleatum subsp. nucleatum ATCC 23726 and type strain F. nucleatum subsp. nucleatum ATCC 25586. Within the autotransporters, we used sequence similarity networks to identify protein subsets and show a clear differentiation between the prediction of outer membrane adhesins, serine proteases, and proteins with unknown function. These data have identified unique subsets of type 5a autotransporters, which are key proteins associated with virulence in F. nucleatum. However, we coupled our bioinformatic data with bacterial binding assays to show that a predicted weakly invasive strain of F. necrophorum that lacks a Fap2 autotransporter adhesin strongly binds human colonocytes. These analyses confirm a gap in our understanding of how autotransporters, MORN2 domain proteins, and FadA adhesins contribute to host interactions and invasion. In summary, we identify candidate virulence genes in Fusobacterium, and caution that experimental validation of host-microbe interactions should complement bioinformatic predictions to increase our understanding of virulence protein contributions in Fusobacterium infections and disease.

IMPORTANCE: Fusobacterium spp. are emerging pathogens that contribute to mammalian and human diseases, including colorectal cancer. Despite a validated connection with disease, few proteins have been characterized that define a direct molecular mechanism for Fusobacterium pathogenesis. We report a comprehensive examination of virulence-associated protein families in multiple Fusobacterium species and show that complete genomes facilitate the correction and identification of multiple, large type 5a secreted autotransporter genes in previously misannotated or fragmented genomes. In addition, we use protein sequence similarity networks and human cell interaction experiments to show that previously predicted noninvasive strains can indeed bind to and potentially invade human cells and that this could be due to the expansion of specific virulence proteins that drive Fusobacterium infections and disease.
