Dataset Information

Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues.

ABSTRACT: We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes resulting in 707,311 clusters of one or more sequences of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation we chose four essential proteins, the chaperonin GroEL, DNA dependent RNA polymerase subunits beta and beta' (RpoB/RpoB'), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their cluster distribution. We found these proteins to be remarkably conserved with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB' were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB' proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB' were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
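The universal-representation check described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes cluster membership is available as hypothetical `(genome_id, protein_id, cluster_id)` records, and all identifiers in the toy data are invented for illustration.

```python
from collections import defaultdict


def genomes_per_cluster(assignments):
    """Map each cluster ID to the set of genomes contributing at least one sequence."""
    coverage = defaultdict(set)
    for genome_id, _protein_id, cluster_id in assignments:
        coverage[cluster_id].add(genome_id)
    return dict(coverage)


def universal_clusters(assignments):
    """Return the sorted IDs of clusters that contain a sequence from every genome."""
    all_genomes = {genome_id for genome_id, _, _ in assignments}
    coverage = genomes_per_cluster(assignments)
    return sorted(c for c, genomes in coverage.items() if genomes == all_genomes)


def missing_genomes(assignments, cluster_id):
    """Return the sorted genomes that have no sequence in the given cluster."""
    all_genomes = {genome_id for genome_id, _, _ in assignments}
    covered = genomes_per_cluster(assignments).get(cluster_id, set())
    return sorted(all_genomes - covered)


# Toy data (hypothetical identifiers): three genomes, three clusters.
assignments = [
    ("genomeA", "groEL_A", "c1"),
    ("genomeB", "groEL_B", "c1"),
    ("genomeC", "groEL_C", "c1"),
    ("genomeA", "polA_A", "c2"),
    ("genomeB", "polA_B", "c2"),
    ("genomeC", "polA_C", "c3"),  # divergent sequence forming its own cluster
]

print(universal_clusters(assignments))     # clusters covering every genome
print(missing_genomes(assignments, "c2"))  # genomes absent from cluster c2
```

In this toy input, a divergent sequence that lands in its own cluster (`c3`) stops `c2` from being universal. This mirrors the situation the abstract reports for some RpoB, RpoB' and PolA sequences.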

SUBMITTER: Lockwood S

PROVIDER: S-EPMC6403173 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature


Publications

Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues.

Svetlana Lockwood, Kelly A. Brayton, Jeff A. Daily, Shira L. Broschat

Frontiers in Microbiology, 2019-02-28


PMID: 30873148

Similar Datasets

Project description: We analyzed several features of five currently available delta-proteobacterial genomes, including two aerobic bacteria exhibiting predatory behavior and three anaerobic sulfate-reducing bacteria. The delta genomes are distinguished from other bacteria by several properties: (i) the delta genomes contain two "giant" S1 ribosomal protein genes in contrast to all other bacterial types, which encode a single or no S1; (ii) in most delta-proteobacterial genomes the major ribosomal protein (RP) gene cluster is near the replication terminus whereas most bacterial genomes place the major RP cluster near the origin of replication; (iii) the delta genomes possess the rare combination of discriminating asparaginyl and glutaminyl tRNA synthetase (AARS) together with the amido-transferase complex (Gat CAB) genes that modify Asp-tRNA(Asn) into Asn-tRNA(Asn) and Glu-tRNA(Gln) into Gln-tRNA(Gln); (iv) the TonB receptors and ferric siderophore receptors that facilitate uptake and removal of complex metals are common among delta genomes; (v) the anaerobic delta genomes encode multiple copies of the anaerobic detoxification protein rubrerythrin that can neutralize hydrogen peroxide; (vi) sigma(54) activators play a more important role in the delta genomes than in other bacteria, and the delta genomes have a plethora of enhancer binding proteins that respond to environmental and intracellular cues, often as part of two-component systems; (vii) delta genomes encode multiple copies of metallo-beta-lactamase enzymes; (viii) a host of secretion proteins emphasizing SecA, SecB, and SecY may be especially useful in the predatory activities of Myxococcus xanthus; (ix) delta proteobacteria drive many multiprotein machines in their periplasms and outer membrane, including chaperone-feeding machines, jets for slime secretion, and type IV pili. Bdellovibrio replicates in the periplasm of prey cells.
The sulfate-reducing delta proteobacteria metabolize hydrogen and generate a proton gradient by electron transport. The predicted highly expressed genes from delta genomes reflect their different ecologies, metabolic strategies, and adaptations.

