Project description:Protein reference databases are a critical part of producing efficient proteomic analyses. However, methods for constructing clean, efficient, and comprehensive protein reference databases are lacking. Existing methods either have no contamination control procedures or rely on three-frame and/or six-frame translation, which sharply increases the search space and harms MS results. Herein we propose a framework for constructing a customized comprehensive proteomic reference database (CCPRD) from draft genomes and deep sequencing transcriptomes. Its effectiveness is demonstrated using the proteomes of nematocysts from endoparasitic cnidarians, the myxozoans. By applying customized contamination removal procedures, contaminants in the omic data were successfully identified and removed. The method is effective and does not result in over-decontamination, as shown by comparing the CCPRD MS results with those from an artificially contaminated database and from another database in which the contaminants removed from the genomes and transcriptomes were added back. CCPRD outperformed traditional frame-based methods by identifying 35.2%-50.7% more peptides and 35.8%-43.8% more proteins, with a maximum size reduction of 84.6%. A BUSCO analysis showed that the CCPRD maintained a relatively high level of completeness compared to traditional methods. These results confirm the superiority of the CCPRD over existing methods in peptide and protein identification numbers, database size, and completeness. By providing a general framework for generating the reference database, the CCPRD, which does not need a high-quality genome, can potentially be applied to any organism and contribute significantly to proteomic research.
Project description:<p>Traveler's diarrhea (TD) is caused by enterotoxigenic Escherichia coli (ETEC), other gram-negative pathogens, norovirus and some parasites. Nevertheless, standard diagnostic methods fail to identify pathogens in more than 30% of TD patients, suggesting that new pathogens or groups of pathogens may be causative agents of disease. We performed a comprehensive metagenomic study of the fecal microbiomes of 23 TD patients and seven healthy travelers, all of whom tested negative for the known etiologic agents of TD in standard tests. Metagenomic reads were assembled and the resulting contigs were subjected to semi-manual binning to assemble independent genomes from metagenomic pools. Taxonomic and functional annotations were performed to assist in the identification of putative pathogens. We extracted 560 draft genomes, 320 of which were complete enough to be characterized as cellular genomes and 160 of which were bacteriophage genomes. We made predictions of the etiology of disease in individual subjects based on the properties and features of the recovered cellular genomes. Three subtypes of samples were observed. First, four patients had low-diversity metagenomes dominated by one or more pathogenic E. coli strains. Annotation allowed prediction of the pathogenic type in most cases. Second, five patients were co-infected with E. coli and other members of the Enterobacteriaceae, including antibiotic-resistant Enterobacter, Klebsiella, and Citrobacter. Finally, several samples contained genomes that represented dark matter. In one of these samples we identified a TM7 genome that phylogenetically clustered with a strain isolated from wastewater and carried genes encoding potential virulence factors. We also observed a very high proportion of bacteriophage reads in some samples. The relative abundance of phage was significantly higher in healthy travelers than in TD patients.
Our assembly-based analysis revealed that diarrhea is often polymicrobial and can involve members of the Enterobacteriaceae not normally associated with TD, and it implicates a new member of the TM7 phylum as a potential player in diarrheal disease.</p>