Dataset Information

Accurate reconstruction of bacterial pan- and core genomes with PEPPAN.

ABSTRACT: Bacterial genomes can contain traces of a complex evolutionary history, including extensive homologous recombination, gene loss, gene duplications, and horizontal gene transfer. To reconstruct the phylogenetic and population history of a set of multiple bacteria, it is necessary to examine their pangenome, the composite of all the genes in the set. Here we introduce PEPPAN, a novel pipeline that can reliably construct pangenomes from thousands of genetically diverse bacterial genomes that represent the diversity of an entire genus. PEPPAN outperforms existing pangenome methods by providing consistent gene and pseudogene annotations extended by similarity-based gene predictions, and identifying and excluding paralogs by combining tree- and synteny-based approaches. The PEPPAN package additionally includes PEPPAN_parser, which implements additional downstream analyses, including the calculation of trees based on accessory gene content or allelic differences between core genes. To test the accuracy of PEPPAN, we implemented SimPan, a novel pipeline for simulating the evolution of bacterial pangenomes. We compared the accuracy and speed of PEPPAN with four state-of-the-art pangenome pipelines using both empirical and simulated data sets. PEPPAN was more accurate and more specific than any of the other pipelines and was almost as fast as any of them. As a case study, we used PEPPAN to construct a pangenome of approximately 40,000 genes from 3052 representative genomes spanning at least 80 species of Streptococcus. The resulting gene and allelic trees provide an unprecedented overview of the genomic diversity of the entire Streptococcus genus.
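
The terms used in the abstract (pangenome, core genes, accessory gene content) can be illustrated with a toy sketch. This is not PEPPAN's algorithm or its API: it assumes that orthologous gene families have already been identified and that paralogs have been excluded, which is the hard part that PEPPAN addresses. The strain names, gene names, and function names below are hypothetical, chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def pan_and_core(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (pangenome, core genome) for a mapping genome name -> gene families.

    The pangenome is the union of all gene families; the core genome is the
    set of families present in every genome.
    """
    if not genomes:
        return set(), set()
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return pan, core


def accessory_distance(a: Set[str], b: Set[str], core: Set[str]) -> float:
    """Jaccard distance between two genomes, computed on accessory genes only."""
    acc_a, acc_b = a - core, b - core
    union = acc_a | acc_b
    if not union:
        # Two genomes with no accessory genes are treated as identical.
        return 0.0
    return 1.0 - len(acc_a & acc_b) / len(union)


# Hypothetical toy input: three strains with pre-assigned gene families.
genomes = {
    "strain_A": {"gyrB", "recA", "rpoB", "capA"},
    "strain_B": {"gyrB", "recA", "rpoB", "tetM"},
    "strain_C": {"gyrB", "recA", "rpoB", "capA", "tetM"},
}

pan, core = pan_and_core(genomes)
accessory = pan - core
d_ab = accessory_distance(genomes["strain_A"], genomes["strain_B"], core)
d_ac = accessory_distance(genomes["strain_A"], genomes["strain_C"], core)
```

A matrix of such pairwise distances is one possible input for a tree based on accessory gene content; how PEPPAN_parser builds its trees is not described in this record.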

SUBMITTER: Zhou Z

PROVIDER: S-EPMC7605250 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature

Publications

Accurate reconstruction of bacterial pan- and core genomes with PEPPAN.

Zhemin Zhou, Jane Charlesworth, Mark Achtman

Genome Research, 2020-10-14, issue 11

PMID: 33055096

Similar Datasets

Project description: Acquisition of ecologically relevant genes is common among ocean bacteria, but whether it has a major impact on genome evolution in marine environments remains unknown. Here, we analyzed the core genomes of 16 phylogenetically diverse and ecologically relevant bacterioplankton lineages, each consisting of up to five genomes varying at the strain level. Statistical approaches identified from each lineage up to ∼50 loci showing anomalously high divergence at synonymous sites, which is best explained by recombination with distantly related organisms. The enriched gene categories in these outlier loci match well with the characteristics previously identified as the key phenotypes of these lineages. Examples are antibiotic synthesis and detoxification in Phaeobacter inhibens, exopolysaccharide production in Alteromonas macleodii, hydrocarbon degradation in Marinobacter hydrocarbonoclasticus, and cold adaptation in Pseudoalteromonas haloplanktis. Intriguingly, the outlier loci feature polysaccharide catabolism in Cellulophaga baltica but not in Cellulophaga lytica, consistent with their primary habitat preferences in macroalgae and beach sands, respectively. Likewise, analysis of Prochlorococcus showed that photosynthesis-related genes listed in the outlier loci are found only in the high-light-adapted ecotype and not in the low-light-adapted ecotype. These observations strongly suggest that recombination with distant relatives is a key mechanism driving the ecological diversification among marine bacterial lineages.

IMPORTANCE: Acquisition of new metabolic genes has been known as an important mechanism driving bacterial evolution and adaptation in the ocean, but acquisition of novel alleles of existing genes and its potential ecological role have not been examined. Guided by population genetic theories, our genomic analysis showed that divergent allele acquisition is prevalent in phylogenetically diverse marine bacterial lineages and that the affected loci often encode metabolic functions that underlie the known ecological roles of the lineages under study.

Project description: Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems, we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered.
We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.
