Dataset Information

Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

ABSTRACT: The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies. 
This study presents the first uniformly assembled, comprehensively described and searchable dataset of 661,405 bacterial genomes; this resource will empower more scientists to harness the multitude of data in public sequencing archives, but also reveals the biased composition of these archives, with 90% of the data originating from just 20 species.
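
As a rough illustration of how MinHash sketches support the genome-wide distance estimates mentioned in the abstract, the following minimal sketch estimates Jaccard similarity between two sequences from bottom-s sketches of their k-mer sets. It is not the pipeline used to build this resource: the function names, the SHA-1 based hash, and the small `k` and `size` defaults are assumptions chosen for readability, and canonical (strand-independent) k-mers are omitted.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k=5, size=16):
    """Bottom-s sketch: the `size` smallest 64-bit hashes of the k-mer set.

    SHA-1 is an illustrative choice of hash, not the one any particular
    tool uses.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the union of both sketches and
    count the fraction that appear in both.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Comparing two genomes then only needs their small sketches rather than the full sequences, which is what makes all-against-all comparisons across hundreds of thousands of assemblies tractable.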

SUBMITTER: Blackwell G

PROVIDER: S-EPMC8577725 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: A cornerstone of bacterial molecular biology is the ability to genetically manipulate the microbe under study. Many bacteria are difficult to manipulate genetically, a phenotype due in part to robust removal of newly acquired DNA, for example, by restriction-modification (R-M) systems. Here, we report approaches that dramatically improve bacterial transformation efficiency, piloted using a microbe that is challenging to transform due to expression of many R-M systems, Helicobacter pylori. Initially, we identified conditions that dampened expression of several R-M systems and concomitantly enhanced transformation efficiency. We then identified an approach that would broadly protect newly acquired DNA. We computationally predicted under-represented short DNA sequences in the H. pylori genome, with the idea that these sequences reflect targets of sequence-based surveillance such as R-M systems. We then used this information to modify and eliminate such sites in antibiotic resistance cassettes, creating a "stealth" version. Modifying antibiotic resistance cassettes in this way resulted in significantly higher transformation efficiency compared to non-modified cassettes, a response that was independent of genomic locus. Our results suggest that avoiding R-M systems, via modification of under-represented DNA sequences or transformation conditions, is a powerful method to enhance DNA transformation. Our approach to identify under-represented sequences is applicable to any microbe with a sequenced genome.

IMPORTANCE: Manipulating the genomes of bacteria is critical to many fields. Such manipulations are made by genetic engineering, which often requires new pieces of DNA to be added to the genome. Bacteria have robust systems for identifying and degrading new DNA, some of which rely on restriction enzymes. These enzymes cut DNA at specific sequences. We identified a set of DNA sequences that are missing from a bacterium's genome more often than would be expected by chance. Eliminating these sequences from a new piece of DNA allowed it to be incorporated into the bacterial genome at a higher frequency than new DNA containing the sequences. Removing such sequences appears to allow the new DNA to fly under the bacterial radar in "stealth" mode. This transformation improvement approach is straightforward to apply and likely broadly applicable.
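
The description does not state how under-represented sequences were predicted. One common way to frame the idea is to compare each k-mer's observed count with the count expected if bases occurred independently at their genome-wide frequencies. The sketch below does that under this stated assumption: the function names, the independent-base null model, and the `max_ratio` and `min_expected` thresholds are all illustrative and are not taken from the study.

```python
from collections import Counter
from itertools import product


def kmer_obs_exp(genome, k=4):
    """Return {kmer: (observed, expected)} for every DNA k-mer.

    Expected counts assume bases occur independently at their overall
    frequencies in `genome` (an illustrative null model).
    """
    n_windows = len(genome) - k + 1
    base_freq = {b: c / len(genome) for b, c in Counter(genome).items()}
    observed = Counter(genome[i:i + k] for i in range(n_windows))
    result = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        expected = float(n_windows)
        for base in kmer:
            expected *= base_freq.get(base, 0.0)
        result[kmer] = (observed.get(kmer, 0), expected)
    return result


def under_represented(genome, k=4, max_ratio=0.5, min_expected=1.0):
    """List k-mers whose observed/expected ratio is at most `max_ratio`.

    k-mers with a very small expected count are skipped, because a low
    ratio there carries little evidence. Both thresholds are hypothetical.
    """
    hits = []
    for kmer, (obs, exp) in kmer_obs_exp(genome, k).items():
        if exp >= min_expected and obs / exp <= max_ratio:
            hits.append(kmer)
    return sorted(hits)
```

Sites flagged this way would be candidates for removal from a cassette by synonymous or otherwise neutral edits. A real analysis would also need to consider both strands and a more careful background model.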
