An accessible, efficient and global approach for the large-scale sequencing of bacterial genomes.
ABSTRACT: We have developed an efficient and inexpensive pipeline for streamlining large-scale collection and genome sequencing of bacterial isolates. Evaluation of this method involved a worldwide research collaboration focused on the model organism Salmonella enterica, the 10KSG consortium. Following the optimization of a logistics pipeline that involved shipping isolates as thermolysates in ambient conditions, the project assembled a diverse collection of 10,419 isolates from low- and middle-income countries. The genomes were sequenced using the LITE pipeline for library construction, with a total reagent cost of less than USD$10 per genome. Our method can be applied to other large bacterial collections to underpin global collaborations.
Project description: Programmable CRISPR-Cas systems have augmented our ability to produce precise genome manipulations. Here we demonstrate and characterize the ability of CRISPR-Cas derived nickases to direct targeted recombination of both small and large genomic regions flanked by repetitive elements in Escherichia coli. While CRISPR-directed double-stranded DNA breaks are highly lethal in many bacteria, we show that CRISPR-guided nickase systems can be programmed to make precise, nonlethal, single-stranded incisions in targeted genomic regions. This induces recombination events and leads to targeted deletion. We demonstrate that dual-targeted nicking enables deletion of 36 and 97 kb of the genome. Furthermore, multiplex targeting enables deletion of 133 kb, accounting for approximately 3% of the entire E. coli genome. This technology provides a framework for methods to manipulate bacterial genomes using CRISPR-nickase systems. We envision this system working synergistically with preexisting bacterial genome engineering methods.
Project description: Complex human diseases are affected by genetic and environmental risk factors and their interactions. Gene-environment interaction (GEI) tests for aggregate genetic variant sets have been developed in recent years. However, existing statistical methods become rate-limiting for large biobank-scale sequencing studies with correlated samples. We propose efficient Mixed-model Association tests for GEne-Environment interactions (MAGEE), for testing GEI between an aggregate variant set and environmental exposures on quantitative and binary traits in large-scale sequencing studies with related individuals. Joint tests for the aggregate genetic main effects and GEI effects are also developed. A null generalized linear mixed model adjusting for covariates but without any genetic effects is fit only once in a whole genome GEI analysis, thereby vastly reducing the overall computational burden. Score tests for variant sets are performed as a combination of genetic burden and variance component tests by accounting for the genetic main effects using matrix projections. The computational complexity is dramatically reduced in a whole genome GEI analysis, which makes MAGEE scalable to hundreds of thousands of individuals. We applied MAGEE to the exome sequencing data of 41,144 related individuals from the UK Biobank, and the analysis of 18,970 protein coding genes finished within 10.4 CPU hours.
Project description: Background: Inconsistencies are often observed in the genome annotations of bacterial strains. Moreover, these inconsistencies are often not reflected by sequence discrepancies, but are caused by wrongly annotated gene starts as well as mis-identified gene presence. Thus, tools are needed for improving annotation consistency and accuracy among sets of bacterial strain genomes. Results: We have developed eCAMBer, a tool for efficiently supporting comparative analysis of multiple bacterial strains within the same species. eCAMBer is a highly optimized revision of our earlier tool, CAMBer, scaling it up for significantly larger datasets comprising hundreds of bacterial strains. eCAMBer works in two phases. First, it transfers gene annotations among all considered bacterial strains. In this phase, it also identifies homologous gene families and annotation inconsistencies. Second, eCAMBer tries to improve the quality of annotations by resolving the gene start inconsistencies and filtering out gene families arising from annotation errors propagated in the previous phase. Conclusions: eCAMBer efficiently identifies and resolves annotation inconsistencies among closely related bacterial genomes. It outperforms other competing tools both in terms of running time and accuracy of produced annotations. Software, user manual, and case study results are available at the project website: http://bioputer.mimuw.edu.pl/ecamber.
Project description: Bacteria possess an amazing capacity to synthesize a diverse range of structurally complex, bioactive natural products known as specialized (or secondary) metabolites. Many of these specialized metabolites are used as clinical therapeutics, while others have important ecological roles in microbial communities. The biosynthetic gene clusters (BGCs) that generate these metabolites can be identified in bacterial genome sequences using their highly conserved genetic features. We analyzed an unprecedented 1,566 bacterial genomes from Bacillus species and identified nearly 20,000 BGCs. By comparing these BGCs to one another as well as a curated set of known specialized metabolite BGCs, we discovered that the majority of Bacillus natural products are comprised of a small set of highly conserved, well-distributed, known natural product compounds. Most of these metabolites have important roles influencing the physiology and development of Bacillus species. We identified, in addition to these characterized compounds, many unique, weakly conserved BGCs scattered across the genus that are predicted to encode unknown natural products. Many of these "singleton" BGCs appear to have been acquired via horizontal gene transfer. Based on this large-scale characterization of metabolite production in the Bacilli, we go on to connect the alkylpyrones, natural products that are highly conserved but previously biologically uncharacterized, to a role in Bacillus physiology: inhibiting spore development.
IMPORTANCE: Bacilli are capable of producing a diverse array of specialized metabolites, many of which have gained attention for their roles as signals that affect bacterial physiology and development. Up to this point, however, the Bacillus genus's metabolic capacity has been underexplored. We undertook a deep genomic analysis of 1,566 Bacillus genomes to understand the full spectrum of metabolites that this bacterial group can make. We discovered that the majority of the specialized metabolites produced by Bacillus species are highly conserved, known compounds with important signaling roles in the physiology and development of this bacterium. Additionally, there is significant unique biosynthetic machinery distributed across the genus that might lead to new, unknown metabolites with diverse biological functions. Inspired by the findings of our genomic analysis, we speculate that the highly conserved alkylpyrones might have an important biological activity within this genus. We go on to validate this prediction by demonstrating that these natural products are developmental signals in Bacillus and act by inhibiting sporulation.
Project description: GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI's RefSeq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app (https://jenniferlu717.shinyapps.io/SkewIT/) that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the RefSeq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.
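To illustrate the quantity behind the analysis described above: GC skew over a stretch of sequence is conventionally computed as (G - C) / (G + C). The minimal Python sketch below computes it over non-overlapping windows and accumulates it along the sequence. It is not the SkewIT algorithm or the SkewI metric; the function names `gc_skew` and `cumulative_skew` and the default window size are assumptions chosen for illustration only.

```python
from itertools import accumulate


def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for each non-overlapping window of seq.

    Windows with no G or C are reported as 0.0. A trailing partial
    window is ignored. The window size is an illustrative default.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew(skews):
    """Running sum of the windowed skew values along the sequence."""
    return list(accumulate(skews))
```

For a chromosome with a clear skew pattern, the cumulative curve tends to reach its minimum near the replication origin and its maximum near the terminus, which is why the windowed values are usually inspected in cumulative form.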
Project description: Identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs are used in function prediction, operon detection, etc. Many computational methods have been proposed for defining gene clusters to identify gene families and operons. However, most of those algorithms are only applicable to small data sets. We developed an efficient gene clustering algorithm that can be applied to hundreds of genomes at the same time. This approach allows for large-scale study of evolutionary relationships of gene clusters and study of operon formation and destruction. An analysis of proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.
Project description: Background: Lineage-specific, or taxonomically restricted genes (TRGs), especially those that are species and strain-specific, are of special interest because they are expected to play a role in defining exclusive ecological adaptations to particular niches. Despite this, they are relatively poorly studied and little understood, in large part because many are still orphans or only have homologues in very closely related isolates. This lack of homology confounds attempts to establish the likelihood that a hypothetical gene is expressed and, if so, to determine the putative function of the protein. Methodology/principal findings: We have developed "QIPP" ("Quality Index for Predicted Proteins"), an index that scores the "quality" of a protein based on non-homology-based criteria. QIPP can be used to assign a value between zero and one to any protein based on comparing its features to other proteins in a given genome. We have used QIPP to rank the predicted proteins in the proteomes of Bacteria and Archaea. This ranking reveals that there is a large amount of variation in QIPP scores, and identifies many high-scoring orphans as potentially "authentic" (expressed) orphans. There are significant differences in the distributions of QIPP scores between orphan and non-orphan genes for many genomes and a trend for less well-conserved genes to have lower QIPP scores. Conclusions: The implication of this work is that QIPP scores can be used to further annotate predicted proteins with information that is independent of homology. Such information can be used to prioritize candidates for further analysis. Data generated for this study can be found in the OrphanMine at http://www.genomics.ceh.ac.uk/orphan_mine.
Project description: Insertion sequences (IS) are the simplest and most abundant form of transposable DNA found in bacterial genomes. When present in multiple copies, it is thought that they can promote genomic plasticity and genetic exchange, thus being a major force of evolutionary change. The main processes that determine IS content in genomes remain, however, a matter of debate. In this work, we take advantage of the large amount of genomic data currently available and study the abundance distributions of 33 IS families in 1811 bacterial chromosomes. This allows us to test simple models of IS dynamics and estimate their key parameters by means of a maximum likelihood approach. We evaluate the roles played by duplication, lateral gene transfer, deletion and purifying selection. We find that the observed IS abundances are compatible with a neutral scenario where IS proliferation is controlled by deletions instead of purifying selection. Even if there may be some cases driven by selection, neutral behavior dominates over large evolutionary scales. According to this view, IS and hosts tend to coexist in a dynamic equilibrium state for most of the time. Our approach also allows for a detection of recent IS expansions, and supports the hypothesis that rapid expansions constitute transient events (punctuations) during which the state of coexistence of IS and host becomes perturbed.
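As a rough illustration of what a simple neutral model of IS dynamics can look like, the Python sketch below treats the copy number of one IS family in a genome as a linear birth-death process with immigration: each copy duplicates at rate `dup`, new copies arrive by lateral transfer at rate `hgt`, and each copy is deleted at rate `dele`. This is a generic textbook process and not necessarily the model fitted in the study; the function name, the parameter names, and the truncation at `n_max` copies are all assumptions made for this example.

```python
def stationary_distribution(dup, hgt, dele, n_max=200):
    """Stationary copy-number distribution of a birth-death chain.

    Transition rates (hypothetical, for illustration):
      n -> n + 1 at rate dup * n + hgt   (duplication plus lateral transfer)
      n -> n - 1 at rate dele * n        (deletion)
    Detailed balance gives
      pi[n + 1] / pi[n] = (dup * n + hgt) / (dele * (n + 1)).
    The chain is truncated at n_max copies; choose dup < dele so that
    the truncated tail is negligible.
    """
    weights = [1.0]
    for n in range(n_max):
        ratio = (dup * n + hgt) / (dele * (n + 1))
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]
```

With `dup = 0` the process reduces to pure immigration and deletion, and the stationary distribution is Poisson with mean `hgt / dele`, which gives a quick sanity check. In a fitting setting, a distribution of this kind would be compared with observed copy-number counts through a likelihood function.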
Project description: RNA editing by adenosine deamination has been shown to generate multiple isoforms of several neural receptors, often with profound effects on receptor function. However, little is known about the regulation of editing activity during development. We have developed a large-scale RNA sequencing protocol to determine adenosine-to-inosine (A-to-I) editing frequencies in the coding region of genes in the mammalian brain. Using the 454 Life Sciences (Roche) Amplicon Sequencing technology, we were able to determine even low levels of editing with high accuracy. The efficiency of editing for 28 different sites was analyzed during the development of the mouse brain from embryogenesis to adulthood. We show that, with few exceptions, the editing efficiency is low during embryogenesis, increasing gradually at different rates up to the adult mouse. The variation in editing gave receptors like HTR2C and GABA(A) (gamma-aminobutyric acid type A) a different set of protein isoforms during development from those in the adult animal. Furthermore, we show that this regulation of editing activity cannot be explained by an altered expression of the ADAR proteins but, rather, by the presence of a regulatory network that controls the editing activity during development.