Project description:Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
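The cumulative GC skew check mentioned in the description above can be illustrated with a minimal sketch. It assumes the common definition of skew, (G - C) / (G + C) per non-overlapping window, summed along the sequence; the record itself does not state the window size or formula, so the function names and the `window` default here are hypothetical. In a correctly assembled circular bacterial genome the cumulative curve is generally expected to have one minimum and one maximum (near the origin and terminus of replication); extra peaks or troughs can hint at a misassembly.

```python
def cumulative_gc_skew(seq, window=1000):
    """Return the running sum of per-window GC skew, (G - C) / (G + C).

    Windows are non-overlapping; a window without G or C contributes 0.
    """
    seq = seq.upper()
    total = 0.0
    cumulative = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        total += skew
        cumulative.append(total)
    return cumulative


def skew_extremes(cumulative):
    """Return (index of minimum, index of maximum) of the cumulative curve.

    Indices are window numbers, not base-pair coordinates.
    """
    indices = range(len(cumulative))
    lowest = min(indices, key=cumulative.__getitem__)
    highest = max(indices, key=cumulative.__getitem__)
    return lowest, highest


if __name__ == "__main__":
    # Toy sequence: a G-rich half followed by a C-rich half.
    toy = "GGGC" * 500 + "CCCG" * 500
    curve = cumulative_gc_skew(toy, window=1000)
    print(curve, skew_extremes(curve))
```

On real genomes the curve would normally be plotted and inspected together with other evidence, such as read mapping, rather than judged from the extremes alone.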
Project description:The genetic structure of the indigenous hunter-gatherer peoples of Southern Africa, the oldest known lineage of modern man, holds an important key to understanding humanity's early history. Previously sequenced human genomes have been limited to recently diverged populations. Here we present the first complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and of a Bantu from Southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, and 13,146 novel amino-acid variants. These data allow genetic relationships among Southern African foragers and neighboring agriculturalists to be traced more accurately than was previously possible. Adding the described variants to current databases will facilitate inclusion of Southern Africans in medical research efforts.
Project description:The genetic structure of the indigenous hunter-gatherer peoples of Southern Africa, the oldest known lineage of modern man, holds an important key to understanding humanity's early history. Previously sequenced human genomes have been limited to recently diverged populations. Here we present the first complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and of a Bantu from Southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, and 13,146 novel amino-acid variants. These data allow genetic relationships among Southern African foragers and neighboring agriculturalists to be traced more accurately than was previously possible. Adding the described variants to current databases will facilitate inclusion of Southern Africans in medical research efforts. Copy number differences between NA18507 and KB1 were predicted from the depth of whole-genome shotgun sequence reads. These predictions were then validated by array-CGH, using a genome-wide design as well as a custom design targeted at specific regions of copy number difference.
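As a rough illustration of predicting copy number differences from read depth, the sketch below compares per-window depth between two samples. It is not the procedure used for NA18507 and KB1, which the record does not describe. The median normalization, the log2 threshold of 0.5, and the function name `call_copy_number` are all assumptions made for the example.

```python
import math
from statistics import median


def call_copy_number(test_depth, ref_depth, threshold=0.5):
    """Compare per-window read depth between two samples.

    Each sample is scaled by its own median depth, then the log2 ratio
    test/reference is computed per window. Windows with zero depth in
    either sample are reported as "no_call".
    Returns a list of (log2_ratio or None, call) tuples.
    """
    if len(test_depth) != len(ref_depth):
        raise ValueError("depth vectors must have the same length")
    test_median = median(test_depth)
    ref_median = median(ref_depth)
    if test_median == 0 or ref_median == 0:
        raise ValueError("median depth is zero; cannot normalize")

    calls = []
    for t, r in zip(test_depth, ref_depth):
        if t == 0 or r == 0:
            calls.append((None, "no_call"))
            continue
        ratio = math.log2((t / test_median) / (r / ref_median))
        if ratio > threshold:
            calls.append((ratio, "gain"))
        elif ratio < -threshold:
            calls.append((ratio, "loss"))
        else:
            calls.append((ratio, "neutral"))
    return calls


if __name__ == "__main__":
    # Hypothetical depths for five windows; the third is amplified in the test sample.
    print(call_copy_number([10, 10, 40, 10, 10], [16, 16, 16, 16, 16]))
```

Calls made this way are only candidates, which is why an orthogonal validation such as array-CGH is useful.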
Project description:We present a single-base-resolution sequencing methodology that will simultaneously sequence complete genetics and complete epigenetics in a single workflow. The approach is non-destructive to DNA and provides a digital readout of bases, which we exemplify by simultaneous sequencing of G, C, T, A, and 5mC/5hmC (5-letter sequencing) or 5mC and 5hmC (6-letter sequencing). We demonstrate sequencing of human genomic DNA and also cell-free DNA taken from a blood sample of a cancer patient. The approach is accurate, requires low DNA input and has a simple workflow and analysis pipeline. We envisage it will be versatile across many applications in life sciences.
Project description:Viral infections remain a serious global health issue. Metagenomic approaches are increasingly used in the detection of novel viral pathogens but also to generate complete genomes of uncultivated viruses. In silico identification of complete viral genomes from sequence data would allow rapid phylogenetic characterization of these new viruses. Often, however, complete viral genomes are not recovered, but rather several distinct contigs derived from a single entity are, some of which have no sequence homology to any known proteins. De novo assembly of single viruses from a metagenome is challenging, not only because of the lack of a reference genome, but also because of intrapopulation variation and uneven or insufficient coverage. Here we explored different assembly algorithms, remote homology searches, genome-specific sequence motifs, k-mer frequency ranking, and coverage profile binning to detect and obtain viral target genomes from metagenomes. All methods were tested on 454-generated sequencing datasets containing three recently described RNA viruses with relatively large genomes that were divergent from previously known viruses of the viral families Rhabdoviridae and Coronaviridae. Depending on specific characteristics of the target virus and the metagenomic community, different assembly and in silico gap closure strategies were successful in obtaining near-complete viral genomes.
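One way to picture k-mer frequency ranking is to compare the tetranucleotide profile of a contig already assigned to the target virus (the seed) with the profiles of unassigned contigs, and rank the latter by similarity. The sketch below does this with cosine similarity. The choice of k = 4, the similarity measure, and all function names are assumptions for illustration; the record does not specify how the ranking was done.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return relative frequencies of all 4**k k-mers over A/C/G/T, in fixed order.

    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_contigs(seed, contigs, k=4):
    """Rank contigs (name -> sequence) by profile similarity to the seed, best first."""
    seed_profile = kmer_profile(seed, k)
    scored = [
        (name, cosine_similarity(seed_profile, kmer_profile(seq, k)))
        for name, seq in contigs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy contigs, far shorter than real ones.
    seed = "AT" * 50
    contigs = {"similar": "AT" * 30, "different": "GC" * 30}
    print(rank_contigs(seed, contigs))
```

This sketch counts k-mers on one strand only; a fuller version would probably merge each k-mer with its reverse complement, and short contigs give noisy profiles.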
Project description:Accurate functional annotation of regulatory elements is essential for understanding global gene regulation. Here, we report a genome-wide map of 827,000 transcription factor binding sites in human lymphoblastoid cell lines, which is comprised of sites corresponding to 239 position weight matrices of known transcription factor binding motifs, and 49 novel sequence motifs. To generate this map, we developed a probabilistic framework that integrates cell- or tissue-specific experimental data such as histone modifications and DNaseI cleavage patterns with genomic information such as gene annotation and evolutionary conservation. Comparison to empirical ChIP-seq data suggests that our method is highly accurate yet has the advantage of targeting many factors in a single assay. We anticipate that this approach will be a valuable tool for genome-wide studies of gene regulation in a wide variety of cell types or tissues under diverse conditions. DNaseI-Seq on two YRI HapMap cell lines. Each individual was sequenced on 8 lanes of the Illumina Genome Analyzer II.
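The position weight matrices mentioned above can be illustrated with a minimal motif scan. The sketch covers only the sequence-scoring step, not the probabilistic framework that integrates DNaseI and histone data. The uniform background, the pseudocount, the toy motif, and all function names are assumptions made for the example.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds position weight matrix.

    `counts` is a list of dicts (one per motif position) mapping base -> count.
    A uniform background frequency is assumed.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq, pwm):
    """Score every window of the forward strand; windows with non-ACGT bases are skipped."""
    seq = seq.upper()
    width = len(pwm)
    scores = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue
        scores.append((i, sum(pwm[j][base] for j, base in enumerate(window))))
    return scores


def best_hit(seq, pwm):
    """Return (position, score) of the highest-scoring window, or None if none was scored."""
    return max(scan(seq, pwm), key=lambda hit: hit[1], default=None)


if __name__ == "__main__":
    # Hypothetical 3-bp motif "TGA" built from 10 aligned sites.
    pwm = log_odds_pwm([{"T": 10}, {"G": 10}, {"A": 10}])
    print(best_hit("CCCTGACCC", pwm))
```

A sequence score like this could serve as one input to a model that also weighs chromatin accessibility and conservation, which is the kind of integration the description refers to.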
Project description:Infectious disease metagenomics is driven by the question "what is causing the disease?", in contrast to classical metagenome studies, which are guided by "what is out there?" In the case of a novel virus, a first step toward establishing etiology can be to recover a full-length viral genome from a metagenomic sample. However, retrieval of a full-length genome of a divergent virus is technically challenging and can be time-consuming and costly. Here we discuss different assembly and fragment linkage strategies such as iterative assembly, motif searches, k-mer frequency profiling, coverage profile binning, and other strategies used to recover genomes of potential viral pathogens in a timely and cost-effective manner.
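Coverage profile binning rests on the idea that contigs from the same genome tend to be sequenced at similar depth. A minimal sketch, assuming one mean coverage value per contig and a simple fold-change cutoff, is shown below. The greedy grouping rule, the `max_fold` default, and the function name are hypothetical and not taken from the record.

```python
def bin_by_coverage(coverage, max_fold=1.5):
    """Group contigs whose mean coverage lies within `max_fold` of a bin's lowest value.

    `coverage` maps contig name -> mean read depth. Contigs are sorted by depth
    and assigned greedily, so each bin spans at most a `max_fold` range.
    Returns a list of bins, each a list of contig names in ascending depth.
    """
    bins = []
    for name, depth in sorted(coverage.items(), key=lambda item: item[1]):
        if bins and depth <= bins[-1]["min"] * max_fold:
            bins[-1]["contigs"].append(name)
        else:
            bins.append({"min": depth, "contigs": [name]})
    return [b["contigs"] for b in bins]


if __name__ == "__main__":
    # Hypothetical mean depths for five contigs from one metagenome.
    depths = {"c1": 10.0, "c2": 12.0, "c3": 95.0, "c4": 110.0, "c5": 11.0}
    print(bin_by_coverage(depths))
```

Because uneven coverage within a single viral genome is common, bins formed this way are best treated as hypotheses to be checked against other linkage evidence such as shared motifs or k-mer profiles.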
Project description:Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis, we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach that leverages phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated with sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).