Project description:Obtaining an unbiased view of the phylogenetic composition and functional diversity within a microbial community is one central objective of metagenomic analysis. New technologies, such as 454 pyrosequencing, have dramatically reduced sequencing costs, to a level where metagenomic analysis may become a viable alternative to more-focused assessments of the phylogenetic (e.g., 16S rRNA genes) and functional diversity of microbial communities. To determine whether the short (approximately 100 to 200 bp) sequence reads obtained from pyrosequencing are appropriate for the phylogenetic and functional characterization of microbial communities, the results of BLAST and COG analyses were compared for long (approximately 750 bp) and randomly derived short reads from each of two microbial and one virioplankton metagenome libraries. Overall, BLASTX searches against the GenBank nr database found far fewer homologs within the short-sequence libraries. This was especially pronounced for a Chesapeake Bay virioplankton metagenome library. Increasing the short-read sampling depth or the length of derived short reads (up to 400 bp) did not completely resolve the discrepancy in BLASTX homolog detection. Only in cases where the long-read sequence had a close homolog (low BLAST E-score) did the derived short-read sequence also find a significant homolog. Thus, more-distant homologs of microbial and viral genes are not detected by short-read sequences. Among COG hits, derived short reads sampled at a depth of two short reads per long read missed up to 72% of the COG hits found using long reads. Noting the current limitation in computational approaches for the analysis of short sequences, the use of short-read-length libraries does not appear to be an appropriate tool for the metagenomic characterization of microbial communities.
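As an illustration of the comparison described above, the following minimal sketch derives random short reads from a long read at a fixed sampling depth. It is not the authors' code; the function name, the 100 bp default and the uniform choice of start positions are assumptions made for the example.

```python
import random


def derive_short_reads(long_read, read_len=100, per_long=2, rng=None):
    """Cut `per_long` random substrings of length `read_len` from one long read.

    Hypothetical helper: start positions are drawn uniformly, which is an
    assumption and not necessarily how the original study sampled.
    """
    rng = rng or random.Random()
    if len(long_read) < read_len:
        return []  # long read is too short to yield a derived read
    max_start = len(long_read) - read_len
    starts = [rng.randint(0, max_start) for _ in range(per_long)]
    return [long_read[s:s + read_len] for s in starts]


if __name__ == "__main__":
    demo = "ACGT" * 200  # stand-in for a ~750 bp Sanger read
    for read in derive_short_reads(demo, read_len=100, per_long=2):
        print(len(read), read[:20] + "...")
```

Each derived read would then be searched with BLASTX in the same way as its parent long read, so that hits can be compared read by read.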
Project description:Background: Different high-throughput nucleic acid sequencing platforms are currently available, but a trade-off exists between the cost and number of reads that can be generated versus the read length that can be achieved. Methodology/principal findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read. Conclusions/significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.
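The idea of joining overlapping mates into a composite read can be sketched as follows. This is a naive illustration and not SHERA's algorithm: it ignores quality scores, and the function names, the minimum overlap and the mismatch threshold are all assumptions chosen for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Join a forward read and its mate into one composite read.

    The mate is reverse-complemented, then the suffix of `fwd` is compared
    with the prefix of the mate, trying the longest overlap first. Returns
    None when no overlap passes the (assumed) mismatch threshold.
    """
    mate = reverse_complement(rev)
    longest = min(len(fwd), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-olen:], mate[:olen]))
        if mismatches <= max_mismatch_frac * olen:
            # Keep the forward read and append the non-overlapping mate tail.
            return fwd + mate[olen:]
    return None
```

A production tool would also use the base qualities in the overlap to pick the consensus base and to score the alignment; this sketch simply keeps the forward read's bases.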
Project description:Background: There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. Results: We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. Conclusions: BEAR is useful for evaluating data processing tools in genomics. 
It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.
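To make the notion of a non-parametric read-length distribution concrete, the sketch below resamples read lengths directly from observed lengths and cuts reads of those lengths from a reference sequence. It is not BEAR's method (BEAR is described as using a machine-learning approach and also models quality values and errors); all names here are hypothetical and the reads are error-free.

```python
import random
from collections import Counter


def build_length_sampler(observed_lengths, rng=None):
    """Return a function that draws read lengths from the empirical distribution."""
    rng = rng or random.Random()
    counts = Counter(observed_lengths)
    lengths = list(counts)
    weights = [counts[n] for n in lengths]

    def sample(k):
        # Weighted resampling: no uniform or normal assumption is imposed.
        return rng.choices(lengths, weights=weights, k=k)

    return sample


def simulate_reads(genome, sampler, n_reads, rng=None):
    """Cut `n_reads` error-free reads from `genome` with sampled lengths."""
    rng = rng or random.Random()
    reads = []
    for length in sampler(n_reads):
        length = min(length, len(genome))  # never exceed the reference
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads
```

In practice the observed lengths would be read from a real FASTQ file of the platform being emulated, so that the simulated lengths follow that run's distribution.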
Project description:Long-read CAGE was designed to identify full-length capped transcripts across 10 specific loci in cortical neurons. Long-read CAGE was based on the Cap-Trapper method, with full-length cDNA sequencing on an ONT MinION sequencer. After RNA extraction, 10 µg of total RNA from human iPS (WTC-11) cells, differentiated neural stem cells and differentiated cortical neuron cells was polyadenylated with E. coli Poly(A) Polymerase (PAP) (NEB M0276) at 37°C for 15 min and purified with AMPure RNA Clean XP beads. 5 µg of the PAP-treated RNA was reverse transcribed with oligodT_16VN_UMI25_primer (GAGATGTCTCGTGGGCTCGGNNNNNNNNNNNNNNNNNNNNNNNNNCTACGTTTTTTTTTTTTTTTTVN) and PrimeScript II Reverse Transcriptase (Takara Bio) at 42°C for 60 min and purified with RNAClean XP beads. Cap-trapping from the RNA/cDNA hybrids was performed following the published protocol (Takahashi et al., Nature Protocols, 2012 (https://doi.org/10.1038/nprot.2012.005)), and the RNA was digested with RNase H (Takara Bio) at 37°C for 30 min and purified with AMPure XP beads. The 5’ linker (N6 up GTGGTATCAACGCAGAGTACNNNNNN-Phos, GN5 up GTGGTATCAACGCAGAGTACGNNNNN-Phos, down Phos-GTACTCTGCGTTGATACCAC-Phos) was ligated to the cDNA with Mighty Mix (Takara Bio) overnight, and the ligated cDNA was purified with AMPure XP beads. Shrimp Alkaline Phosphatase (Takara Bio) was used to remove phosphates from the ligated linker, and the product was purified with AMPure XP beads. Second-strand synthesis of the 5’ linker-ligated cDNA was performed with KAPA HiFi mix (Roche) and 2nd synthesis primer_UMI15 at 95°C for 5 min, 55°C for 5 min and 72°C for 30 min. Exonuclease I (Takara Bio) was added for primer digestion at 37°C for 30 min, and the cDNA/DNA hybrid was purified with AMPure XP beads and amplified with PrimeSTAR GXL DNA polymerase (Takara Bio) and PCR primers (fwd_CTACACTCGTCGGCAGCGTC, rev_GAGATGTCTCGTGGGCTCGG) for 7 cycles. 
The library was then prepared with the SQK-LSK110 kit (Oxford Nanopore Technologies) according to the manufacturer’s protocol and sequenced on an R9.4 flow cell (FLO-MIN106) in a MinION sequencer. Basecalling was performed with the Guppy v5.0.14 basecaller software provided by Oxford Nanopore Technologies to generate fastq files from FAST5 files. To prepare clean reads from the fastq files, adapter sequences were trimmed with pychopper (https://github.com/nanoporetech/pychopper) using VNP_GAGATGTCTCGTGGGCTCGGNNNNNNNNNNNNNNNCTACG and SSP_CTACACTCGTCGGCAGCGTCNNNNNNNNNNNNNNNNNNNNNNNNNGTGGTATCAACGCAGAGTAC, and the trimmed reads were mapped to our target genes.
Project description:Here we report three complete bacterial genome assemblies from a PacBio shotgun metagenome of a co-culture from Upper Klamath Lake, OR. Genome annotations and culture conditions indicate these bacteria are dependent on carbon and nitrogen fixation from the cyanobacterium Aphanizomenon flos-aquae, whose genome was assembled to draft-quality. Due to their taxonomic novelty relative to previously sequenced bacteria, we have temporarily designated these bacteria as incertae sedis Hyphomonadaceae strain UKL13-1 (3,501,508 bp and 56.12% GC), incertae sedis Betaproteobacterium strain UKL13-2 (3,387,087 bp and 54.98% GC), and incertae sedis Bacteroidetes strain UKL13-3 (3,236,529 bp and 37.33% GC). Each genome consists of a single circular chromosome with no identified plasmids. When compared with binned Illumina assemblies of the same three genomes, there was ~7% discrepancy in total genome length. Gaps where Illumina assemblies broke were often due to repetitive elements. Within these missing sequences were essential genes and genes associated with a variety of functional categories. Annotated gene content reveals that both Proteobacteria are aerobic anoxygenic phototrophs, with Betaproteobacterium UKL13-2 potentially capable of phototrophic oxidation of sulfur compounds. Both proteobacterial genomes contain transporters suggesting they are scavenging fixed nitrogen from A. flos-aquae in the form of ammonium. Bacteroidetes UKL13-3 has few completely annotated biosynthetic pathways, and has a comparatively higher proportion of unannotated genes. The genomes were detected in only a few other freshwater metagenomes, suggesting that these bacteria are not ubiquitous in freshwater systems. Our results indicate that long-read sequencing is a viable method for sequencing dominant members from low-diversity microbial communities, and should be considered for environmental metagenomics when conditions meet these requirements.
Project description:Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies and approaches for de novo assemblies and reference-based mapping approaches. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact and emphasize challenges remaining in achieving long-read sequencing at a population scale.
Project description:Microbial secondary metabolites play crucial roles in microbial competition, communication, resource acquisition, antibiotic production, and a variety of other biotechnological processes. The retrieval of full-length BGC (biosynthetic gene cluster) sequences from uncultivated bacteria is difficult due to the technical constraints of short-read sequencing, making it impossible to determine BGC diversity. Using long-read sequencing and genome mining, 339 mainly full-length BGCs were recovered in this study, illuminating the wide range of BGCs from uncultivated lineages discovered in seawater from Aoshan Bay, Yellow Sea, China. Many extremely diverse BGCs were discovered in bacterial phyla such as Proteobacteria, Bacteroidota, Acidobacteriota, and Verrucomicrobiota as well as the previously uncultured archaeal phylum "Candidatus Thermoplasmatota." The data from metatranscriptomics showed that 30.1% of secondary metabolic genes were being expressed, and they also revealed the expression pattern of BGC core biosynthetic genes and tailoring enzymes. Taken together, our results demonstrate that long-read metagenomic sequencing combined with metatranscriptomic analysis provides a direct view into the functional expression of BGCs in environmental processes. IMPORTANCE Genome mining of metagenomic data has become the preferred method for the bioprospecting of novel compounds by cataloguing secondary metabolite potential. However, the accurate detection of BGCs requires unfragmented genomic assemblies, which have been technically difficult to obtain from metagenomes until recently with new long-read technologies. We used high-quality metagenome-assembled genomes generated from long-read data to determine the biosynthetic potential of microbes found in the surface water of the Yellow Sea. We recovered 339 highly diverse and mostly full-length BGCs from largely uncultured and underexplored bacterial and archaeal phyla. 
Additionally, we present long-read metagenomic sequencing combined with metatranscriptomic analysis as a potential method for gaining access to the largely underutilized genetic reservoir of specialized metabolite gene clusters in the majority of microbes that are not cultured. The combination of long-read metagenomic and metatranscriptomic analyses is significant because it can more accurately assess the mechanisms of microbial adaptation to the environment through BGC expression based on metatranscriptomic data.
Project description:Background: Long-read sequencing in metagenomics facilitates the assembly of complete genomes out of complex microbial communities. These genomes include essential biological information, such as ribosomal genes and mobile genetic elements, which is usually missed with short reads. We applied long-read metagenomics with Nanopore sequencing to retrieve high-quality metagenome-assembled genomes (HQ MAGs) from a dog fecal sample. Results: We used nanopore long-read metagenomics and frameshift-aware correction on a canine fecal sample and retrieved eight single-contig HQ MAGs, which were > 90% complete with < 5% contamination, and contained most ribosomal genes and tRNAs. At the technical level, we demonstrated that a high-molecular-weight DNA extraction improved the metagenomics assembly contiguity, the recovery of the rRNA operons, and the retrieval of longer and circular contigs that are potential HQ MAGs. These HQ MAGs corresponded to Succinivibrio, Sutterella, Prevotellamassilia, Phascolarctobacterium, Catenibacterium, Blautia, and Enterococcus genera. Linking our results to previous gastrointestinal microbiome reports (metagenome or 16S rRNA-based), we found that some bacterial species in the gastrointestinal tract seem to be more canid-specific (Succinivibrio, Prevotellamassilia, Phascolarctobacterium, Blautia_A sp900541345), whereas others are more broadly distributed among animal and human microbiomes (Sutterella, Catenibacterium, Enterococcus, and Blautia sp003287895). The Sutterella HQ MAG is potentially the first reported genome assembly for Sutterella stercoricanis, as assigned by 16S rRNA gene similarity. Moreover, we show that long reads are essential to detect mobilome functions, usually missed in short-read MAGs. Conclusions: We recovered eight single-contig HQ MAGs from the feces of a healthy dog with nanopore long reads. 
We also retrieved relevant biological insights from these specific bacterial species previously missed in public databases, such as complete ribosomal operons and mobilome functions. The high-molecular-weight DNA extraction improved the assembly's contiguity, whereas the high-accuracy basecalling, the raw read error correction, the assembly polishing, and the frameshift correction reduced the insertion and deletion errors. Both experimental and analytical steps ensured the retrieval of complete bacterial genomes.
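The quality criteria quoted above (> 90% complete, < 5% contamination, single contig) can be expressed as a simple filter. This is a sketch only: the `Mag` record and its field names are hypothetical, and the completeness and contamination values would normally come from a separate quality-assessment tool.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    """Hypothetical summary record for one metagenome-assembled genome."""
    name: str
    completeness: float   # percent
    contamination: float  # percent
    n_contigs: int


def is_high_quality(mag, min_completeness=90.0, max_contamination=5.0):
    """Apply the > 90% completeness and < 5% contamination thresholds."""
    return (mag.completeness > min_completeness
            and mag.contamination < max_contamination)


def single_contig_hq(mags):
    """Keep only high-quality MAGs that were assembled as a single contig."""
    return [m for m in mags if is_high_quality(m) and m.n_contigs == 1]
```

The thresholds are exposed as parameters because other studies apply additional requirements, such as the presence of rRNA and tRNA genes, before calling a MAG high quality.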