Project description:Third-generation sequencing technologies provided by Pacific Biosciences and Oxford Nanopore Technologies generate read lengths in the scale of kilobasepairs. However, these reads display high error rates, and correction steps are necessary to realize their great potential in genomics and transcriptomics. Here, we compare properties of PacBio and Nanopore data and assess correction methods by Canu, MARVEL and proovread in various combinations. We found total error rates of around 13% in the raw datasets. PacBio reads showed a high rate of insertions (around 8%) whereas Nanopore reads showed similar rates for substitutions, insertions and deletions of around 4% each. In data from both technologies the errors were uniformly distributed along reads apart from noisy 5' ends, and homopolymers appeared among the most over-represented kmers relative to a reference. Consensus correction using read overlaps reduced error rates to about 1% when using Canu or MARVEL after patching. The lowest error rate in Nanopore data (0.45%) was achieved by applying proovread on MARVEL-patched data including Illumina short-reads, and the lowest error rate in PacBio data (0.42%) was the result of Canu correction with minimap2 alignment after patching. Our study provides valuable insights and benchmarks regarding long-read data and correction methods.
Project description:Single-cell proteomics by mass spectrometry (MS) is emerging as a powerful and unbiased method for the characterization of biological heterogeneity. So far, it has been limited to cultured cells, whereas an expansion of the method to complex tissues would greatly enhance biological insights. Here we describe single-cell Deep Visual Proteomics (scDVP), a technology that integrates high-content imaging, laser microdissection and multiplexed MS. scDVP resolves the context-dependent, spatial proteome of murine hepatocytes at a current depth of 1,700 proteins from a slice of a cell. Half of the proteome was differentially regulated in a spatial manner, with protein levels changing dramatically in proximity to the central vein. We applied machine learning to proteome classes and images, which subsequently inferred the spatial proteome from imaging data alone. scDVP is applicable to healthy and diseased tissues and complements other spatial proteomics or spatial omics technologies.
Project description:Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled - one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of eight long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, NextDenovo/NextPolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v2.1 produced reliable assemblies and was good with plasmids, but it performed poorly with circularisation and had the longest runtimes of all assemblers tested. Flye v2.8 was also reliable and made the smallest sequence errors, though it used the most RAM. Miniasm/Minipolish v0.3/v0.1.3 was the most likely to produce clean contig circularisation. NECAT v20200803 was reliable and good at circularisation but tended to make larger sequence errors. NextDenovo/NextPolish v2.3.1/v1.3.1 was reliable with chromosome assembly but bad with plasmid assembly. Raven v1.3.0 was reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.7.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish, NextDenovo/NextPolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
Project description:These data were used in the spatial transcriptomics analysis of the article titled "Single-Cell and Spatial Transcriptomics Analysis of Human Adrenal Aging".
Project description:A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
Project description:Spatial localization is a key determinant of cellular fate and behavior, but spatial RNA assays traditionally rely on staining for a limited number of RNA species. In contrast, single-cell RNA-seq allows for deep profiling of cellular gene expression, but established methods separate cells from their native spatial context. Here we present Seurat, a computational strategy to infer cellular localization by integrating single-cell RNA-seq data with in situ RNA patterns. We applied Seurat to spatially map 851 single cells from dissociated zebrafish (Danio rerio) embryos, inferring a transcriptome-wide map of spatial patterning. We confirmed Seurat’s accuracy using several experimental approaches, and used it to identify a set of archetypal expression patterns and spatial markers. Additionally, Seurat correctly localizes rare subpopulations, accurately mapping both spatially restricted and scattered groups. Seurat will be applicable to mapping cellular localization within complex patterned tissues in diverse systems. We generated single-cell RNA-seq profiles from dissociated cells from developing zebrafish embryos (late blastula stage, 50% epiboly).
Project description:Background: Sequencing quality has improved over the last decade for long-reads, allowing for more accurate detection of somatic low-frequency variants. In this study, we used mixtures of mitochondrial samples with different haplogroups (i.e., a specific set of mitochondrial variants) to investigate the applicability of nanopore sequencing for low-frequency single nucleotide variant detection. Methods: We investigated the impact of base-calling, alignment/mapping, quality control steps, and variant calling by comparing the results to a previously derived short-read gold standard generated on the Illumina NextSeq. For nanopore sequencing, six mixtures of four different haplotypes were prepared, allowing us to reliably check for expected variants at the predefined 5%, 2%, and 1% mixture levels. We used two different versions of Guppy for base-calling, two aligners (i.e., Minimap2 and Ngmlr), and three variant callers (i.e., Mutserve2, Freebayes, and Nanopanel2) to compare low-frequency variants. We used F1 score measurements to assess the performance of variant calling. Results: We observed a mean read length of 11 kb and a mean overall read quality of 15. Ngmlr showed not only higher F1 scores but also higher allele frequencies (AF) of false-positive calls across the mixtures (mean F1 score = 0.83; false-positive allele frequencies < 0.17) compared to Minimap2 (mean F1 score = 0.82; false-positive AF < 0.06). Mutserve2 had the highest F1 scores (5% level: F1 score >0.99, 2% level: F1 score >0.54, and 1% level: F1 score >0.70) across all callers and mixture levels. Conclusion: We here present the benchmarking for low-frequency variant calling with nanopore sequencing by identifying current limitations.
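The F1 score used above to rate variant callers can be illustrated with a minimal sketch. It assumes that variants are represented as hypothetical (position, alternate base) pairs and that a call counts as a true positive only when it exactly matches an expected variant; the study's own matching rules and tooling are not described in this record, and the positions below are made up for illustration.

```python
def f1_score(expected, called):
    """Return the F1 score of a call set against an expected variant set.

    Both arguments are iterables of hashable variant keys; the
    (position, alt_base) tuples used below are an assumed representation.
    """
    expected, called = set(expected), set(called)
    true_pos = len(expected & called)    # expected variants that were called
    false_pos = len(called - expected)   # calls not in the expected set
    false_neg = len(expected - called)   # expected variants that were missed
    if true_pos == 0:
        return 0.0                       # avoids division by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: four expected variants, three recovered, one false call.
expected_variants = {(73, "G"), (263, "G"), (750, "G"), (1438, "G")}
called_variants = {(73, "G"), (263, "G"), (750, "G"), (16519, "C")}
score = f1_score(expected_variants, called_variants)
```

With three true positives, one false positive and one false negative, precision and recall are both 0.75, so the F1 score is also 0.75.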
Project description:We systematically benchmarked 8 single-cell ATAC sequencing technologies for their capacity to generate high-quality single-cell open chromatin profiles, and to elucidate the regulatory landscape of complex samples. Our study contains 47 individual human PBMC scATAC-seq experiments from a reference male and female donor. To streamline data analysis, we devised PUMATAC (https://github.com/aertslab/PUMATAC), a flexible and universal data analysis pipeline and best practices repository for scATAC-seq. Systematic technology-specific differences in sequencing library complexity and biases in tagmentation specificity were found to impact the accuracy of cell type annotation, genotype demultiplexing, peak calling, differential region accessibility, and motif enrichment. Together, our data form a new scATAC-seq reference of more than 169,000 PBMC cells with matched single-cell multiome and RNA-seq data.
Project description:Background: Studies in vertebrate genomics require sampling from a broad range of tissue types, taxa, and localities. Recent advancements in long-read and long-range genome sequencing have made it possible to produce high-quality chromosome-level genome assemblies for almost any organism. However, adequate tissue preservation for the requisite ultra-high molecular weight DNA (uHMW DNA) remains a major challenge. Here we present a comparative study of preservation methods for field and laboratory tissue sampling, across vertebrate classes and different tissue types. Results: We find that storage temperature was the strongest predictor of uHMW fragment lengths. While immediate flash-freezing remains the sample preservation gold standard, samples preserved in 95% EtOH or 20-25% DMSO-EDTA showed little fragment length degradation when stored at 4°C for 6 hours. Samples in 95% EtOH or 20-25% DMSO-EDTA kept at 4°C for 1 week after dissection still yielded adequate amounts of uHMW DNA for most applications. Tissue type was a significant predictor of total DNA yield but not fragment length. Preservation solution had a smaller but significant influence on both fragment length and DNA yield. Conclusion: We provide sample preservation guidelines that ensure sufficient DNA integrity and amount required for use with long-read and long-range sequencing technologies across vertebrates. Our best practices generated the uHMW DNA needed for the high-quality reference genomes for phase 1 of the Vertebrate Genomes Project, whose ultimate mission is to generate chromosome-level reference genome assemblies of all ∼70,000 extant vertebrate species.
Project description:We have created a synthetic crosslinked peptide library to benchmark crosslinking mass spectrometry search engines. The unique benefit of the library is knowing which identified crosslinks are true and which are false. The data collected from mass spectrometry measurements of the peptide library were used to assess the most frequently used search algorithms. The datasets included will provide an important resource for the crosslinking community to evaluate and optimise search engines, results from which have far-reaching implications.