High throughput error correction using dual nucleotide dimer blocks allows direct single-cell nanopore transcriptome sequencing
ABSTRACT: Droplet-based single-cell sequencing techniques have provided unprecedented insight into cellular heterogeneities within tissues. However, these approaches only allow for the measurement of the distal parts of a transcript following short-read sequencing. Therefore, splicing and sequence diversity information is lost for the majority of the transcript. The application of long-read Nanopore sequencing to droplet-based methods is challenging because of the low base-calling accuracy currently associated with Nanopore sequencing. Although several approaches that use additional short-read sequencing to error-correct the barcode and UMI sequences have been developed, these techniques are limited by the requirement to sequence a library using both short- and long-read sequencing. Here we introduce a novel approach termed single-cell Barcode UMI Correction sequencing (scBUC-seq) to efficiently error-correct barcode and UMI oligonucleotide sequences synthesized by using blocks of dimeric nucleotides. The method can be applied to correct both short-read and long-read sequencing, thereby allowing users to recover more reads per cell and permitting direct single-cell Nanopore sequencing for the first time. We illustrate our method by using species-mixing experiments to evaluate barcode assignment accuracy, multiple myeloma cell lines to evaluate differential isoform usage, and Ewing’s sarcoma cells to demonstrate Ig fusion transcript analysis.
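As a rough illustration of why dimer blocks help with error correction, the sketch below assumes that every logical barcode or UMI base is synthesized as a homodimer (AA, CC, GG or TT), so that the two positions of each block should agree. This encoding, the function name `collapse_dimer_blocks` and the use of `N` for disagreeing blocks are assumptions made for illustration only; they are not taken from the scBUC-seq implementation, and the sketch ignores the insertions and deletions that real Nanopore reads contain.

```python
def collapse_dimer_blocks(seq):
    """Collapse a dimer-encoded sequence into one base per block.

    Assumption (hypothetical encoding): each logical base was synthesized
    as a homodimer, so positions 2k and 2k+1 of `seq` should be identical.
    Blocks whose two bases disagree are reported as 'N'.
    Returns the collapsed sequence and the number of disagreeing blocks.
    """
    bases = []
    mismatches = 0
    # Walk the read two positions at a time; a trailing unpaired base is ignored.
    for i in range(0, len(seq) - 1, 2):
        first, second = seq[i], seq[i + 1]
        if first == second:
            bases.append(first)
        else:
            bases.append("N")
            mismatches += 1
    return "".join(bases), mismatches


clean = collapse_dimer_blocks("AACCGGTT")
noisy = collapse_dimer_blocks("AACCGATT")
```

Under this assumption, a single substitution inside a block is detectable because the two halves of the block no longer match, whereas the same error in a conventional monomer barcode would silently produce a different valid-looking sequence.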
Project description:Single-cell transcriptomics, reliant on the incorporation of barcodes and unique molecular identifiers (UMIs) into captured polyA+ mRNA, faces a significant challenge due to synthesis errors in oligonucleotide capture sequences. These inaccuracies, which are especially problematic in long-read sequencing, impair the precise identification of sequences and result in inaccuracies in UMI deduplication. To mitigate this issue, we have modified the oligonucleotide capture design, which integrates an interposed anchor between the barcode and UMI, and a 'V' base anchor adjacent to the polyA capture region. This configuration is devised to ensure compatibility with both short and long-read sequencing technologies, facilitating improved UMI recovery and enhanced feature detection, thereby improving the efficacy of droplet-based sequencing methods.
Project description:Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
Project description:Here we compare the performance of three approaches (inDrop, Drop-seq and 10x) using the same kind of sample with a unified data processing pipeline. We generated 2-3 replicates for each method using the lymphoblastoid cell line GM12891. The average sequencing depth was around 50-60k reads per cell barcode. We also developed a versatile and rapid data processing workflow and applied it to all datasets. Cell capture efficiency, effective read ratio, barcode detection error and transcript detection sensitivity were analyzed as well.
Project description:In this study, 7530 newborn pancreatic β-cells were analyzed by single-cell sequencing. Cell Ranger was used to align the raw sequencing data to the genome, filter out background cells, count transcript UMIs, and use the cell barcodes to generate the gene-barcode matrix. The samples were then grouped for gene expression analysis and related analyses, and summary statistics of the sequencing data were output for each sample.
Project description:Nanopore sequencing has revolutionized genetic analysis by offering linkage information across megabase-scale genomes. However, the high intrinsic error rate of nanopore sequencing impedes the analysis of complex heterogeneous samples, such as viruses, bacteria, and edited cell lines. Achieving high accuracy in single-molecule sequence identification would significantly advance the study of quasi-species genomic populations, crucial for fields like immunology, pathology, epidemiology, and synthetic biology, where clonal isolation is traditionally employed for complete genomic frequency analysis. Here, we introduce ConSeqUMI, an innovative experimental and analytical pipeline designed to address long-read sequencing error rates using unique molecular indices for precise consensus sequence determination. ConSeqUMI processes nanopore sequencing data without the need for reference sequences, enabling accurate assembly of individual molecular sequences from complex mixtures. We establish robust benchmarking criteria for this platform’s performance and demonstrate its utility across diverse experimental contexts, including mixed plasmid pools, recombinant adeno-associated virus genome integrity, and CRISPR/Cas9-induced genomic alterations. Furthermore, ConSeqUMI enables detailed profiling of human pathogenic infections, as shown by our analysis of SARS-CoV-2 spike protein variants, revealing substantial intra-patient genetic heterogeneity. Lastly, we demonstrate how individual clonal isolates can be extracted directly from sequencing libraries at low cost, allowing for post-sequencing identification validation of observed variants. Our findings highlight the robustness of ConSeqUMI in processing sequencing data from degenerate UMI-labeled molecules, offering a critical tool for advancing genomic research.
Project description:Monosome and disome profiling was performed on Flag-STAU1 Flp-In 293 T-REx to study the causes of ribosomal collisions, and whether this may be modulated by the presence/absence of Staufen-1. Cells were treated with either an siRNA targeting STAU1 transcript (4x samples) or a control siRNA (2x samples). Two of the four samples treated with the STAU1 siRNA had siRNA-resistant STAU1 mRNA expression induced by doxycycline (rescue). Sequencing libraries from monosome and disome fractions were generated in parallel from the same samples. Note that unique molecular identifiers/random barcodes (UMIs/RBCs) were included in the sequencing experiment. Each UMI has been moved to the fastq read name of each read. For example "xxxxxxrbc:AGCCAAT" in the read name signifies that the given read had a UMI of "AGCCAAT". Using these UMIs, PCR duplicates can be removed with UMI-Tools following read alignment.
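The read-name convention described above can be parsed with a few lines of Python. The sketch below is a simplified illustration, not the UMI-Tools algorithm: it pulls the UMI that follows `rbc:` out of a read name and then keeps one read per combination of alignment position and UMI using exact matching only. The function names and the `(name, chrom, pos, strand)` tuple layout are hypothetical.

```python
import re

# Matches the UMI that follows the "rbc:" tag in a read name.
RBC_PATTERN = re.compile(r"rbc:([ACGTN]+)")


def extract_umi(read_name):
    """Return the UMI embedded in a read name, or None if no tag is present."""
    match = RBC_PATTERN.search(read_name)
    return match.group(1) if match else None


def remove_exact_duplicates(reads):
    """Keep the first read seen for each (chrom, pos, strand, UMI) key.

    `reads` is an iterable of (read_name, chrom, pos, strand) tuples,
    a hypothetical stand-in for aligned records. Only identical UMIs are
    treated as duplicates; UMIs that differ by a sequencing error are not merged.
    """
    seen = set()
    kept = []
    for name, chrom, pos, strand in reads:
        key = (chrom, pos, strand, extract_umi(name))
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


example_reads = [
    ("read1rbc:AGCCAAT", "chr1", 100, "+"),
    ("read2rbc:AGCCAAT", "chr1", 100, "+"),  # same position and UMI: PCR duplicate
    ("read3rbc:TTGACCA", "chr1", 100, "+"),  # same position, different UMI: kept
]
unique_names = remove_exact_duplicates(example_reads)
```

In practice a dedicated tool is preferable because it can also merge UMIs that differ only by sequencing errors, which exact matching cannot do.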
Project description:Microsatellites are short tandem repeats (STRs) of a motif of 1 to 6 nucleotides that are ubiquitous in almost all genomes and widely used in many biomedical applications. However, despite the development of next-generation sequencing (NGS) over the past two decades and the arrival of new technologies on the market, accurately sequencing and genotyping STRs, particularly homopolymers, remains very challenging due to several technical limitations. In many cases this leads to erroneous allele calls and difficulty in correctly identifying the genuine allele distribution in a sample. In the present study, we assessed the capability of several second- and third-generation sequencing approaches to correctly determine the length of microsatellites using plasmids containing A/T homopolymers, AC/TG or AT/TA dinucleotide STRs of variable length. Standard PCR-free and PCR-containing, single Unique Molecular Index (UMI) and dual UMI ‘duplex sequencing’ protocols were evaluated using Illumina short-read sequencing, and two PCR-free protocols using PacBio and nanopore long-read sequencing. Several bioinformatics algorithms were developed to correctly identify microsatellite alleles from sequencing data, including four and two modes for generating standard and combined consensus alleles, respectively. We provide a detailed analysis and comparison of these approaches and make several recommendations for the accurate determination of microsatellite allele length.