Project description:Liquid chromatography coupled to tandem mass spectrometry has become the main method for high-throughput identification and quantification of peptides and the inferred proteins. Discovery proteomics commonly employs data-dependent acquisition in combination with spectrum-centric analysis. The accumulation of data generated from thousands of samples by this method has approached saturation coverage of different proteomes. Recently, as a result of technological advances, methods based on data acquisition strategies compatible with peptide-centric scoring have also reached similar proteome coverage in individual runs, as well as comparable scalability. This is exemplified by SWATH-MS, which combines data-independent acquisition (DIA) with targeted data extraction of groups of transitions uniquely detecting a peptide. As the data matrices generated by these experiments continue to grow with respect to both the number of peptides identified per sample and the number of samples analyzed per study, challenges for error rate control have emerged. Here, we discuss the adaptation of statistical concepts developed for discovery proteomics based on spectrum-centric scoring to large-scale DIA experiments analyzed with peptide-centric scoring strategies, and provide some guidance on their application. We propose that, in order to increase the quality and reproducibility of published proteomic results, well-established confidence criteria should be reported at each level as we progress from spectral evidence to identified or detected peptides and inferred proteins. These confidence criteria should equally be applied to proteomic analyses based on spectrum- and peptide-centric scoring strategies.
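To illustrate the kind of confidence criteria discussed above, the sketch below shows a minimal target-decoy false discovery rate (FDR) estimate over scored peptide queries. The function names and the simple decoy-counting estimator are illustrative assumptions for this sketch, not the specific statistical procedure of any particular spectrum- or peptide-centric pipeline.

```python
# Minimal sketch: target-decoy FDR estimation for scored peptide queries.
# Assumes each query carries a score (higher = better) and a decoy flag;
# the estimator FDR ~= (#decoys passing) / (#targets passing) is a common
# simplification, not the exact method used by any specific DIA tool.

def estimate_fdr(scores, is_decoy, threshold):
    """Estimate FDR among target hits scoring at or above `threshold`."""
    targets = sum(1 for s, d in zip(scores, is_decoy) if not d and s >= threshold)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if d and s >= threshold)
    return decoys / targets if targets else 0.0

def score_cutoff_for_fdr(scores, is_decoy, max_fdr=0.01):
    """Return the most permissive score cutoff with estimated FDR <= max_fdr."""
    for cutoff in sorted(set(scores)):
        if estimate_fdr(scores, is_decoy, cutoff) <= max_fdr:
            return cutoff
    return None

# Toy usage with hand-made scores (True = decoy hit).
scores = [9.1, 8.7, 8.2, 7.9, 5.5, 5.1, 4.8, 3.2]
is_decoy = [False, False, False, True, False, True, True, True]
cutoff = score_cutoff_for_fdr(scores, is_decoy, max_fdr=0.01)
print(cutoff, estimate_fdr(scores, is_decoy, cutoff))  # 8.2 0.0
```

The same counting scheme can be reported separately at the peptide and inferred-protein levels, which is the multi-level reporting the abstract argues for.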
Project description:Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
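The sketch below is not ESPRESSO's algorithm; it only illustrates the general idea of pooling error-prone long reads per gene and snapping noisy splice-junction coordinates to the best-supported position. The `collapse_junctions` helper and its tolerance parameter are hypothetical simplifications.

```python
# Illustrative sketch (not ESPRESSO's actual algorithm): pool splice junctions
# observed across all long reads of a gene and absorb nearby, error-shifted
# junction coordinates into the best-supported position within a tolerance.
from collections import Counter

def collapse_junctions(observed_junctions, tolerance=10):
    """observed_junctions: list of (donor, acceptor) coordinates from individual reads."""
    counts = Counter(observed_junctions)
    consensus = {}
    # Process junctions from most to least supported.
    for junction, _ in counts.most_common():
        match = next(
            (c for c in consensus
             if abs(c[0] - junction[0]) <= tolerance
             and abs(c[1] - junction[1]) <= tolerance),
            None,
        )
        key = match if match is not None else junction
        consensus[key] = consensus.get(key, 0) + counts[junction]
    return consensus

reads = [(100, 500), (101, 500), (100, 498), (100, 500), (900, 1300)]
print(collapse_junctions(reads))  # {(100, 500): 4, (900, 1300): 1}
```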
Project description:Coverage of short-read RNA-seq is highly non-uniform across transcripts even in genes with only one expressed isoform, contrary to biological expectation. We investigate the impact of several library preparation factors on the non-uniformity of coverage. Specifically, a mouse liver sample was prepared with varying RNA selection and ribosomal depletion methods (PolyA / rRNA digestion / rRNA pull-down / NoSelection), PCR ramp rates, and fragment lengths, along with one heart sample without selection.
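As a minimal sketch of how coverage non-uniformity along a transcript can be quantified, the snippet below uses the coefficient of variation (CV) of per-base read depth; the choice of CV as the metric is an illustrative assumption, not necessarily the measure used in this study.

```python
# Minimal sketch: quantify coverage non-uniformity along one transcript as the
# coefficient of variation (CV) of per-base read depth (0 = perfectly uniform).
import statistics

def coverage_cv(depths):
    """depths: per-base read depth along a single transcript (list of ints)."""
    mean = statistics.mean(depths)
    if mean == 0:
        return float("nan")
    return statistics.pstdev(depths) / mean

uniform = [50] * 10
skewed = [5, 5, 10, 20, 80, 150, 150, 60, 15, 5]
print(coverage_cv(uniform), round(coverage_cv(skewed), 2))  # 0.0 vs ~1.11
```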
Project description:Nanopore sequencing has revolutionized genetic analysis by offering linkage information across megabase-scale genomes. However, the high intrinsic error rate of nanopore sequencing impedes the analysis of complex heterogeneous samples, such as viruses, bacteria, and edited cell lines. Achieving high accuracy in single-molecule sequence identification would significantly advance the study of quasi-species genomic populations, crucial for fields like immunology, pathology, epidemiology, and synthetic biology, where clonal isolation is traditionally employed for complete genomic frequency analysis. Here, we introduce ConSeqUMI, an innovative experimental and analytical pipeline designed to address long-read sequencing error rates using unique molecular indices for precise consensus sequence determination. ConSeqUMI processes nanopore sequencing data without the need for reference sequences, enabling accurate assembly of individual molecular sequences from complex mixtures. We establish robust benchmarking criteria for this platform’s performance and demonstrate its utility across diverse experimental contexts, including mixed plasmid pools, recombinant adeno-associated virus genome integrity, and CRISPR/Cas9-induced genomic alterations. Furthermore, ConSeqUMI enables detailed profiling of human pathogenic infections, as shown by our analysis of SARS-CoV-2 spike protein variants, revealing substantial intra-patient genetic heterogeneity. Lastly, we demonstrate how individual clonal isolates can be extracted directly from sequencing libraries at low cost, allowing for post-sequencing identification validation of observed variants. Our findings highlight the robustness of ConSeqUMI in processing sequencing data from degenerate UMI-labeled molecules, offering a critical tool for advancing genomic research.
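The following is a toy illustration of UMI-based consensus calling, not the ConSeqUMI pipeline itself: reads sharing a unique molecular index are grouped, and a per-position majority vote yields one consensus sequence per original molecule. The assumption that grouped reads are already aligned to equal length is a simplification; a real pipeline would align and trim first.

```python
# Toy illustration of UMI-grouped consensus calling (not the ConSeqUMI pipeline):
# reads sharing a UMI are grouped, and a per-position majority vote over
# equal-length reads yields one consensus sequence per molecule.
from collections import Counter, defaultdict

def consensus_by_umi(reads):
    """reads: iterable of (umi, sequence) pairs; sequences within a UMI group
    are assumed to be pre-aligned to the same length (a simplification)."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("AAGT", "ACGTAC"), ("AAGT", "ACGTAC"), ("AAGT", "ACCTAC"),  # one molecule, one sequencing error
    ("CCTA", "TTGACA"), ("CCTA", "TTGACA"),
]
print(consensus_by_umi(reads))  # {'AAGT': 'ACGTAC', 'CCTA': 'TTGACA'}
```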
Project description:To determine the error rate of mitochondrial transcription, we analyzed 33 and 37 million reads, respectively, from flies overexpressing wild-type (WT) and mutant (E423P) mitochondrial RNA polymerase (POLRMT), and found that the error frequency of mitochondrial transcripts was over 5-fold higher in E423P flies than in WT. To gain more insight into the molecular mechanisms that drive the error rate of transcription by POLRMT, we examined the distribution of errors along the mitochondrial genome. We also evaluated mitochondrial RNA processing by quantifying the frequency of single reads spanning two adjacent genes. There was no significant increase in unprocessed RNAs in E423P flies compared with WT. These observations indicate that overexpression of E423P POLRMT in adult flies leads to a statistically significant increase in mitochondrial transcript errors.
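A minimal sketch of the two measurements described above is given below, on simplified inputs: transcript error frequency as mismatches per aligned base, and the fraction of reads spanning two adjacent genes as a proxy for unprocessed RNA. The pre-computed alignment tuples and the example numbers are illustrative assumptions, not data from the study.

```python
# Minimal sketch (simplified inputs, not real alignment data):
# (1) transcript error frequency = mismatches per aligned base, and
# (2) fraction of reads spanning two adjacent genes (proxy for unprocessed RNA).

def error_frequency(alignments):
    """alignments: list of (mismatch_count, aligned_length) per read."""
    mismatches = sum(m for m, _ in alignments)
    bases = sum(l for _, l in alignments)
    return mismatches / bases if bases else 0.0

def unprocessed_fraction(read_spans, gene_boundaries):
    """read_spans: (start, end) per read; gene_boundaries: positions separating
    adjacent genes. A read counts as 'unprocessed' if it covers any boundary."""
    spanning = sum(
        1 for start, end in read_spans
        if any(start < b < end for b in gene_boundaries)
    )
    return spanning / len(read_spans) if read_spans else 0.0

wt = [(2, 1000), (1, 950), (3, 1100)]
mutant = [(10, 1000), (10, 950), (10, 1100)]
print(error_frequency(wt), error_frequency(mutant))  # 5-fold higher in the mutant

spans = [(10, 480), (300, 700), (520, 900)]   # single gene boundary at 500
print(unprocessed_fraction(spans, [500]))     # 1/3 of reads span the boundary
```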