Project description:The use of reference DNA standards generated from cancer cell lines sequenced in the Cancer Genome Project to establish the sensitivity, specificity, accuracy and reproducibility of the WTSI GCLP sequencing pipeline
Project description:The recent SARS-CoV-2 omicron variant presented significant challenges to the global effort to counter the pandemic. SARS-CoV-2 is predicted to remain prevalent in the coming months, making the ability to identify SARS-CoV-2 variants imperative in understanding and controlling the pandemic. The predominant variant discovery method, genome sequencing, is time-consuming, insensitive, and expensive. Liquid chromatography-mass spectrometry (LC-MS) offers an exciting alternative detection modality provided that variant-containing peptide markers become well-established. This study demonstrates the potential to establish SARS-CoV-2 peptide markers by examining amino-acid variant-containing tryptic peptides, their MS fragmentation intensities, and their detection sensitivity in MS experiments. We have synthesized model tryptic peptides from SARS-CoV-2 variants beta, gamma, delta, and omicron and evaluated their signal intensity, HCD spectra, and reverse-phase retention time.
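As an illustration of what "variant-containing tryptic peptides" means in practice, the following minimal sketch performs an in silico tryptic digest and lists the peptides that appear only in a variant sequence. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P, with no missed cleavages). The function names and both toy sequences are hypothetical and are not taken from the study or from any real SARS-CoV-2 protein.

```python
import re


def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence using the simplified trypsin rule.

    Assumption: cleavage after K or R, except when followed by P,
    and no missed cleavages are considered.
    """
    # Zero-width split point: preceded by K or R, not followed by P.
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    # A cleavage site at the C-terminus yields an empty trailing fragment.
    return [f for f in fragments if f]


def variant_peptides(reference: str, variant: str) -> list[str]:
    """Return peptides produced by the variant but not by the reference."""
    reference_set = set(tryptic_digest(reference))
    return [p for p in tryptic_digest(variant) if p not in reference_set]


# Hypothetical toy sequences: a single N -> K substitution at position 6.
REFERENCE = "MKTAYNLDSKVGR"
VARIANT = "MKTAYKLDSKVGR"

if __name__ == "__main__":
    print(tryptic_digest(REFERENCE))
    print(variant_peptides(REFERENCE, VARIANT))
```

In this toy case the substitution introduces a new cleavage site, so the variant yields two peptides that are absent from the reference digest. A real marker search would also need to account for missed cleavages and modifications, which this sketch ignores.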
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina.
Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
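The "units" of base quality in the description above are Phred-scaled: a score Q corresponds to an error probability of 10^(-Q/10). Because the sequence of a spike-in standard is known, mismatches in reads mapped to it can be counted as sequencing errors, which gives an empirical quality to compare against the reported one. The sketch below illustrates that comparison. The function names and the add-one smoothing are illustrative assumptions, not the GATK implementation or the authors' code.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def empirical_quality(mismatches: int, bases: int) -> float:
    """Estimate a Phred-scaled quality from mismatch counts on spike-in reads.

    Assumption: add-one smoothing keeps the estimate finite when no
    mismatches are observed; this is a sketch, not GATK's exact formula.
    """
    error_rate = (mismatches + 1) / (bases + 2)
    return -10 * math.log10(error_rate)


def quality_overestimate(reported_q: float, mismatches: int, bases: int) -> float:
    """Positive values mean the reported score is higher than the data support."""
    return reported_q - empirical_quality(mismatches, bases)


if __name__ == "__main__":
    # Hypothetical counts: 9 mismatches in 9,998 bases reported at Q35.
    print(phred_to_error_prob(35))
    print(empirical_quality(9, 9998))
    print(quality_overestimate(35, 9, 9998))
```

With these hypothetical counts the reported score is about 5 units higher than the empirical estimate, the kind of gap that recalibration is meant to close. Grouping the counts by covariates such as dinucleotide context or machine cycle would show where systematic errors concentrate.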