Project description:The LRGASP challenge encompasses different human, mouse, and manatee samples sequenced using multiple combinations of protocols and platforms. Different challenges will use distinct subsets of the samples for evaluation. The long-read sequencing platforms used in these challenges are the Pacific Biosciences (PacBio) Sequel II, Oxford Nanopore (ONT) MinION and PromethION. Samples will also be sequenced on the Illumina HiSeq 2500. The primary LRGASP library prep protocols are “standard” cDNA sequencing, direct RNA sequencing, R2C2, and CapTrap. Each sample will also include Lexogen SIRV-Set 4 spike-ins. We will also provide simulated PacBio and ONT data as part of the evaluations. This particular study focuses on single strand CAGE sequencing of human iPSCs, defining CAGE peaks from Illumina HiSeq 2500 (SR: 150 cycles) of two biological replicates for use in the LRGASP challenge.
Project description:The purpose of this work was to describe a computational and analytical methodology for profiling small RNA by high-throughput sequencing. The datasets here were used to develop synthetic oligoribonucleotides as spike-in standards.
Project description:The purpose of this work was to describe a computational and analytical methodology for profiling small RNA by high-throughput sequencing. The datasets here were used to develop synthetic oligoribonucleotides as spike-in standards. We assessed the use of synthetic oligoribonucleotide standards as spike-in controls. These standards provide an objective reference against which samples can be compared. Standards were added to the total RNA (100 µg) in the following amounts: Std2 (TATATGCAAGTCCGGCCATAC) 0.01 pmol, Std3 (TAGCTAACGCATATCCGCATC) 0.1 pmol, Std6 (TGAAGCTGACATCGGTCATCC) 1.0 pmol.
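As an illustration of how spike-in standards like these could serve as a reference, the following is a minimal sketch, not the authors' pipeline. It assumes reads are already adapter-trimmed, counts only exact matches to the three standard sequences, and assumes a log-log linear relation between spiked amount and read count. The function names (count_standards, fit_log_log, estimate_pmol) are hypothetical.

```python
import math

# Spike-in sequences and amounts (pmol per 100 ug total RNA), as listed in the description.
STANDARDS = {
    "Std2": ("TATATGCAAGTCCGGCCATAC", 0.01),
    "Std3": ("TAGCTAACGCATATCCGCATC", 0.1),
    "Std6": ("TGAAGCTGACATCGGTCATCC", 1.0),
}


def count_standards(reads):
    """Count reads that exactly match each spike-in sequence (assumes trimmed reads)."""
    seq_to_name = {seq: name for name, (seq, _) in STANDARDS.items()}
    counts = {name: 0 for name in STANDARDS}
    for read in reads:
        name = seq_to_name.get(read.strip().upper())
        if name is not None:
            counts[name] += 1
    return counts


def fit_log_log(counts):
    """Least-squares fit of log10(count) = slope * log10(pmol) + intercept."""
    xs, ys = [], []
    for name, (_, pmol) in STANDARDS.items():
        if counts[name] > 0:  # log10 is undefined for zero counts
            xs.append(math.log10(pmol))
            ys.append(math.log10(counts[name]))
    if len(xs) < 2:
        raise ValueError("need at least two detected standards to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_pmol(count, slope, intercept):
    """Invert the fitted line to estimate the amount behind a read count."""
    return 10 ** ((math.log10(count) - intercept) / slope)
```

Once the line is fitted on the standards, a read count for any small RNA in the same library can be passed to estimate_pmol to place it on the scale defined by the spike-ins. Whether the response is actually linear in log space would need to be checked on real data.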
Project description:The phi X 174 bacteriophage was first sequenced in 1977, and has since become the most widely used standard in molecular biology and next-generation sequencing. However, with the advent of affordable DNA synthesis and de novo gene design, we considered whether we could engineer a synthetic genome, termed SynX, specifically tailored for use as a universal molecular standard. The SynX genome encodes 21 synthetic genes that can be in vitro transcribed to generate matched mRNA controls, and in vitro translated to generate matched protein controls. This enables the use of SynX as a matched control to compare across genomic, transcriptomic and proteomic experiments. The synthetic genes provide qualitative controls that measure sequencing accuracy across k-mers, GC-rich and repeat sequences, as well as act as quantitative controls that measure sensitivity and quantitative accuracy. We show how the SynX genome can measure DNA sequencing, evaluate gene expression in RNA sequencing experiments, or quantify proteins in mass spectrometry. Unlike previous spike-in controls, the SynX DNA, RNA and protein controls can be independently and sustainably prepared by recipient laboratories using common molecular biology techniques, and widely shared as a universal molecular standard.
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina.
Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
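The idea of recalibrating against spike-ins can be sketched as follows. Because the spike-in sequences are known, every mismatch in a read mapped to them can be treated as a sequencing error, so an empirical quality can be computed for each reported quality value. This is a minimal sketch and not GATK's implementation: the input format (pairs of reported quality and a mismatch flag) and the add-one smoothing are assumptions made here for illustration. The only fixed fact used is the Phred definition Q = -10 log10(p).

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Phred-scaled quality: Q = -10 * log10(p)."""
    return -10.0 * math.log10(error_probability)


def empirical_quality_by_reported(observations):
    """Map each reported quality to an empirical quality.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base
    aligned to a spike-in reference (hypothetical input format).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        if is_mismatch:
            errors[reported_q] += 1
    table = {}
    for reported_q, n in totals.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no errors are seen.
        p = (errors[reported_q] + 1) / (n + 2)
        table[reported_q] = phred(p)
    return table
```

A reported quality of 30 that maps to an empirical quality near 20 would indicate that the sequencer overestimates quality for that bin. A real recalibration would also stratify by covariates such as machine cycle and dinucleotide context, which the description names as sources of SSEs.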
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
2012-03-03 | GSE36217 | GEO
Project description:Sequencing of a synthetic spike-in control with complex variants