
Dataset Information

Comparison of systematic sequencing errors using spike-in standards

ABSTRACT: While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with a dynamic range of ERCC spike-in standards were sequenced on SOLiD.
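
The comparison between reported and observed base quality described in the abstract can be illustrated with a minimal Python sketch. It is not the GATK implementation. It assumes that each base aligned to a spike-in standard has already been reduced to a pair (reported quality, mismatch flag), and the function names and the add-one smoothing are illustrative choices, not taken from the study.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(observations):
    """Estimate the empirical quality for each reported quality value.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base
    aligned to a spike-in standard whose true sequence is known.
    Returns a dict {reported_q: empirical_q}.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    # Assumption: add-one smoothing keeps bins with zero mismatches finite.
    return {q: phred((m + 1) / (n + 2)) for q, (m, n) in counts.items()}


# Toy data: 998 bases reported at Q30, one of which mismatches the standard.
toy = [(30, True)] + [(30, False)] * 997
result = empirical_quality(toy)
print(round(result[30], 2))  # 26.99, i.e. the reported Q30 is about 3 units too high
```

Because the spike-in sequences are known, every mismatch in such a table can be counted as a sequencing error, which is why no SNP database is needed to mask true variants.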

ORGANISM(S): Homo sapiens

SUBMITTER: Justin Zook

PROVIDER: E-GEOD-36217 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress


Publications

Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing.

Zook JM, Samarov D, McDaniel J, Sen SK, Salit M

PLoS ONE, 2012-07-31, vol. 7


PMID: 22859977


Similar Datasets

Correction of human phospholamban R14del mutation associated with cardiomyopathy using targeted nucleases and combination therapy [Exome-Seq]
2015-02-09 | E-GEOD-65762 | biostudies-arrayexpress

Whole exome sequencing and transcription profiling of a patient cohort of oral cavity squamous cell carcinomas
2018-09-28 | E-MTAB-6448 | biostudies-arrayexpress

Global Reorganization of Chromatin Architecture during Embryonic Stem Cell Differentiation
2015-02-18 | E-GEOD-52457 | biostudies-arrayexpress

Comparative Methylome Analyses Identify Epigenetic Regulatory Loci of Human Brain Evolution [SNP]
2016-08-20 | E-GEOD-85867 | biostudies-arrayexpress

Whole exome sequencing of human MDS/AML OCI-M2 and patient-derived bone marrow cell lines treated with 5-azacytidine
2022-05-02 | E-MTAB-11172 | biostudies-arrayexpress

Circulating Tumor Cell (CTC) Isolation and Genomic Variation Detection from Liquid Samples of Lung Cancer
2017-02-17 | E-MTAB-4948 | biostudies-arrayexpress

Transcriptomic characterization of the human cell cycle in individual unsynchronized cells
2017-11-09 | E-MTAB-6142 | biostudies-arrayexpress

Sequence-Targeted Nucleosome Sliding in vivo - Transcription Profiling
2016-03-12 | E-GEOD-72571 | biostudies-arrayexpress

Single-cell RNA Seq of hematopoietic stem and progenitor cells
2015-08-31 | E-GEOD-64002 | biostudies-arrayexpress

An integrated transcriptome and expressed variant analysis of sepsis survival and death
2015-01-01 | E-GEOD-63042 | biostudies-arrayexpress