Dataset Information

Deep sequencing of synthetic RNA (complex molecular spike-in set)

ABSTRACT: Deep sequencing of synthetic RNA (complex molecular spike-in set)

PROVIDER: PRJEB50953 | ENA

REPOSITORIES: ENA

ACCESS DATA

Dataset's files

ERR8576920.fastq.gz (Fastqsanger.gz)

Similar Datasets

Project description: While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
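
The "units" in the description above are Phred-scaled base quality scores, where Q = -10 * log10(P(error)). As a rough illustration of why spike-ins help: because the spike-in sequences are known, every mismatch in a read mapped to them can be counted as a sequencing error, which gives an empirical quality to compare against the reported one. The sketch below is only a minimal illustration under stated assumptions; the counts are made up, the function names are hypothetical, and the add-one smoothing is an illustrative choice, not a description of GATK's internals.

```python
import math


def phred(error_probability: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def empirical_quality(mismatches: int, bases: int) -> float:
    """Estimate quality from mismatches against a known spike-in sequence.

    Add-one smoothing (an illustrative assumption) keeps the estimate
    finite when no mismatches are observed.
    """
    return phred((mismatches + 1) / (bases + 2))


# Hypothetical counts for bases that the sequencer reported as Q30.
reported_q = 30.0
observed_q = empirical_quality(mismatches=49, bases=9998)

# A positive gap means the reported score overestimates base quality.
gap = reported_q - observed_q
print(f"reported Q{reported_q:.0f}, empirical Q{observed_q:.1f}, gap {gap:.1f}")
```

In this made-up case the bases labelled Q30 behave more like Q23, so a recalibration step would lower their scores by about 7 units.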

Project description: Background. The cell-free methylated DNA immunoprecipitation-sequencing (cfMeDIP-seq) method is adapted to work with low input DNA and with circulating cell-free DNA (cfDNA). This method allows for epigenetic profiling from liquid biopsy samples, providing potential information about tissue of origin. Similar to classical immunoprecipitation based enrichment protocols, interpretation requires a reference or control to draw inference against a composite experimental baseline and against designed standards allowing for cross-experiment comparisons. Methods. To meet the need for a reference control in cfMeDIP-seq experiments, we designed spike-in controls and integrated the use of unique molecular index (UMI) to adjust for polymerase chain reaction (PCR) bias, and immunoprecipitation bias caused by the fragment length, G+C content, and CpG density of the DNA fragments. This enables absolute quantification of methylated DNA in picomoles, while retaining epigenomic information that allows for sensitive, tissue-specific detection as well as comparable results between different experiments. We designed 54 DNA fragments with combinations of methylation status (methylated and unmethylated), fragment length in base pair (bp) (80 bp, 160 bp, 320 bp), G+C content (35%, 50%, 65%), and fraction of CpGs within a fragment (1/80 bp, 1/40 bp, 1/20 bp). We checked spike-in control DNA sequences to ensure they had no cross alignment to the human genome and minimized formation of secondary structures to avoid issues with amplification. We carried out cfMeDIP-seq on either solely spike-in DNA fragments, spike-in DNA added to sheared HCT116 genomic DNA or spike-in DNA added to cfDNA from acute myeloid leukemia (AML) samples to assess technical and biological biases, determine optimal amount of spike-in DNA required for an experiment and to assess batch effects, respectively. Results. We show that cfMeDIP-seq enriches for highly methylated regions, with less than 0.01% non-specific binding and a preference for high G+C content and CpG fraction DNA fragments. The use of 0.01 ng of spike-in control DNA results in sufficient sequencing reads to adjust for variance due to fragment length, G+C content and CpG fraction without negatively impacting the number of sequencing reads generated for each sample. With a known amount of each spike-in control, we generated a generalized linear model that can absolutely quantify molar amount from read counts while adjusting for fragment length, G+C content, and CpG fraction. Using our spike-in controls, we show that we can greatly mitigate batch effects, reducing batch associated variance in the data to ≤5% of the total variance. Conclusions. The incorporation of spike-in controls allows for easier interpretation of data generated from cfMeDIP-seq and MeDIP-seq experiments when compared to relative read count. Through the use of a generalized linear model tailored to each experiment, molar amount for each genomic region can be calculated, greatly mitigating both biological and technical biases in the data. We have created an R package, spiky, to convert read counts to DNA picomoles while adjusting for fragment length, G+C content and CpG fraction.

