Project description:A highly complex set of 264 molecular spikes was designed, based on 11 unique spike sequences spanning different lengths (570 to 3070 nts) and GC contents (40-60%). To evaluate quantification precisely across expression levels, transcript lengths, and GC contents, barcodes of 7 nucleotides in 2-fold abundance steps were cloned into each spike sequence (12 steps in duplicate; 24 barcodes per sequence), creating a standard curve for each spike sequence. To determine the molecular abundance of each of the 264 molecular spike-ins (i.e., the ‘ground truth’), we performed exhaustive sequencing across the spike barcodes and spUMIs and determined the total complexity in the pool to be 76 million unique molecules.
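The arithmetic of the barcode design above can be sketched as follows. This is only an illustration of the stated numbers (11 sequences, 12 two-fold steps, duplicates); the function names are hypothetical, and assigning a relative abundance of 1 to the lowest step is an assumption, not something the description states.

```python
from itertools import product

# Numbers taken from the description; names are illustrative only.
N_SPIKE_SEQUENCES = 11
N_ABUNDANCE_STEPS = 12
N_REPLICATES = 2


def barcode_layout():
    """List (sequence, step, replicate, relative_abundance) for every barcode.

    Assumption: the lowest abundance step is given a relative abundance of 1,
    and each further step doubles it.
    """
    return [
        (seq, step, rep, 2 ** step)
        for seq, step, rep in product(
            range(N_SPIKE_SEQUENCES),
            range(N_ABUNDANCE_STEPS),
            range(N_REPLICATES),
        )
    ]


def dynamic_range():
    """Fold difference between the highest and the lowest abundance step."""
    return 2 ** (N_ABUNDANCE_STEPS - 1)


def expected_fraction(step):
    """Share of one replicate series expected at a given step (ideal 2-fold design)."""
    return 2 ** step / (2 ** N_ABUNDANCE_STEPS - 1)
```

Under these assumptions, `len(barcode_layout())` gives the 264 barcoded spikes, and twelve 2-fold steps span a 2048-fold abundance range within each spike sequence.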
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina.
Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
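The idea behind recalibrating against spike-ins can be sketched in a few lines: because the spike-in sequences are known, every mismatch in a read mapped to them can be counted as a sequencing error, and the empirical error rate per reported quality score can be converted back to the Phred scale (Q = -10 log10 p). The sketch below is not GATK's implementation; the function names, the input format, and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality_by_reported(observations):
    """Map each reported quality score to an empirical quality score.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base
    of a read aligned to a spike-in standard of known sequence (hypothetical
    input format).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        if is_mismatch:
            errors[reported_q] += 1

    table = {}
    for reported_q, n_bases in totals.items():
        # Assumed add-one smoothing keeps error-free bins at a finite quality.
        p_error = (errors[reported_q] + 1) / (n_bases + 2)
        table[reported_q] = error_prob_to_phred(p_error)
    return table
```

For example, if 998 bases reported at Q30 contain 9 mismatches against the spike-in reference, the smoothed error rate is 10/1000 = 0.01, which corresponds to an empirical quality of about Q20: the reported score overestimates quality by roughly 10 units.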
Project description:A benchmark set of bottom-up proteomics data for training deep learning networks. It has data from 51 organisms and includes nearly 1 million peptides.
Project description:Background. The cell-free methylated DNA immunoprecipitation-sequencing (cfMeDIP-seq) method is adapted to work with low-input DNA and with circulating cell-free DNA (cfDNA). This method allows for epigenetic profiling from liquid biopsy samples, providing potential information about tissue of origin. Similar to classical immunoprecipitation-based enrichment protocols, interpretation requires a reference or control to draw inference against a composite experimental baseline and against designed standards allowing for cross-experiment comparisons. Methods. To meet the need for a reference control in cfMeDIP-seq experiments, we designed spike-in controls and integrated the use of unique molecular indexes (UMIs) to adjust for polymerase chain reaction (PCR) bias and for immunoprecipitation bias caused by the fragment length, G+C content, and CpG density of the DNA fragments. This enables absolute quantification of methylated DNA in picomoles, while retaining epigenomic information that allows for sensitive, tissue-specific detection as well as comparable results between different experiments. We designed 54 DNA fragments with combinations of methylation status (methylated and unmethylated), fragment length in base pairs (bp) (80 bp, 160 bp, 320 bp), G+C content (35%, 50%, 65%), and fraction of CpGs within a fragment (1/80 bp, 1/40 bp, 1/20 bp). We checked the spike-in control DNA sequences to ensure they had no cross-alignment to the human genome and minimized formation of secondary structures to avoid issues with amplification. We carried out cfMeDIP-seq on spike-in DNA fragments alone, on spike-in DNA added to sheared HCT116 genomic DNA, and on spike-in DNA added to cfDNA from acute myeloid leukemia (AML) samples to assess technical and biological biases, to determine the optimal amount of spike-in DNA required for an experiment, and to assess batch effects, respectively.
Results. We show that cfMeDIP-seq enriches for highly methylated regions, with less than 0.01% non-specific binding and a preference for DNA fragments with high G+C content and CpG fraction. The use of 0.01 ng of spike-in control DNA results in sufficient sequencing reads to adjust for variance due to fragment length, G+C content, and CpG fraction without negatively impacting the number of sequencing reads generated for each sample. With a known amount of each spike-in control, we generated a generalized linear model that can absolutely quantify molar amount from read counts while adjusting for fragment length, G+C content, and CpG fraction. Using our spike-in controls, we show that we can greatly mitigate batch effects, reducing batch-associated variance in the data to ≤5% of the total variance. Conclusions. The incorporation of spike-in controls allows for easier interpretation of data generated from cfMeDIP-seq and MeDIP-seq experiments when compared to relative read count. Through the use of a generalized linear model tailored to each experiment, the molar amount for each genomic region can be calculated, greatly mitigating both biological and technical biases in the data. We have created an R package, spiky, to convert read counts to DNA picomoles while adjusting for fragment length, G+C content, and CpG fraction.
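The design and the modelling step described above can be illustrated with a small, self-contained sketch. It enumerates the 2 x 3 x 3 x 3 = 54 factor combinations and fits a Gaussian generalized linear model with identity link (equivalent to ordinary least squares) that predicts a known amount from read count, fragment length, G+C content, and CpG fraction. This is not the spiky package's model: the feature scaling, the coefficients, the read counts, and all function names are hypothetical, and the data are synthetic and noise-free.

```python
from itertools import product

# Design factors as listed in the description.
METHYLATION = ("methylated", "unmethylated")
LENGTHS_BP = (80, 160, 320)
GC_PERCENT = (35, 50, 65)
CPG_PER_BP = (1 / 80, 1 / 40, 1 / 20)

# Hypothetical coefficients: intercept, reads, length, G+C, CpG.
TRUE_COEFS = (0.5, 2.0, -0.3, 1.5, 0.1)


def spike_in_design():
    """Enumerate every combination of the four design factors."""
    return list(product(METHYLATION, LENGTHS_BP, GC_PERCENT, CPG_PER_BP))


def predict(coefs, feature):
    """Linear predictor: intercept plus weighted features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature))


def synthetic_dataset(design):
    """Build made-up features and noise-free amounts for each fragment."""
    features = []
    for i, (_, length, gc, cpg) in enumerate(design):
        reads_k = 50 * ((7 * i) % 11 + 1) / 1000  # invented read counts (thousands)
        # Assumed scaling: length in 100 bp, G+C as fraction, CpGs per 100 bp.
        features.append((reads_k, length / 100, gc / 100, cpg * 100))
    amounts = [predict(TRUE_COEFS, f) for f in features]
    return features, amounts


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_gaussian_glm(features, response):
    """Least-squares fit (Gaussian GLM, identity link) via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

A fit on `synthetic_dataset(spike_in_design())` should reproduce the synthetic amounts, since the data are noise-free. With real spike-in read counts the fit would be approximate, and the choice of error family and link function would need to be checked against the data.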
Project description:Flexible regulation of gene expression is essential and highly sought after in synthetic biology and biotechnology. Designing regulators with specific functions remains a challenge due to the limited understanding of specific regulatory mechanisms. We designed and synthesized 23,640 B-cell-specific promoters, following the design-build-test-learn pipeline in synthetic biology. The synthetic promoters exhibit B-cell-specific expression and lead to diverse expression patterns in B-cells. By conducting MPRA testing, we uncovered the factors that shape B-cell-specific promoter strength, including core motifs and motif syntax. Finally, we developed a deep learning model capable of predicting promoter activity directly from the sequence, and used it to predict promoter activity for 26,193 variants identified in the global population, indicating that polymorphisms in IgV gene promoters can influence gene expression. Our work helps to decipher the regulatory code in immunoglobulin genes and offers thousands of non-repetitive promoter elements for B-cell engineering.
Project description:We designed spike-in controls for cfMeDIP-seq experiments. These spike-in controls mitigate batch effects and allow for absolute quantification of cell-free DNA.