Project description:BACKGROUND: Buccal swabs are a very popular method of collecting DNA in clinical and scientific studies because collection is non-invasive. However, contamination of the DNA sample may interfere with analysis. FINDINGS: Here we report the finding of Streptococcus parasanguinis bacterial DNA contamination in human buccal DNA samples, which led to preferential amplification of bacterial sequence with PCR primers designed against human sequence. CONCLUSION: Contamination of buccal-derived DNA with bacterial DNA can be significant, and may influence downstream genetic analysis. One needs to be aware of possible bacterial contamination when interpreting abnormal findings following PCR amplification of buccal swab DNA samples.
Project description:Detecting and estimating DNA sample contamination are important steps to ensure high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information for accurate estimation of contamination rates. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical or even impossible for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. On the other hand, incorrectly specified allele frequencies may result in substantial bias in estimated contamination rates. For example, we observed that existing methods often fail to identify samples with 10% contamination at a typical 3% contamination exclusion threshold when genetic ancestry is misspecified. Such an incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used for estimating genetic ancestries, similar to LASER or TRACE, but simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, while our method corrects for such biases.
Project description:Array genotyping is a cost-effective and widely used tool that enables assessment of up to millions of genetic markers in hundreds of thousands of individuals. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. Contaminated samples can lead to genotyping errors and consequently cause false positive signals or reduce power of association analyses. Here, we propose a new method to identify contaminated samples and the sources of contamination within a genotyping batch. Through analysis of array intensity and genotype data from intentionally mixed samples and 22,366 samples of the Michigan Genomics Initiative, an ongoing biobank-based study, we show that our method can reliably estimate contamination. We also show that identifying sources of contamination can implicate problematic sample processing steps and guide process improvements. Compared to existing methods, our approach can estimate the proportion of contaminating DNA more accurately, eliminate the need for external databases of allele frequencies, and provide contamination estimates that are more robust to the ancestral origin of the contaminating sample.
Project description:DNA sample contamination is a frequent problem in DNA sequencing studies and can result in genotyping errors and reduced power for association testing. We recently described methods to identify within-species DNA sample contamination based on sequencing read data, showed that our methods can reliably detect and estimate contamination levels as low as 1%, and suggested strategies to identify and remove contaminated samples from sequencing studies. Here we propose methods to model contamination during genotype calling as an alternative to removal of contaminated samples from further analyses. We compare our contamination-adjusted calls to calls that ignore contamination and to calls based on uncontaminated data. We demonstrate that, for moderate contamination levels (5%-20%), contamination-adjusted calls eliminate 48%-77% of the genotyping errors. For lower levels of contamination, our contamination correction methods produce genotypes nearly as accurate as those based on uncontaminated data. Our contamination correction methods are useful generally, but are particularly helpful for sample contamination levels from 2% to 20%.
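The idea of modeling contamination during genotype calling can be illustrated with a small likelihood sketch. This is not the authors' implementation. It assumes a simple two-source mixture in which a fraction `alpha` of reads comes from a contaminating individual whose alleles follow a population alternate-allele frequency `freq`, a uniform per-base error rate `err`, and no genotype prior. All function and parameter names are hypothetical.

```python
import math


def alt_read_prob(genotype, alpha, freq, err):
    """Probability that a single read shows the alternate allele.

    genotype: alternate-allele count (0, 1, or 2) of the intended sample.
    alpha:    assumed fraction of reads from the contaminating sample.
    freq:     assumed population alternate-allele frequency.
    err:      assumed per-base sequencing error rate (0 < err < 0.5).
    """
    # Chance that the sequenced molecule truly carries the alternate allele.
    true_alt = (1 - alpha) * genotype / 2 + alpha * freq
    # Apply symmetric sequencing error to the observed base.
    return true_alt * (1 - err) + (1 - true_alt) * err


def genotype_log_likelihoods(n_alt, n_total, alpha, freq, err=0.01):
    """Binomial log-likelihood (up to a constant) for genotypes 0, 1, 2."""
    lls = []
    for genotype in (0, 1, 2):
        p = alt_read_prob(genotype, alpha, freq, err)
        lls.append(n_alt * math.log(p) + (n_total - n_alt) * math.log(1 - p))
    return lls


def call_genotype(n_alt, n_total, alpha=0.0, freq=0.5, err=0.01):
    """Maximum-likelihood genotype; alpha=0 ignores contamination."""
    lls = genotype_log_likelihoods(n_alt, n_total, alpha, freq, err)
    return max((0, 1, 2), key=lambda g: lls[g])
```

Under these assumptions, with 8 alternate reads out of 40, `call_genotype(8, 40)` picks the heterozygous genotype, whereas `call_genotype(8, 40, alpha=0.2, freq=0.5)` picks homozygous reference, because the alternate reads are better explained by the assumed 20% contamination.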
Project description:Human lungs harbor a scarce microbial community, so methods are needed that enhance the recovery of nucleic acids from bacteria and fungi and thereby allow a more efficient analysis of the lung tissue microbiota. Here we describe five extraction protocols including pre-treatment, bead-beating and/or Phenol:Chloroform:Isoamyl alcohol steps, applied to lung tissue samples from autopsied individuals. The resulting total DNA yield and quality, bacterial and fungal DNA amount and the microbial community structure were analyzed by qPCR and Illumina sequencing of bacterial 16S rRNA and fungal ITS genes. Bioinformatic modeling revealed that a large part of the microbiome from lung tissue is composed of microbial contaminants, although our controls clustered separately from biological samples. After removal of contaminant sequences, the effects of extraction protocols on the microbiota were assessed. The major differences among samples could be attributed to inter-individual variations rather than DNA extraction protocols. However, inclusion of the bead-beater and Phenol:Chloroform:Isoamyl alcohol steps resulted in changes in the relative abundance of some bacterial/fungal taxa. Furthermore, inclusion of a pre-treatment step increased microbial DNA concentration but not diversity, and it may help eliminate DNA fragments from dead microorganisms in lung tissue samples, bringing the microbial profile closer to the actual one.
Project description:Background: In 2006, a novel gammaretrovirus, XMRV (xenotropic murine leukemia virus-related virus), was discovered in some prostate tumors. A more recent study indicated that this infectious retrovirus can be detected in 67% of patients suffering from chronic fatigue syndrome (CFS), but in only very few healthy controls (4%). However, several groups have since reported that they could not identify XMRV RNA or DNA sequences in other cohorts of CFS patients, while another group detected murine leukemia virus (MLV)-like sequences in 87% of such patients, but in only 7% of healthy controls. Since there is a high degree of similarity between XMRV and abundant endogenous MLV proviruses, it is important to distinguish contaminating mouse sequences from true infections. Results: DNA from the peripheral blood of 112 CFS patients and 36 healthy controls was tested for XMRV with two different PCR assays. A TaqMan qPCR assay specific for XMRV pol sequences was able to detect viral DNA from 2 XMRV-infected cells (~10-12 pg DNA) in up to 5 μg of human genomic DNA, but yielded negative results for all samples when 600 ng of genomic DNA (from 100,000 peripheral blood cells) was tested. However, positive results were obtained with some of these samples using a less specific nested PCR assay for a different XMRV sequence. DNA sequencing of the PCR products revealed a wide variety of virus-related sequences, some identical to those found in prostate cancer and CFS patients, others more closely related to known endogenous MLVs. However, all samples that tested positive for XMRV and/or MLV DNA were also positive for the highly abundant intracisternal A-type particle (IAP) long terminal repeat, and most were positive for murine mitochondrial cytochrome oxidase sequences. No contamination was observed in any of the negative control samples, including those with no DNA template, which were included in each assay. Conclusions: Mouse cells contain upwards of 100 copies each of endogenous MLV DNA. Even much less than one cell's worth of DNA can yield a detectable product using highly sensitive PCR technology. It is, therefore, vital that contamination by mouse DNA be monitored with adequately sensitive assays in all samples tested.
Project description:The global spread and continued evolution of SARS-CoV-2 has driven an unprecedented surge in viral genomic surveillance. Amplicon-based sequencing methods provide a sensitive, low-cost and rapid approach but suffer a high potential for contamination, which can undermine laboratory processes and results. This challenge will increase with the expanding global production of sequences across a variety of laboratories for epidemiological and clinical interpretation, as well as for genomic surveillance of emerging diseases in future outbreaks. We present SDSI + AmpSeq, an approach that uses 96 synthetic DNA spike-ins (SDSIs) to track samples and detect inter-sample contamination throughout the sequencing workflow. We apply SDSIs to the ARTIC Consortium's amplicon design, demonstrate their utility and efficiency in a real-time investigation of a suspected hospital cluster of SARS-CoV-2 cases and validate them across 6,676 diagnostic samples at multiple laboratories. We establish that SDSI + AmpSeq provides increased confidence in genomic data by detecting and correcting for relatively common, yet previously unobserved modes of error, including spillover and sample swaps, without impacting genome recovery.
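The spike-in bookkeeping described above can be sketched as a per-sample check of which spike-in reads were recovered. This is a minimal illustration, not the SDSI + AmpSeq pipeline itself: the function name, the input format (a dictionary of read counts per spike-in), and the 1% reporting threshold are all assumptions made for the example.

```python
def check_sample(expected_sdsi, sdsi_counts, min_fraction=0.01):
    """Classify one sample from its spike-in read counts.

    expected_sdsi: name of the spike-in that was added to this sample.
    sdsi_counts:   dict mapping spike-in name -> reads assigned to it.
    min_fraction:  hypothetical threshold; other spike-ins at or above
                   this share of spike-in reads are reported.
    """
    total = sum(sdsi_counts.values())
    if total == 0:
        # No spike-in reads at all: nothing can be concluded.
        return {"status": "no_sdsi_reads", "unexpected": []}

    # Spike-ins other than the expected one that exceed the threshold.
    unexpected = sorted(
        name
        for name, count in sdsi_counts.items()
        if name != expected_sdsi and count / total >= min_fraction
    )

    # The dominant spike-in should be the one that was added.
    dominant = max(sdsi_counts, key=sdsi_counts.get)
    if dominant != expected_sdsi:
        status = "possible_swap"
    elif unexpected:
        status = "possible_contamination"
    else:
        status = "ok"
    return {"status": status, "unexpected": unexpected}
```

In this sketch, a sample whose dominant spike-in differs from the one it received is flagged as a possible swap, and a sample whose own spike-in dominates but which also carries reads from another spike-in above the threshold is flagged as possible spillover from that sample.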
Project description:Eukaryotic Argonautes bind small RNAs and use them as guides to find complementary RNA targets and induce gene silencing. Though homologs of eukaryotic Argonautes are present in many bacteria and archaea, their small RNA partners and functions are unknown. We found that the Argonaute of Rhodobacter sphaeroides (RsAgo) associates with 15-19 nt RNAs that correspond to the majority of transcripts. RsAgo also binds single-stranded 22-24 nt DNA molecules that are complementary to the small RNAs and enriched in sequences derived from exogenous plasmids as well as genome-encoded foreign nucleic acids such as transposons and phage genes. Expression of RsAgo in the heterologous E. coli system leads to formation of plasmid-derived small RNA and DNA and plasmid degradation. In a R. sphaeroides mutant lacking RsAgo, expression of plasmid-encoded genes is elevated. Our results indicate that RNAi-related processes found in eukaryotes are also conserved in bacteria and target foreign nucleic acids.
Project description:DNA sample contamination is a serious problem in DNA sequencing studies and may result in systematic genotype misclassification and false positive associations. Although methods exist to detect and filter out cross-species contamination, few methods to detect within-species sample contamination are available. In this paper, we describe methods to identify within-species DNA sample contamination based on (1) a combination of sequencing reads and array-based genotype data, (2) sequence reads alone, and (3) array-based genotype data alone. Analysis of sequencing reads allows contamination detection after sequence data is generated but prior to variant calling; analysis of array-based genotype data allows contamination detection prior to generation of costly sequence data. Through a combination of analysis of in silico and experimentally contaminated samples, we show that our methods can reliably detect and estimate levels of contamination as low as 1%. We evaluate the impact of DNA contamination on genotype accuracy and propose effective strategies to screen for and prevent DNA contamination in sequencing studies.
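To illustrate how sequencing reads combined with known genotypes can reveal within-species contamination, the sketch below estimates a contamination fraction by maximum likelihood over a grid. It is a simplified model and not the authors' method: it assumes only sites where the intended sample is known to be homozygous reference (for example from array genotypes), a single contaminating source whose alleles follow the population alternate-allele frequency, and a uniform sequencing error rate. All names and defaults are hypothetical.

```python
import math


def alt_prob_at_hom_ref(alpha, freq, err):
    """P(read shows the alternate allele) at a homozygous-reference site."""
    own = err  # intended sample: alternate base only through error
    contaminant = freq * (1 - err) + (1 - freq) * err
    return (1 - alpha) * own + alpha * contaminant


def log_likelihood(alpha, sites, err=0.01):
    """Sum of binomial log-likelihoods (up to a constant) over sites.

    sites: iterable of (freq, n_alt, n_total) tuples, one per site.
    """
    total = 0.0
    for freq, n_alt, n_total in sites:
        p = alt_prob_at_hom_ref(alpha, freq, err)
        total += n_alt * math.log(p) + (n_total - n_alt) * math.log(1 - p)
    return total


def estimate_contamination(sites, err=0.01, steps=500):
    """Grid search for alpha in [0, 0.5] that maximizes the likelihood."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda alpha: log_likelihood(alpha, sites, err))
```

A grid search is used here only to keep the example dependency-free; a numerical optimizer would serve equally well. The estimate is only as good as the supplied allele frequencies and error rate.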