Dataset Information

A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data.

ABSTRACT:

Background

Researchers typically sequence a given individual multiple times, either re-sequencing the same DNA sample (technical replication), sequencing different DNA samples collected from the same individual (biological replication), or both. Before merging the data from these replicate sequence runs, it is important to verify that no errors, such as DNA contamination or sample mix-ups, occurred during the data collection pipeline. Methods to detect such errors exist, but they are often ad hoc and cannot handle missing data, and several require phased data. Because they require some combination of genotype calling, imputation, and haplotype phasing, these methods are unsuitable for error detection in low- to moderate-depth sequence data, where such tasks are difficult to perform accurately. Additionally, because most existing methods compare samples pairwise rather than jointly analyzing all putative replicates, their results may be difficult to interpret.

Results

We introduce a new method for error detection suitable for shallow-, moderate-, and high-depth sequence data. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates and infer which of the samples originated from an identical genotypic source.
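
The idea of a posterior over relations among putative replicates can be illustrated with a small sketch. This is not the BIGRED implementation: it is a toy model written for illustration, and every name in it (`set_partitions`, `group_loglik`, `posterior_over_relations`) is hypothetical. It assumes biallelic sites with known alternate-allele frequencies, Hardy-Weinberg genotype priors, a fixed per-read error rate, unrelated genotypic sources, and a uniform prior over relations. Each relation is represented as a partition of the samples into groups that share one genotypic source. Genotypes are summed out rather than called, so a sample with zero reads at a site simply contributes no information.

```python
import math

def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def read_loglik(alt, total, genotype, err):
    """Binomial log-likelihood of `alt` alternate reads out of `total`."""
    p_alt = (err, 0.5, 1.0 - err)[genotype]
    return (math.log(math.comb(total, alt))
            + alt * math.log(p_alt)
            + (total - alt) * math.log(1.0 - p_alt))

def group_loglik(group, site, freq, err):
    """Log-likelihood of one site for samples sharing a single genotype,
    with the unknown genotype summed out under a Hardy-Weinberg prior."""
    prior = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
    terms = []
    for g in range(3):
        ll = math.log(prior[g])
        for s in group:
            alt, total = site[s]
            ll += read_loglik(alt, total, g, err)
        terms.append(ll)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def posterior_over_relations(counts, freqs, err=0.01):
    """counts[site][sample] = (alt_reads, total_reads); freqs[site] in (0, 1).
    Returns {partition: posterior probability} under a uniform prior."""
    n_samples = len(counts[0])
    partitions = list(set_partitions(list(range(n_samples))))
    logliks = []
    for part in partitions:
        ll = 0.0
        for site, freq in zip(counts, freqs):
            for group in part:
                ll += group_loglik(group, site, freq, err)
        logliks.append(ll)
    m = max(logliks)
    weights = [math.exp(ll - m) for ll in logliks]
    z = sum(weights)
    return {
        tuple(sorted(tuple(sorted(g)) for g in part)): w / z
        for part, w in zip(partitions, weights)
    }

# Toy data: samples 0 and 1 show only alternate reads, sample 2 only reference reads.
counts = [[(2, 2), (3, 3), (0, 2)]] * 10
posterior = posterior_over_relations(counts, freqs=[0.5] * 10)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 4))
```

Because all putative replicates are scored jointly, the output is a single probability for each possible grouping rather than a collection of pairwise comparisons. The number of partitions grows quickly with the number of samples, so this brute-force enumeration is only practical for a handful of replicates.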

Conclusions

Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments. Our method is implemented as an R package called BIGRED (Bayes Inferred Genotype Replicate Error Detector), which is freely available for download: https://github.com/ac2278/BIGRED .

SUBMITTER: Chan AW

PROVIDER: S-EPMC6292093 | biostudies-literature | 2018 Dec

REPOSITORIES: biostudies-literature

Publications

A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data.

Chan AW, Williams AL, Jannink JL

BMC Bioinformatics, 2018-12-12, issue 1

PMID: 30541436

Similar Datasets

reGenotyper: Detecting mislabeled samples in genetic data. | S-EPMC5305221 | biostudies-literature

Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO. | S-EPMC5984806 | biostudies-literature

A statistical method for detecting genomic aberrations in heterogeneous tumour samples from single nucleotide polymorphism genotyping data | 2010-08-25 | GSE23785 | GEO

Statistical methods for detecting differentially abundant features in clinical metagenomic samples. | S-EPMC2661018 | biostudies-literature

PB-DiffHiC: a statistical framework for detecting differential chromatin interactions from high resolution pseudo-bulk Hi-C data. | S-EPMC12512566 | biostudies-literature

A statistical framework for detecting therapy-induced resistance from drug screens. | S-EPMC12328638 | biostudies-literature

Hypatia: a statistical framework for single-cell RNA isoform data analysis | 2026-03-05 | GSE310974 | GEO

Missing data: A statistical framework for practice. | S-EPMC7615108 | biostudies-literature

Statistical dynamical model to predict extreme events and anomalous features in shallow water waves with abrupt depth change. | S-EPMC6410832 | biostudies-literature

A statistical method for detecting genomic aberrations in heterogeneous tumour samples from single nucleotide polymorphism genotyping data | 2010-08-25 | E-GEOD-23785 | biostudies-arrayexpress