Dataset Information

Cleaning Genotype Data from Diversity Outbred Mice.

ABSTRACT: Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.
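Two of the sample-level diagnostics named in the abstract, the proportion of missing genotypes per mouse and the proportion of matching SNP genotypes between pairs of mice, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The genotype coding ("AA", "AB", "BB", with None for a missing call), the toy data, the sample names, and both cutoff values are assumptions made only for illustration.

```python
from itertools import combinations

MISSING = None  # assumed code for a missing genotype call


def missing_proportion(calls):
    """Proportion of markers with no genotype call for one sample."""
    return sum(c is MISSING for c in calls) / len(calls)


def matching_proportion(a, b):
    """Proportion of matching genotypes, over markers called in both samples."""
    both = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not both:
        return float("nan")  # no shared calls, so nothing to compare
    return sum(x == y for x, y in both) / len(both)


# Toy data: four hypothetical samples typed at six hypothetical markers.
geno = {
    "mouse1": ["AA", "AB", "BB", "AB", "AA", "BB"],
    "mouse2": ["AA", "AB", "BB", "AB", "AA", "BB"],  # identical to mouse1
    "mouse3": ["BB", "AA", "AB", None, None, None],  # half the calls missing
    "mouse4": ["AB", "AB", "AA", "BB", "AA", "AB"],
}

# Illustrative cutoffs only; real thresholds should come from the data.
MAX_MISSING = 0.10
MIN_DUPLICATE_MATCH = 0.90

missing = {name: missing_proportion(calls) for name, calls in geno.items()}
low_quality = [name for name, p in missing.items() if p > MAX_MISSING]

matches = {
    (s1, s2): matching_proportion(geno[s1], geno[s2])
    for s1, s2 in combinations(geno, 2)
}
duplicates = [pair for pair, p in matches.items() if p > MIN_DUPLICATE_MATCH]

print("missing proportion:", missing)
print("flagged as low quality:", low_quality)
print("candidate duplicates:", duplicates)
```

In practice these quantities are inspected visually across all samples, as the abstract describes, rather than cut at a fixed threshold.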

SUBMITTER: Broman KW

PROVIDER: S-EPMC6505173 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature

Publications

Cleaning Genotype Data from Diversity Outbred Mice.

Broman KW, Gatti DM, Svenson KL, Sen Ś, Churchill GA

G3 (Bethesda, Md.), 2019 May 7, issue 5

PMID: 30877082

Similar Datasets

Project description: More humans have died of tuberculosis (TB) than of any other infectious disease, and millions still die each year. Experts advocate for blood-based, serum protein biomarkers to help diagnose TB, which afflicts millions of people in high-burden countries. However, the protein biomarker pipeline is small. Here, we used the Diversity Outbred (DO) mouse population to address this gap, identifying five protein biomarker candidates. One protein biomarker, serum CXCL1, met the World Health Organization's Targeted Product Profile for a triage test to distinguish active TB from latent M.tb infection (LTBI), non-TB lung disease, and normal sera in HIV-negative adults from South Africa and Vietnam. To find the biomarker candidates, we quantified seven immune cytokines and four inflammatory proteins corresponding to highly expressed genes unique to progressor DO mice. Next, we applied statistical and machine learning methods to the data, i.e., 11 proteins in lungs from 453 infected and 29 non-infected mice. After searching all combinations of five algorithms and 239 protein subsets, then validating and testing the findings on independent data, we found that two combinations accurately diagnosed progressor DO mice: Logistic Regression using MMP8, and Gradient Tree Boosting using a panel of four proteins (CXCL1, CXCL2, TNF, IL-10). Of those five protein biomarker candidates, two (MMP8 and CXCL1) were crucial for classifying DO mice, were above the limit of detection in most human serum samples, and had not been widely assessed for diagnostic performance in humans before. In patient sera, CXCL1 exceeded the triage diagnostic test criteria (>90% sensitivity; >70% specificity), while MMP8 did not. Using Area Under the Curve analyses, CXCL1 averaged 94.5% sensitivity and 88.8% specificity for active pulmonary TB (ATB) vs LTBI; 90.9% sensitivity and 71.4% specificity for ATB vs non-TB; and 100.0% sensitivity and 98.4% specificity for ATB vs normal sera.
Overall, our findings show that the DO mouse population can be used to discover diagnostic-quality serum protein biomarkers of human TB.
