Dataset Information

SMaSH: Sample matching using SNPs in humans.

ABSTRACT: BACKGROUND: Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies. While matches between samples from the same individual can in principle be identified from a few well characterized single nucleotide polymorphisms (SNPs), omics data types often only provide low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine if two samples derive from the same individual or not. METHODS: We select about six thousand SNPs in the human genome and develop a Bayesian framework that is able to robustly identify sample matches between next generation sequencing data sets. RESULTS: We validate our approach on a variety of data sets. Most importantly, we show that our approach can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low quality RNA-Seq sample are still sufficient for reliable sample identification. CONCLUSION: Our tool, SMaSH, is able to identify sample mismatches in next generation sequencing data sets between different sequencing modalities and for low quality sequencing data.
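
The idea of integrating evidence from many low-coverage SNPs can be illustrated with a small sketch. This is not the SMaSH model itself, which the abstract does not specify; it is a generic Bayesian comparison under stated assumptions: Hardy-Weinberg genotype priors, binomially distributed reads with a fixed error rate, and unrelated individuals under the "different" hypothesis. All function names and the `err` and `prior_same` parameters are hypothetical.

```python
import math

def genotype_priors(p):
    """Hardy-Weinberg priors for genotypes with 0, 1, 2 alt alleles (assumption)."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

def read_likelihoods(alt, total, err=0.01):
    """P(alt reads out of total | genotype), binomial with sequencing error rate err."""
    alt_probs = [err, 0.5, 1 - err]
    return [math.comb(total, alt) * q ** alt * (1 - q) ** (total - alt)
            for q in alt_probs]

def log_bayes_factor(snps, err=0.01):
    """Sum over SNPs of log[P(data | same individual) / P(data | different)].

    snps: iterable of (alt_freq, alt1, n1, alt2, n2) read counts for two samples.
    """
    total = 0.0
    for p, a1, n1, a2, n2 in snps:
        prior = genotype_priors(p)
        l1 = read_likelihoods(a1, n1, err)
        l2 = read_likelihoods(a2, n2, err)
        # Same individual: both samples share one unknown genotype.
        same = sum(pr * x * y for pr, x, y in zip(prior, l1, l2))
        # Different individuals: genotypes are independent draws from the prior.
        diff = (sum(pr * x for pr, x in zip(prior, l1))
                * sum(pr * y for pr, y in zip(prior, l2)))
        total += math.log(same / diff)
    return total

def posterior_same(log_bf, prior_same=0.5):
    """Posterior probability of a match, computed stably from the log odds."""
    log_odds = log_bf + math.log(prior_same / (1 - prior_same))
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)

# Toy data: 50 SNPs at 3x coverage, agreeing or disagreeing at every site.
matching = [(0.3, 3, 3, 3, 3)] * 50
mismatching = [(0.3, 3, 3, 0, 3)] * 50
print(posterior_same(log_bayes_factor(matching)))
print(posterior_same(log_bayes_factor(mismatching)))
```

Each SNP contributes only a small amount of evidence at low coverage, and a SNP with no reads contributes none, which is why summing over thousands of sites matters in this kind of scheme.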

SUBMITTER: Westphal M

PROVIDER: S-EPMC6936078 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature

Publications

SMaSH: Sample matching using SNPs in humans.

Maximillian Westphal, David Frankhouser, Carmine Sonzone, Peter G Shields, Pearlly Yan, Ralf Bundschuh

BMC Genomics, 2019-12-30, Suppl 12

PMID: 31888490

Similar Datasets

Project description: Background: Histological assessment of skeletal muscle tissue is commonly applied to many areas of skeletal muscle physiological research. Histological parameters including fiber distribution, fiber type, centrally nucleated fibers, and capillary density are all frequently quantified measures of skeletal muscle. These parameters reflect functional properties of muscle and undergo adaptation in many muscle diseases and injuries. While standard operating procedures have been developed to guide analysis of many of these parameters, the software to freely, efficiently, and consistently analyze them is not readily available. In order to provide this service to the muscle research community we developed an open source MATLAB script to analyze immunofluorescent muscle sections incorporating user controls for muscle histological analysis. Results: The software consists of multiple functions designed to provide tools for the analysis selected. Initial segmentation and fiber filter functions segment the image and remove non-fiber elements based on user-defined parameters to create a fiber mask. Establishing parameters set by the user, the software outputs data on fiber size and type, centrally nucleated fibers, and other structures. These functions were evaluated on stained soleus muscle sections from 1-year-old wild-type and mdx mice, a model of Duchenne muscular dystrophy. In accordance with previously published data, fiber size was not different between groups, but mdx muscles had much higher fiber size variability. The mdx muscle had a significantly greater proportion of type I fibers, but type I fibers did not change in size relative to type II fibers. Centrally nucleated fibers were highly prevalent in mdx muscle and were significantly larger than peripherally nucleated fibers. Conclusions: The MATLAB code described and provided along with this manuscript is designed for image processing of skeletal muscle immunofluorescent histological sections. The program allows for semi-automated fiber detection along with user correction. The output of the code provides data in accordance with established standards of practice. The results of the program have been validated using a small set of wild-type and mdx muscle sections. This program is the first freely available and open source image processing program designed to automate analysis of skeletal muscle histological sections.

Project description: Background: High-throughput methods that ascribe a cellular or physiological function for each gene product are useful to understand the roles of genes that have not been extensively characterized by molecular or genetic approaches. One method to infer gene function is "guilt-by-association", in which the expression pattern of a poorly characterized gene is shown to co-vary with the expression of better-characterized genes. The function of the poorly characterized gene is inferred from the known function(s) of the well-described genes. For example, genes co-expressed with transcripts that vary during the cell cycle, development, environmental stresses, and with oncogenesis have been implicated in those processes. Findings: While examining the expression characteristics of several poorly characterized genes, we noted that we could associate each of the genes with a cellular phenotype by correlating individual gene expression changes with gene set enrichment scores from individual samples. We evaluated the effectiveness of this approach using a modest sized gene expression data set (expO) and a compendium of gene expression phenotypes (MSigDBv3.0). We found the transcripts that correlated best with enrichment in mitochondrial and lysosomal gene sets were mostly related to those processes (89/100 and 44/50, respectively). The reciprocal evaluation, ranking gene sets according to correlation of enrichment with an individual gene's expression, also reflected known associations for prominent genes in the biomedical literature (16/19). In evaluating the model, we also found that 4% of the genome encodes proteins that are associated with small molecule and small peptide signal transduction gene sets, implicating a large number of genes in both internal and external environmental sensing. Conclusions: Our results show that this approach is useful to infer functions of disparate sets of genes. This method mirrors the biological experimental approaches used by others to associate individual genes with defined gene expression changes. Moreover, the approach can be used beyond discovering genes related to a cellular process to discover meaningful expression phenotypes from a compendium that are associated with a given gene. The effectiveness, versatility, and breadth of this approach make possible its application in a variety of contexts and with a variety of downstream analyses.

Project description: Collection of accurate and representative data from agricultural fields is required for efficient crop management. Since growers have limited available resources, there is a need for advanced methods to select representative points within a field in order to best satisfy sampling or sensing objectives. The main purpose of this work was to develop a data-driven method for selecting locations across an agricultural field given observations of some covariates at every point in the field. These chosen locations should be representative of the distribution of the covariates in the entire population and represent the spatial variability in the field. They can then be used to sample an unknown target feature whose sampling is expensive and cannot be realistically done at the population scale. An algorithm for determining these optimal sampling locations, namely the multifunctional matching (MFM) criterion, was based on matching of moments (functionals) between sample and population. The selected functionals in this study were standard deviation, mean, and Kendall's tau. An additional algorithm defined the minimal number of observations that could represent the population according to a desired level of accuracy. The MFM was applied to datasets from two agricultural plots: a vineyard and a peach orchard. The data from the plots included measured values of slope, topographic wetness index, normalized difference vegetation index, and apparent soil electrical conductivity. The MFM algorithm selected the number of sampling points according to a representation accuracy of 90% and determined the optimal location of these points. The algorithm was validated against values of vine or tree water status measured as crop water stress index (CWSI). Algorithm performance was then compared to two other sampling methods: the conditioned Latin hypercube sampling (cLHS) model and a uniform random sample with spatial constraints.
Comparison among sampling methods was based on measures of similarity between the target variable population distribution and the distribution of the selected sample. MFM represented CWSI distribution better than the cLHS and the uniform random sampling, and the selected locations showed smaller deviations from the mean and standard deviation of the entire population. The MFM functioned better in the vineyard, where spatial variability was larger than in the orchard. In both plots, the spatial pattern of the selected samples captured the spatial variability of CWSI. MFM can be adjusted and applied using other moments/functionals and may be adopted by other disciplines, particularly in cases where small sample sizes are desired.

Project description: Gene expression patterns in the brain are strongly influenced by the severity of physiological stress at death. This agonal effect, if not well controlled, can lead to spurious findings in case-control comparisons. While many recent studies match samples by tissue pH and clinically recorded agonal conditions, we found that these commonly used indicators were sometimes at odds with observed stress-related patterns of gene expression, and that matching by these criteria still sometimes results in identifying differences between cases and controls that are primarily driven by residual agonal effects. This problem is analogous to the one in genetic studies, where race and ethnicity are often imprecise proxies for complex environmental and genetic factors. We developed an Agonal Stress Rating (ASR) system that evaluates each sample's degree of stress based on gene expression data, and used ASRs in post hoc sample matching or covariate analysis. While we found that gene expression patterns are generally correlated across different regions of the same brain, we also found strong region-region differences in empirical ASRs in many subjects that are likely due to inter-individual variabilities in local structure or function, resulting in region-specific vulnerability to agonal stress. Variation of agonal stress from one region of the brain to another differs between individuals, revealing a new level of complexity for gene expression studies of brain tissues. The Agonal Stress Ratings provide a direct assessment of the regulatory responses to agonal stress in individual samples, and allow a strong control of this important confounder. Our strategy is analogous to sample matching by inferred ancestral proportions in genetic association studies to control subtle confounding by ancestry.
Keywords: Agonal Stress Rating comparison. We examined the relationship between the Agonal Stress Ratings (ASRs) and conventional pre hoc indicators such as pH and clinically derived Agonal Factor Scores (AFS), compared the stress ratings across six brain regions in up to 126 samples, and assessed the performance of different sample matching strategies.
