Dataset Information

CLARITY: comparing heterogeneous data using dissimilarity.

ABSTRACT: Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a 'structural' component analogous to a clustering, and an underlying 'relationship' between those structures. This allows a 'structural comparison' between two similarity matrices using their predictability from 'structure'. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY.
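The decomposition described in the abstract can be illustrated with a deliberately simplified sketch. This is not the CLARITY R package or its API. It assumes the 'structural' component is a hard cluster assignment (here the hypothetical `labels`, imagined as learned from a reference dataset) and the 'relationship' is the matrix of block means, which is the least-squares fit under that assumption. The function names and toy matrices are invented for illustration.

```python
def fit_relationship(Y, labels, k):
    """Fit the 'relationship' X for a fixed hard clustering (assumed structure).

    X[a][b] is the mean of Y over all pairs (i, j) with i in cluster a and
    j in cluster b, i.e. the least-squares fit when each entity belongs to
    exactly one cluster. Diagonal entries of Y are included for simplicity.
    """
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    for i, a in enumerate(labels):
        for j, b in enumerate(labels):
            sums[a][b] += Y[i][j]
            counts[a][b] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(k)] for a in range(k)]


def residual_per_entity(Y, labels, X):
    """Sum of squared errors in each entity's row when Y is predicted from structure."""
    return [sum((Y[i][j] - X[a][labels[j]]) ** 2 for j in range(len(labels)))
            for i, a in enumerate(labels)]


# Hypothetical structure: entities 0,1 in cluster 0 and entities 2,3 in cluster 1.
labels = [0, 0, 1, 1]

# Toy second dataset on a different scale but with the same structure.
Y_rescaled = [[5, 5, 3, 3],
              [5, 5, 3, 3],
              [3, 3, 5, 5],
              [3, 3, 5, 5]]

# Toy second dataset in which entity 3 now resembles cluster 0.
Y_changed = [[5, 5, 3, 5],
             [5, 5, 3, 5],
             [3, 3, 5, 3],
             [5, 5, 3, 5]]

X_b = fit_relationship(Y_rescaled, labels, 2)
resid_consistent = residual_per_entity(Y_rescaled, labels, X_b)

X_c = fit_relationship(Y_changed, labels, 2)
resid_changed = residual_per_entity(Y_changed, labels, X_c)

print(resid_consistent)
print(resid_changed)
```

In this toy setting, a second dataset that differs only in scale is predicted exactly from the reference structure, while a dataset in which an entity changes its affinities leaves non-zero residuals. Judging whether such residuals are significant would need a resampling step, which this sketch omits.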

SUBMITTER: Lawson DJ

PROVIDER: S-EPMC8652278 | biostudies-literature | 2021 Dec

REPOSITORIES: biostudies-literature


Publications

CLARITY: comparing heterogeneous data using dissimilarity.

Lawson DJ, Solanki V, Yanovich I, Dellert J, Ruck D, Endicott P

Royal Society Open Science, 2021-12-08, issue 12


PMID: 34909208

Similar Datasets

Project description:
Background: Metagenomics sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for further development.
Results: According to previous studies, relative sequence compositions are similar across different regions of the same genome, but they differ between distinct genomes. Generally, current tools have used the normalized frequency of k-tuples directly, but this represents an absolute, not relative, sequence composition. Therefore, we attempted to model contigs using relative k-tuple composition, followed by measuring dissimilarity between contigs using [Formula: see text]. The [Formula: see text] was designed to measure the dissimilarity between two long sequences or Next-Generation Sequencing data with the Markov models of the background genomes. This method was effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. With many binning tools available, we do not try to bin contigs from scratch. Instead, we developed [Formula: see text] to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of [Formula: see text], five widely used binning tools with different strategies of sequence composition or the hybrid of sequence composition and abundance were selected to bin six synthetic and real datasets, after which [Formula: see text] was applied to adjust the binning results. Our experiments showed that [Formula: see text] consistently achieves the best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. Using the metrics of recall, precision and ARI (Adjusted Rand Index), [Formula: see text] improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). The [Formula: see text] is available at https://github.com/kunWangkun/d2SBin.
Conclusions: Experiments showed that [Formula: see text] accurately measures the dissimilarity between contigs of metagenomic reads and that relative sequence composition is more reasonable to bin the contigs. The [Formula: see text] can be applied to any existing contig-binning tools for single metagenomic samples to obtain better binning results.

Project description: Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts such as differences in sequencing depth. This poses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation-based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets, which may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using a Poisson-based dissimilarity, a rank-correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure. We report several measures that are satisfactory; the choice of a particular measure may depend on its availability in the preferred software pipeline.

Project description: Quantifying differences in species composition among communities provides important information related to the distribution, conservation and management of biodiversity, especially when two components are recognized: dissimilarity due to turnover, and dissimilarity due to richness differences. The ecoregions in central Mexico, within the Mexican Transition Zone, have outstanding environmental heterogeneity and harbor huge biological richness, besides differences in the origin of the biota. Therefore, biodiversity studies in this area require the use of complementary measures to achieve appropriate information that may help in the design of conservation strategies. In this work we analyze the dissimilarity of terrestrial vertebrates, and the components of turnover and richness differences, among six ecoregions in the state of Hidalgo, central Mexico. We follow two approaches: one based on species level dissimilarity, and the second on taxonomic dissimilarity. We used databases from the project "Biodiversity in the state of Hidalgo". Our results indicate that species dissimilarity is higher than taxonomic dissimilarity, and that turnover contributes more than richness differences, both for species and taxonomic total dissimilarity. Moreover, total dissimilarity, turnover dissimilarity and the dissimilarity due to richness differences were positively related in the four vertebrate groups. Reptiles had the highest values of dissimilarity, followed by mammals, amphibians and birds. For reptiles, birds, and mammals, species turnover was the most important component, while richness differences had a higher contribution for amphibians. The highest values of dissimilarity occurred between environmentally contrasting ecoregions (i.e., tropical and temperate forests), which suggests that environmental heterogeneity and differences in the origin of biotas are key factors driving beta diversity of terrestrial vertebrates among ecoregions in this complex area.
