Dataset Information

G-Tric: generating three-way synthetic datasets with triclustering solutions.

ABSTRACT:

Background

Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations [Formula: see text] features [Formula: see text] contexts). With increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground-truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real 3-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output.

Results

G-Tric can replicate real-world datasets and create new ones that match researchers needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled, by defining the amount of missing, noise or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.

Conclusions

Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric's potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches.

SUBMITTER: Lobo J

PROVIDER: S-EPMC7789692 | biostudies-literature | 2021 Jan

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

G-Tric: generating three-way synthetic datasets with triclustering solutions.

Lobo João J Henriques Rui R Madeira Sara C SC

BMC bioinformatics 20210107 1

<h4>Background</h4>Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations [Formula: see text] features [Formula: ...[more]

PMID: 33413095

Similar Datasets

Project description:This paper presents datasets utilised for synthetic near-infrared (NIR) image generation and bounding-box level fruit detection systems. A high-quality dataset is one of the essential building blocks that can lead to success in model generalisation and the deployment of data-driven deep neural networks. In particular, synthetic data generation tasks often require more training samples than other supervised approaches. Therefore, in this paper, we share the NIR+RGB datasets that are re-processed from two public datasets (i.e., nirscene and SEN12MS), expanded our previous study, deepFruits, and our novel NIR+RGB sweet pepper (capsicum) dataset. We oversampled from the original nirscene dataset at 10, 100, 200, and 400 ratios that yielded a total of 127 k pairs of images. From the SEN12MS satellite multispectral dataset, we selected Summer (45 k) and All seasons (180k) subsets and applied a simple yet important conversion: digital number (DN) to pixel value conversion followed by image standardisation. Our sweet pepper dataset consists of 1615 pairs of NIR+RGB images that were collected from commercial farms. We quantitatively and qualitatively demonstrate that these NIR+RGB datasets are sufficient to be used for synthetic NIR image generation. We achieved Frechet inception distances (FIDs) of 11.36, 26.53, and 40.15 for nirscene1, SEN12MS, and sweet pepper datasets, respectively. In addition, we release manual annotations of 11 fruit bounding boxes that can be exported in various formats using cloud service. Four newly added fruits (blueberry, cherry, kiwi and wheat) compound 11 novel bounding box datasets on top of our previous work presented in the deepFruits project (apple, avocado, capsicum, mango, orange, rockmelon and strawberry). The total number of bounding box instances of the dataset is 162 k and it is ready to use from a cloud service. For the evaluation of the dataset, Yolov5 single stage detector is exploited and reported impressive mean-average-precision, mAP[0.5:0.95] results of min:0.49, max:0.812. We hope these datasets are useful and serve as a baseline for future studies.

Project description:IntroductionWhole-metagenome sequencing can be a rich source of information about the structure and function of entire metagenomic communities, but getting accurate and reliable results from these datasets can be challenging. Analysis of these datasets is founded on the mapping of sequencing reads onto known genomic regions from known organisms, but short reads will often map equally well to multiple regions, and to multiple reference organisms. Assembling metagenomic datasets prior to mapping can generate much longer and more precisely mappable sequences but the presence of closely related organisms and highly conserved regions makes metagenomic assembly challenging, and some regions of particular interest can assemble poorly. One solution to these problems is to use specialised tools, such as Kelpie, that can accurately extract and assemble full-length sequences for defined genomic regions from whole-metagenome datasets.MethodsKelpie is a kMer-based tool that generates full-length amplicon-like sequences from whole-metagenome datasets. It takes a pair of primer sequences and a set of metagenomic reads, and uses a combination of kMer filtering, error correction and assembly techniques to construct sets of full-length inter-primer sequences.ResultsThe effectiveness of Kelpie is demonstrated here through the extraction and assembly of full-length ribosomal marker gene regions, as this allows comparisons with conventional amplicon sequencing and published metagenomic benchmarks. The results show that the Kelpie-generated sequences and community profiles closely match those produced by amplicon sequencing, down to low abundance levels, and running Kelpie on the synthetic CAMI metagenomic benchmarking datasets shows similar high levels of both precision and recall.ConclusionsKelpie can be thought of as being somewhat like an in-silico PCR tool, taking a primer pair and producing the resulting 'amplicons' from a whole-metagenome dataset. Marker regions from the 16S rRNA gene were used here as an example because this allowed the overall accuracy of Kelpie to be evaluated through comparisons with other datasets, approaches and benchmarks. Kelpie is not limited to this application though, and can be used to extract and assemble any genomic region present in a whole metagenome dataset, as long as it is bound by a pairs of highly conserved primer sequences.

Project description:Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments that used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724), the Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes thereby alleviating concerns associated with the sharing of real data in some circumstances.

Project description:IntroductionFederated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, therefore minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format.ObjectivesThe objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps.MethodsA scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk.ResultsA total of 69 articles were included in the scoping review; all were published between 2018 and 2023 with two thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data.ConclusionsFederated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks.

Project description:Background. Datasets consisting of synthetic neural data generated with quantifiable and controlled parameters are a valuable asset in the process of testing and validating directed functional connectivity metrics. Considering the recent debate in the neuroimaging community concerning the use of these metrics for fMRI data, synthetic datasets that emulate the BOLD signal dynamics have played a central role by supporting claims that argue in favor or against certain choices. Generative models often used in studies that simulate neuronal activity, with the aim of gaining insight into specific brain regions and functions, have different requirements from the generative models for benchmarking datasets. Even though the latter must be realistic, there is a tradeoff between realism and computational demand that needs to be contemplated and simulations that efficiently mimic the real behavior of single neurons or neuronal populations are preferred, instead of more cumbersome and marginally precise ones. Methods. This work explores how simple generative models are able to produce neuronal datasets, for benchmarking purposes, that reflect the simulated effective connectivity and, how these can be used to obtain synthetic recordings of EEG and fMRI BOLD signals. The generative models covered here are AR processes, neural mass models consisting of linear and nonlinear stochastic differential equations and populations with thousands of spiking units. Forward models for EEG consist in the simple three-shell head model while the fMRI BOLD signal is modeled with the Balloon-Windkessel model or by convolution with a hemodynamic response function. Results. The simulated datasets are tested for causality with the original spectral formulation for Granger causality. Modeled effective connectivity can be detected in the generated data for varying connection strengths and interaction delays. Discussion. All generative models produce synthetic neuronal data with detectable causal effects although the relation between modeled and detected causality varies and less biophysically realistic models offer more control in causal relations such as modeled strength and frequency location.

Dataset Information

G-Tric: generating three-way synthetic datasets with triclustering solutions.

Background

Results

Conclusions

Publications

G-Tric: generating three-way synthetic datasets with triclustering solutions.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets