Dataset Information

HAPNEST synthetic dataset

ABSTRACT: This synthetic dataset contains genetic data for 1,008,000 individuals and 9 continuous phenotypic traits with various genetic architectures. The dataset includes 6 ancestry groups (AFR, AMR, CSA, EAS, EUR, MID) and over 6.8 million single nucleotide polymorphisms (SNPs) across 22 chromosomes. The data was generated using the HAPNEST software program (https://github.com/intervene-EU-H2020/synthetic_data) developed by members of the INTERVENE consortium (https://www.interveneproject.eu/). This software has been specifically designed to enable efficient, large-scale synthetic data generation for common genetic variants and complex phenotypic traits. We have open-sourced this software so that anyone can easily generate their own synthetic datasets. Please see the linked GitHub repository for further details. The reference dataset used to generate this synthetic dataset is the combined 1000 Genomes Project and Human Genomic Diversity Project datasets downloaded from https://gnomad.broadinstitute.org/downloads. The data was preprocessed by retaining SNPs that have a non-zero minor allele frequency (MAF) in all populations and for which rsID numbers could be successfully aligned. This resulted in over 6.8 million variants across 22 chromosomes.
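The preprocessing rule described in the abstract (keep a SNP only if its MAF is non-zero in every population) can be illustrated with a minimal sketch. This is not the HAPNEST or INTERVENE pipeline: the function names, the 0/1/2 genotype coding, and the toy data are assumptions made for illustration, and the rsID alignment step is omitted.

```python
from typing import Dict, List


def minor_allele_frequency(genotypes: List[int]) -> float:
    """MAF of one SNP, given diploid genotypes coded as 0, 1 or 2
    copies of the alternate allele (an assumed coding)."""
    if not genotypes:
        return 0.0
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def snps_polymorphic_everywhere(
    genotypes_by_pop: Dict[str, Dict[str, List[int]]]
) -> List[str]:
    """Return the SNP IDs whose MAF is non-zero in every population.

    genotypes_by_pop maps population label -> SNP ID -> genotype list.
    SNPs missing from any population are dropped.
    """
    pops = list(genotypes_by_pop.values())
    if not pops:
        return []
    shared = set(pops[0])
    for pop in pops[1:]:
        shared &= set(pop)
    return sorted(
        snp
        for snp in shared
        if all(minor_allele_frequency(pop[snp]) > 0 for pop in pops)
    )


# Hypothetical toy data: rs2 is monomorphic in EUR, so it is filtered out.
toy = {
    "EUR": {"rs1": [0, 1, 2], "rs2": [0, 0, 0]},
    "AFR": {"rs1": [1, 1, 0], "rs2": [0, 1, 0]},
}
kept = snps_polymorphic_everywhere(toy)
```

On real data this filter would normally be run with a dedicated genetics toolkit on per-population allele counts; the pure-Python version above only shows the logic of the rule.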

ORGANISM(S): Homo sapiens (human)

SUBMITTER:

PROVIDER: S-BSST936 | biostudies-other

REPOSITORIES: biostudies-other

Similar Datasets

A synthetic building operation dataset.

Project description:This paper presents a synthetic building operation dataset which includes HVAC, lighting, and miscellaneous electric loads (MELs) system operating conditions, occupant counts, environmental parameters, and end-use and whole-building energy consumption at 10-minute intervals. The data is created with 1395 annual simulations using the U.S. DOE detailed medium-sized reference office building and 30 years of historical weather data in three typical climates: Miami, San Francisco, and Chicago. Three energy efficiency levels of the building and systems are considered. Assumptions regarding occupant movements, occupants' diverse temperature preferences, lighting, and MELs are adopted to reflect realistic building operations. A semantic building metadata schema, BRICK, is used to store the building metadata. The dataset is saved as a 1.2 TB compressed HDF5 file. This dataset can be used in various applications, including building energy and load shape benchmarking, energy model calibration, evaluation of occupant and weather variability and their influences on building performance, algorithm development and testing for thermal and energy load prediction, model predictive control, and policy development for reinforcement-learning-based building controls.

| S-EPMC8355154 | biostudies-literature

Synthetic dataset for visco-acoustic imaging.

Project description:We provide a computationally generated dataset simulating the propagation of ultrasonic waves in viscous tissues in two- and three-dimensional domains. The dataset contains physical parameters of a human breast with a high-contrast inclusion, the acquisition setup with positions of sources and receivers, and the associated pressure-wave data at ultrasonic frequencies. We simulated the wave propagation based on seven different viscous models using the physical parameters of the breast. Furthermore, different choices of conditions for the medium's boundaries are given, namely absorbing and reflecting boundaries. The dataset makes it possible to evaluate the performance of reconstruction methods for ultrasound imaging under attenuation model uncertainty, that is, when the precise attenuation law that characterizes the medium is unknown. In addition, the dataset makes it possible to evaluate the robustness of inverse schemes in the context of reflecting boundary conditions, where multiple reflections illuminate the sample, and/or the performance of data processing to suppress these multiple reflections.

| S-EPMC10192678 | biostudies-literature

A synthetic dataset of liver disorder patients.

Project description:The data in this article include 10,000 synthetic patients with liver disorders, characterized by 70 different variables, including clinical features and patient outcomes such as hospital admission or surgery. Patient data are generated, simulating real patient data as closely as possible, using a publicly available Bayesian network describing a causal model for liver disorders. By varying the network parameters, we also generated an additional set of 500 patients with characteristics that deviated from the initial patient population. We provide an overview of the synthetic data generation process and the associated scripts for generating the cohorts. This dataset can be useful for training and validating machine learning models, especially under the effect of dataset shift between training and testing sets.

| S-EPMC9898618 | biostudies-literature

A dataset of synthetic art dialogues with ChatGPT.

Project description:This paper introduces Art_GenEvalGPT, a novel dataset of synthetic dialogues centered on art generated through ChatGPT. Unlike existing datasets focused on conventional art-related tasks, Art_GenEvalGPT delves into nuanced conversations about art, encompassing a wide variety of artworks, artists, and genres, and incorporating emotional interventions, integrating speakers' subjective opinions and different roles for the conversational agents (e.g., teacher-student, expert guide, anthropic behavior or handling toxic users). The generation and evaluation stages of the GenEvalGPT platform are used to create the dataset, which includes 13,870 synthetic dialogues, covering 799 distinct artworks, 378 different artists, and 26 art styles. Automatic and manual assessments confirm the high quality of the synthetic dialogues generated. For the profile recovery, promising lexical and semantic metrics for objective and factual attributes are offered. For subjective attributes, the evaluation for detecting emotions or subjectivity in the interventions achieves 92% accuracy using LLM self-assessment metrics.

| S-EPMC11283562 | biostudies-literature

A synthetic Longitudinal Study dataset for England and Wales.

Project description:This article describes the new synthetic England and Wales Longitudinal Study 'spine' dataset designed for teaching and experimentation purposes. In the United Kingdom, there exist three Census-based longitudinal micro-datasets, known collectively as the Longitudinal Studies. The England and Wales Longitudinal Study (LS) is a 1% sample of the population of England and Wales (around 500,000 individuals), linking individual person records from the 1971 to 2011 Censuses. The synthetic data presented contains a similar number of individuals to the original data and accurate longitudinal transitions between 2001 and 2011 for key demographic variables, but unlike the original data, is open access.

| S-EPMC5021767 | biostudies-literature

synthetic sequence dataset database metagenome

Project description:synthetic sequence dataset database metagenome

| PRJEB4579 | ENA

A North Atlantic synthetic tropical cyclone track, intensity, and rainfall dataset.

Project description:Tropical Cyclones (TCs) cause significant socio-economic damages to the US and Caribbean coastal regions annually, making it important to understand TC risk at the local-to-regional scales. However, the short length of the observed record and the substantial computational expense associated with high-resolution climate models make it difficult to assess TC risk using either approach. To overcome these challenges, we developed a database of synthetic TCs using the Risk Analysis Framework for Tropical Cyclones (RAFT). The database includes 40,000 synthetic TC tracks, along-track intensities and storm-induced precipitation. TC tracks generated in RAFT are in reasonable agreement with the observed spatial distribution of TC tracks and basin-scale TC statistics. Specifically along the coast, spatial variations in TC crossing probability and extreme winds upon landfall are well-reproduced by RAFT with R-squared values of 0.81 and 0.73, respectively. In summary, the synthetic TC database constructed with RAFT provides a reasonable pathway for the robust assessment of North Atlantic TC wind and rainfall risks.

| S-EPMC10811331 | biostudies-literature

Synthetic feature pairs dataset and siamese convolutional model for image matching.

Project description:In a previous publication [1], we created a dataset of feature patches for detection model training. In this paper, we use the same patches to create a new large synthetic dataset of feature pairs, similar and different, in order to perform, thanks to a siamese convolutional model, the description and matching of the detected features. We thus complete the entire matching pipeline. The accurate manual labeling of image features being very difficult because of their large number and the various associated parameters of position, scale and rotation, recent deep learning models use the result of handcrafted methods for training. Compared to existing datasets, ours avoids model training with false detections of the extraction of feature patches by other algorithms, or with inaccuracy errors of manual labeling. The other advantage of synthetic patches is that we can control their content (corners, edges, etc.), as well as their geometric and photometric parameters, and therefore we control the invariance of the model. The proposed datasets thus allow a new approach to train the different matching modules without using traditional methods. To our knowledge, these are the first feature datasets based on generated synthetic patches for image matching.

| S-EPMC8873551 | biostudies-literature

Synthetic car dataset for vehicle detection: Integrating aerial and satellite imagery.

Project description:Vehicle detection is a very important application of computer vision to aerial and satellite imagery, facilitating activities such as instance counting, velocity estimation, traffic predictions, etc. The feasibility of accurate vehicle detection often depends on limited training datasets, requiring a lot of manual work in collection and annotation tasks. Furthermore, there are no known publicly available datasets. Our aim was to construct a pipeline for synthetic dataset generation from aerial imagery and 3D models in Blender software. The dataset generation pipeline consists of seven steps and produces the desired number of images with bounding boxes in YOLO and COCO formats. This synthetic dataset has been produced following the steps described in this pipeline. It consists of 5000 images of 2048 × 2048 pixels, in which cars are inserted onto the roads and highways of car-free images from all over the world. We believe that this dataset and the respective pipeline might be of great importance for vehicle detection, facilitating the customizability of the models to specific needs and context.

| S-EPMC10875228 | biostudies-literature

Generation of a global synthetic tropical cyclone hazard dataset using STORM.

Project description:Over the past few decades, the world has seen substantial tropical cyclone (TC) damages, with the 2017 Hurricanes Harvey, Irma and Maria entering the top-5 costliest Atlantic hurricanes ever. Calculating TC risk at a global scale, however, has proven difficult given the limited temporal and spatial information on TCs across much of the global coastline. Here, we present a novel database on TC characteristics on a global scale using a newly developed synthetic resampling algorithm we call STORM (Synthetic Tropical cyclOne geneRation Model). STORM can be applied to any meteorological dataset to statistically resample and model TC tracks and intensities. We apply STORM to extracted TCs from 38 years of historical data from IBTrACS to statistically extend this dataset to 10,000 years of TC activity. We show that STORM preserves the TC statistics as found in the original dataset. The STORM dataset can be used for TC hazard assessments and risk modeling in TC-prone regions.

| S-EPMC7005259 | biostudies-literature
