Dataset Information

Real and synthetic data sets for benchmarking key-value stores focusing on various data types and sizes.

ABSTRACT: In this article, we present real and synthetic data sets for benchmarking key-value stores, focusing on various data types and sizes. Each key-value pair in a key-value data set consists of a key and a value. Any kind of data can be represented as a key-value data set by assigning an arbitrary type of data as the value and a unique ID as the key. Key-value pairs are therefore particularly useful for big data, because the data types in big data applications are becoming more varied and are sometimes not known or determined in advance. In this article, we crawl four kinds of real data sets by varying the type of data (i.e., variety) and generate four kinds of synthetic data sets by varying the size of the data sets (i.e., volume). For the real data sets, we crawl data of various types from Twitter: tweets in text, a list of hashtags, the geo-location of the tweet, and the number of followers. We also present algorithms for crawling real data sets based on REST APIs and streaming APIs, and for generating synthetic data sets. With simple extensions, these algorithms can crawl key-value pairs of any data type supported by Twitter and can generate synthetic data sets of any size. Finally, we show that the crawled and generated data sets have been used with well-known key-value stores such as LevelDB of Google, RocksDB of Facebook, and Berkeley DB of Oracle, where they served to compare the performance of these stores. As an example, we present an algorithm for the basic operations of the LevelDB key-value store.
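The construction described in the abstract (a unique ID as the key, an arbitrary piece of data as the value) and the idea of generating synthetic data sets of a chosen size can be illustrated with a minimal Python sketch. This is not the article's algorithm: the function names, the key format, the JSON serialization, and the sample records are all assumptions made for illustration, and a plain dictionary stands in for a real key-value store.

```python
import json
import random
import string
from typing import Any, Dict, Iterator, List, Tuple


def to_key_value(records: List[Any], prefix: str = "id") -> Dict[bytes, bytes]:
    """Assign a unique zero-padded ID as the key and the serialized record as the value.

    The key format and JSON encoding are illustrative assumptions.
    """
    return {
        f"{prefix}{i:08d}".encode(): json.dumps(rec).encode()
        for i, rec in enumerate(records)
    }


def synthetic_pairs(n: int, value_size: int, seed: int = 0) -> Iterator[Tuple[bytes, bytes]]:
    """Yield n key-value pairs whose values are random ASCII strings of value_size bytes.

    Changing n or value_size changes the volume of the generated data set.
    """
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    for i in range(n):
        value = "".join(rng.choices(string.ascii_letters, k=value_size))
        yield f"key{i:08d}".encode(), value.encode()


# Hypothetical records mimicking the four data types named in the abstract:
# tweet text, a list of hashtags, a geo-location, and a follower count.
records = [
    "an example tweet",
    ["#example", "#benchmark"],
    {"lat": 37.5, "lon": 127.0},
    1234,
]
store = to_key_value(records)

# Basic operations on the dictionary stand-in: put, get, delete.
for key, value in synthetic_pairs(n=5, value_size=16):
    store[key] = value                 # put
first_value = store[b"key00000000"]    # get
del store[b"key00000000"]              # delete
```

Because keys and values are plain byte strings, the same pairs could in principle be written to any store that accepts byte keys and values; which client library to use for that is left open here.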

SUBMITTER: Kwon HY

PROVIDER: S-EPMC7160529 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature


Publications

Real and synthetic data sets for benchmarking key-value stores focusing on various data types and sizes.

Kwon Hyuk-Yoon HY

Data in brief 20200320


PMID: 32322613

Similar Datasets

Project description: The ecological effects of accidental or malicious radioactive contamination are insufficiently understood because of the hazards and difficulties associated with conducting studies in radioactively-polluted areas. Data sets from severely contaminated locations can therefore be small. Moreover, many potentially important factors, such as soil concentrations of toxic chemicals, pH, and temperature, can be correlated with radiation levels and with each other. In such situations, commonly-used statistical techniques like generalized linear models (GLMs) may not be able to provide useful information about how radiation and/or these other variables affect the outcome (e.g. abundance of the studied organisms). Ensemble machine learning methods such as random forests offer powerful alternatives. We propose that analysis of small radioecological data sets by GLMs and/or machine learning can be made more informative by using the following techniques: (1) adding synthetic noise variables to provide benchmarks for distinguishing the performances of valuable predictors from irrelevant ones; (2) adding noise directly to the predictors and/or to the outcome to test the robustness of analysis results against random data fluctuations; (3) adding artificial effects to selected predictors to test the sensitivity of the analysis methods in detecting predictor effects; (4) running a selected machine learning method multiple times (with different random-number seeds) to test the robustness of the detected "signal"; (5) using several machine learning methods to test the "signal's" sensitivity to differences in analysis techniques. Here, we applied these approaches to simulated data, and to two published examples of small radioecological data sets: (I) counts of fungal taxa in samples of soil contaminated by the Chernobyl nuclear power plant accident (Ukraine), and (II) bacterial abundance in soil samples under a ruptured nuclear waste storage tank (USA).
We show that the proposed techniques were advantageous compared with the methodology used in the original publications where the data sets were presented. Specifically, our approach identified a negative effect of radioactive contamination in data set I, and suggested that in data set II stable chromium could have been a stronger limiting factor for bacterial abundance than the radionuclides 137Cs and 99Tc. This new information, which was extracted from these data sets using the proposed techniques, can potentially enhance the design of radioactive waste bioremediation.
