Dataset Information

FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data.

ABSTRACT: Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This puts it in a special position to rule mine clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
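
To make the two central notions in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not FDTool's code and does not implement FD_Mine or its equivalence pruning; the function names `fd_holds` and `candidate_keys` and the sample rows are hypothetical, chosen only for illustration. An FD X -> Y holds when rows that agree on X also agree on Y, and a candidate key is a minimal attribute set that determines every attribute.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are tuples of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key is stored; any later mismatch
        # means two rows agree on lhs but differ on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


def candidate_keys(rows, attributes):
    """Return minimal attribute sets that determine all attributes.

    Exhaustive search over attribute subsets, smallest first, so the cost
    grows exponentially with the number of attributes.
    """
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if fd_holds(rows, combo, attributes):
                keys.append(combo)
    return keys


# Hypothetical sample data, not taken from the paper.
rows = [
    {"id": 1, "zip": "27514", "city": "Chapel Hill"},
    {"id": 2, "zip": "27514", "city": "Chapel Hill"},
    {"id": 3, "zip": "27701", "city": "Durham"},
]

print(fd_holds(rows, ("zip",), ("city",)))
print(fd_holds(rows, ("city",), ("id",)))
print(candidate_keys(rows, ("id", "zip", "city")))
```

Because the search enumerates attribute subsets, its cost is driven far more by the number of columns than by the number of rows, which is in line with the abstract's observation about attribute count.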

SUBMITTER: Buranosky M

PROVIDER: S-EPMC6489977 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

Publications

FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data.

Matt Buranosky, Elmar Stellnberger, Emily Pfaff, David Diaz-Sanchez, Cavin Ward-Caviness

F1000Research, 2018-10-19

PMID: 31069050

Similar Datasets

Project description: A severe, sometimes fatal respiratory disease has been observed in captive ball pythons (Python regius) since the late 1990s. In order to better understand this disease and its etiology, we collected case and control samples and performed pathological and diagnostic analyses. Electron micrographs revealed filamentous virus-like particles in lung epithelial cells of sick animals. Diagnostic testing for known pathogens did not identify an etiologic agent, so unbiased metagenomic sequencing was performed. Abundant nidovirus-like sequences were identified in cases and were used to assemble the genome of a previously unknown virus in the order Nidovirales. The nidoviruses, which were not previously known to infect nonavian reptiles, are a diverse order that includes important human and veterinary pathogens. The presence of the viral RNA was confirmed in all diseased animals (n = 8) but was not detected in healthy pythons or other snakes (n = 57). Viral RNA levels were generally highest in the lung and other respiratory tract tissues. The 33.5-kb viral genome is the largest RNA genome yet described and shares canonical characteristics with other nidovirus genomes, although several features distinguish this from related viruses. This virus, which we named ball python nidovirus (BPNV), will likely establish a new genus in the Torovirinae subfamily. The identification of a novel nidovirus in reptiles contributes to our understanding of the biology and evolution of related viruses, and its association with lung disease in pythons is a promising step toward elucidating an etiology for this long-standing veterinary disease.

Importance: Ball pythons are popular pets because of their diverse coloration, generally nonaggressive behavior, and relatively small size. Since the 1990s, veterinarians have been aware of an infectious respiratory disease of unknown cause in ball pythons that can be fatal. We used unbiased shotgun sequencing to discover a novel virus in the order Nidovirales that was present in cases but not controls. While nidoviruses are known to infect a variety of animals, this is the first report of a nidovirus recovered from any reptile. This report will enable diagnostics that will assist in determining the role of this virus in the causation of disease, which would allow control of the disease in zoos and private collections. Given its evolutionary divergence from known nidoviruses and its unique host, the study of reptile nidoviruses may further our understanding of related diseases and the viruses that cause them in humans and other animals.

Project description: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages, and dependency conflicts may arise during installation. To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner. This software is a resource to researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.
