Dataset Information

Automatic identification of variables in epidemiological datasets using logic regression.

ABSTRACT: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software which, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables. We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
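
The idea in the abstract (hand-written rules, a search for a good Boolean combination of them per target variable, then PPV and NPV as quality measures) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rule names, the toy source variables, and the hypothetical target "total cholesterol" are invented, and the exhaustive search over one- and two-rule expressions is a simple stand-in, not the logic regression procedure the authors used.

```python
from itertools import combinations, permutations

# Hypothetical hand-written rules for the target "total cholesterol".
# Each rule maps a source-variable description to True or False.
RULES = {
    "name_has_chol": lambda v: "chol" in v["name"].lower(),
    "name_has_hdl": lambda v: "hdl" in v["name"].lower(),
    "is_numeric": lambda v: v["dtype"] == "numeric",
}

# Invented construction subset: source variables with a known answer.
CONSTRUCTION = [
    {"name": "CHOL_TOTAL", "dtype": "numeric", "is_target": True},
    {"name": "hdl_chol", "dtype": "numeric", "is_target": False},
    {"name": "tchol", "dtype": "numeric", "is_target": True},
    {"name": "lipid_med", "dtype": "text", "is_target": False},
    {"name": "age", "dtype": "numeric", "is_target": False},
    {"name": "sex", "dtype": "text", "is_target": False},
]


def candidates(rules):
    """Yield (label, predicate) for single rules and two-rule combinations."""
    names = list(rules)
    for a in names:
        yield a, rules[a]
    for a, b in permutations(names, 2):
        yield f"{a} AND NOT {b}", (lambda v, a=a, b=b: rules[a](v) and not rules[b](v))
    for a, b in combinations(names, 2):
        yield f"{a} AND {b}", (lambda v, a=a, b=b: rules[a](v) and rules[b](v))
        yield f"{a} OR {b}", (lambda v, a=a, b=b: rules[a](v) or rules[b](v))


def best_combination(rules, variables):
    """Return the candidate that classifies the most variables correctly."""
    def n_correct(item):
        _, predicate = item
        return sum(predicate(v) == v["is_target"] for v in variables)
    return max(candidates(rules), key=n_correct)


def ppv_npv(predicted, actual):
    """Positive and negative predictive value; None if a denominator is zero."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


label, predicate = best_combination(RULES, CONSTRUCTION)
predicted = [predicate(v) for v in CONSTRUCTION]
actual = [v["is_target"] for v in CONSTRUCTION]
ppv, npv = ppv_npv(predicted, actual)
print(label, ppv, npv)
```

In the study design described above, the selected combination would then be applied unchanged to a separate validation subset, and PPV and NPV recomputed there; the toy data here are far too small to say anything about real performance.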

SUBMITTER: Lorenz MW

PROVIDER: S-EPMC5390441 | biostudies-literature | 2017 Apr

REPOSITORIES: biostudies-literature

Publications

Automatic identification of variables in epidemiological datasets using logic regression.

Lorenz Matthias W, Abdi Negin Ashtiani, Scheckenbach Frank, Pflug Anja, Bülbül Alpaslan, Catapano Alberico L, Agewall Stefan, Ezhov Marat, Bots Michiel L, Kiechl Stefan, Orth Andreas

BMC Medical Informatics and Decision Making, 2017 Apr 13, issue 1

PMID: 28407816

Similar Datasets

Project description: Technological advances in the field of animal tracking have greatly expanded the potential to remotely monitor animals, opening the door to exploring how animals shift their behaviour over time or respond to external stimuli. A wide variety of animal-borne sensors can provide information on an animal's location, movement characteristics, external environmental conditions and internal physiological status. Here, we demonstrate how piecewise regression can be used to identify the presence and timing of potential shifts in a variety of biological responses using multiple biotelemetry data streams. Different biological latent states can be inferred by partitioning a time-series into multiple segments based on changes in modelled responses (e.g. their mean, variance, trend, degree of autocorrelation) and specifying a unique model structure for each interval. We provide six example applications highlighting a variety of taxonomic species, data streams, timescales and biological phenomena. These examples include a short-term behavioural response (flee and return) by a trumpeter swan Cygnus buccinator following a GPS collar deployment; remote identification of parturition based on movements by a pregnant moose Alces alces; a physiological response (spike in heart-rate) in a black bear Ursus americanus to a stressful stimulus (presence of a drone); a mortality event of a trumpeter swan signalled by changes in collar temperature and overall dynamic body acceleration; an unsupervised method for identifying the onset, return, duration and staging use of sandhill crane Antigone canadensis migration; and estimation of the transition between incubation and brood-rearing (i.e. hatching) for a breeding trumpeter swan.
We implement analyses using the mcp package in R, which provides functionality for specifying and fitting a wide variety of user-defined model structures in a Bayesian framework and methods for assessing and comparing models using information criteria and cross-validation measures. These simple modelling approaches are accessible to a wide audience and offer a straightforward means of assessing a variety of biologically relevant changes in animal behaviour.

