Dataset Information

Automatic identification of variables in epidemiological datasets using logic regression.


ABSTRACT: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample, and in 71% of all variables in the validation sample. We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
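To make the approach in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It combines a few hand-written rules into Boolean expressions for a single target variable and scores the best one by PPV and NPV. Logic regression proper searches over logic trees, typically with a stochastic search; here that search is replaced by plain enumeration of one-rule and two-rule expressions. The rule names, the target variable (systolic blood pressure), and the toy metadata are all hypothetical and invented for illustration.

```python
from itertools import combinations

# Toy source-variable metadata (invented for illustration only).
# "is_match" marks whether the variable truly is the target (systolic BP).
VARIABLES = [
    {"name": "SBP1", "label": "Blood pressure, first reading", "mean": 132.0, "is_match": True},
    {"name": "sbp_mean", "label": "Mean systolic BP (mmHg)", "mean": 128.5, "is_match": True},
    {"name": "DBP1", "label": "Diastolic blood pressure", "mean": 81.0, "is_match": False},
    {"name": "age", "label": "Age at baseline", "mean": 61.2, "is_match": False},
    {"name": "sbp_flag", "label": "SBP measured (0/1)", "mean": 0.9, "is_match": False},
]

# Hypothetical simple rules: each maps one variable's metadata to True/False.
RULES = {
    "name_has_sbp": lambda v: "sbp" in v["name"].lower(),
    "label_has_systolic": lambda v: "systolic" in v["label"].lower(),
    "range_plausible": lambda v: 60 <= v["mean"] <= 250,
}


def ppv_npv(predicted, actual):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


def candidate_combinations(variables):
    """Yield (expression, predictions) for single rules and AND/OR pairs."""
    fired = {name: [rule(v) for v in variables] for name, rule in RULES.items()}
    for name, preds in fired.items():
        yield name, preds
    for a, b in combinations(fired, 2):
        yield f"{a} AND {b}", [x and y for x, y in zip(fired[a], fired[b])]
        yield f"{a} OR {b}", [x or y for x, y in zip(fired[a], fired[b])]


def best_combination(variables):
    """Pick the expression with the fewest misclassified variables."""
    actual = [v["is_match"] for v in variables]

    def errors(candidate):
        return sum(p != a for p, a in zip(candidate[1], actual))

    return min(candidate_combinations(variables), key=errors)


actual = [v["is_match"] for v in VARIABLES]
expr, predicted = best_combination(VARIABLES)
ppv, npv = ppv_npv(predicted, actual)
print(expr, ppv, npv)
```

In a setup like the one described, the expression would be selected on the construction subset and then scored unchanged on the validation subset. This sketch scores on the same toy data it was fitted to, so its metrics say nothing about real performance.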

SUBMITTER: Lorenz MW 

PROVIDER: S-EPMC5390441 | biostudies-literature | 2017 Apr

REPOSITORIES: biostudies-literature

Publications

Automatic identification of variables in epidemiological datasets using logic regression.

Matthias W Lorenz, Negin Ashtiani Abdi, Frank Scheckenbach, Anja Pflug, Alpaslan Bülbül, Alberico L Catapano, Stefan Agewall, Marat Ezhov, Michiel L Bots, Stefan Kiechl, Andreas Orth

BMC Medical Informatics and Decision Making, 2017 Apr 13, issue 1


Similar Datasets

| S-EPMC10132077 | biostudies-literature
| S-EPMC8723155 | biostudies-literature
| S-EPMC5675816 | biostudies-literature
| S-EPMC9540865 | biostudies-literature
| S-EPMC5407271 | biostudies-literature
| S-EPMC3287827 | biostudies-other
| S-EPMC8453245 | biostudies-literature
| S-EPMC3413079 | biostudies-literature
| S-EPMC5358897 | biostudies-literature
| S-EPMC10069785 | biostudies-literature