Dataset Information

Comparison of bias analysis strategies applied to a large data set.

ABSTRACT: Epidemiologic data sets continue to grow larger. Probabilistic-bias analyses, which simulate hundreds of thousands of replications of the original data set, may challenge desktop computational resources. We implemented a probabilistic-bias analysis to evaluate the direction, magnitude, and uncertainty of the bias arising from misclassification of prepregnancy body mass index when studying its association with early preterm birth in a cohort of 773,625 singleton births. We compared 3 bias analysis strategies: (1) using the full cohort, (2) using a case-cohort design, and (3) weighting records by their frequency in the full cohort. Underweight and overweight mothers were more likely to deliver early preterm. A validation substudy demonstrated misclassification of prepregnancy body mass index derived from birth certificates. Probabilistic-bias analyses suggested that the association between underweight and early preterm birth was overestimated by the conventional approach, whereas the associations between overweight categories and early preterm birth were underestimated. The 3 bias analyses yielded equivalent results and challenged our typical desktop computing environment. Analyses applied to the full cohort, case cohort, and weighted full cohort required 7.75 days and 4 terabytes, 15.8 hours and 287 gigabytes, and 8.5 hours and 202 gigabytes, respectively. Large epidemiologic data sets often include variables that are imperfectly measured, often because data were collected for other purposes. Probabilistic-bias analysis allows quantification of errors but may be difficult in a desktop computing environment. Solutions that allow these analyses in this environment can be achieved without new hardware and within reasonable computational time frames.
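To make the third strategy concrete, here is a minimal Python sketch of the general idea, not the authors' implementation. It assumes a binary exposure instead of the multi-category body mass index used in the study, assumes nondifferential misclassification, and corrects a collapsed 2x2 table rather than simulating individual records. The Beta parameters for sensitivity and specificity, the function names, and the toy counts are all illustrative assumptions; none come from the validation substudy.

```python
import random
from collections import Counter


def collapse(records):
    """Collapse (exposed, case) records into pattern -> frequency weights."""
    return Counter(records)


def corrected_or(counts, se, sp):
    """Odds ratio after back-correcting exposure misclassification.

    Assumes nondifferential misclassification with sensitivity `se`
    and specificity `sp`. Returns None if the correction is impossible.
    """
    a, b = counts[(1, 1)], counts[(0, 1)]  # exposed / unexposed cases
    c, d = counts[(1, 0)], counts[(0, 0)]  # exposed / unexposed noncases
    n_cases, n_noncases = a + b, c + d
    denom = se + sp - 1
    if denom <= 0:
        return None
    a_true = (a - (1 - sp) * n_cases) / denom
    c_true = (c - (1 - sp) * n_noncases) / denom
    b_true = n_cases - a_true
    d_true = n_noncases - c_true
    if min(a_true, b_true, c_true, d_true) <= 0:
        return None
    return (a_true * d_true) / (b_true * c_true)


def probabilistic_bias_analysis(counts, n_iter=2000, seed=0):
    """Return (2.5th, 50th, 97.5th) percentiles of corrected odds ratios."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_iter):
        se = rng.betavariate(90, 10)  # illustrative prior, not from the study
        sp = rng.betavariate(95, 5)   # illustrative prior, not from the study
        estimate = corrected_or(counts, se, sp)
        if estimate is not None:
            draws.append(estimate)
    draws.sort()

    def percentile(p):
        return draws[int(p * (len(draws) - 1))]

    return percentile(0.025), percentile(0.5), percentile(0.975)


# Toy data: 1,000 records reduce to 4 weighted patterns.
records = [(1, 1)] * 40 + [(0, 1)] * 60 + [(1, 0)] * 200 + [(0, 0)] * 700
counts = collapse(records)
low, median, high = probabilistic_bias_analysis(counts)
```

Each iteration here touches only the distinct covariate patterns, never every record. This is the intuition behind weighting by frequency, although the time and storage figures in the abstract refer to the authors' own record-level analyses.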

SUBMITTER: Lash TL

PROVIDER: S-EPMC4306386 | biostudies-literature | 2014 Jul

REPOSITORIES: biostudies-literature

Publications

Comparison of bias analysis strategies applied to a large data set.

Lash TL, Abrams B, Bodnar LM

Epidemiology (Cambridge, Mass.), 2014 Jul 1, issue 4

PMID: 24815306

Similar Datasets

Bias in Gene-Set Analysis Applied to High-throughput Methylation Data
2012-07-13 | GSE39188 | GEO

Bias in Gene-Set Analysis Applied to High-throughput Methylation Data
2012-07-12 | E-GEOD-39188 | biostudies-arrayexpress

Gene set analysis methods applied to chicken microarray expression data.
| S-EPMC2712751 | biostudies-literature

Classification of a large microarray data set: algorithm comparison and analysis of drug signatures.
| S-EPMC1088301 | biostudies-literature

Multi-Set Testing Strategies Show Good Behavior When Applied to Very Large Sets of Rare Variants.
| S-EPMC7680887 | biostudies-literature

Strategies for controlling non-transmissible infection outbreaks using a large human movement data set.
| S-EPMC4161289 | biostudies-literature

dbVar structural variant cluster set for data analysis and variant comparison.
| S-EPMC5345777 | biostudies-literature

Strategies for MCR image analysis of large hyperspectral data-sets.
| S-EPMC3579489 | biostudies-literature

Comparison of differential accessibility analysis strategies for ATAC-seq data.
| S-EPMC7311460 | biostudies-literature

A co-localization model of paired ChIP-seq data using a large ENCODE data set enables comparison of multiple samples.
| S-EPMC3592427 | biostudies-literature