Dataset Information

Multiple imputation of missing data in nested case-control and case-cohort studies.

ABSTRACT: The nested case-control and case-cohort designs are two main approaches for carrying out a substudy within a prospective cohort. This article adapts multiple imputation (MI) methods for handling missing covariates in full-cohort studies for nested case-control and case-cohort studies. We consider data missing by design and data missing by chance. MI analyses that make use of full-cohort data and MI analyses based on substudy data only are described, alongside an intermediate approach in which the imputation uses full-cohort data but the analysis uses only the substudy. We describe adaptations to two imputation methods: the approximate method (MI-approx) of White and Royston (2009) and the "substantive model compatible" (MI-SMC) method of Bartlett et al. (2015). We also apply the "MI matched set" approach of Seaman and Keogh (2015) to nested case-control studies, which does not require any full-cohort information. The methods are investigated using simulation studies and all perform well when their assumptions hold. Substantial gains in efficiency can be made by imputing data missing by design using the full-cohort approach or by imputing data missing by chance in analyses using the substudy only. The intermediate approach brings greater gains in efficiency relative to the substudy approach and is more robust to imputation model misspecification than the full-cohort approach. The methods are illustrated using the ARIC Study cohort. Supplementary Materials provide R and Stata code.

SUBMITTER: Keogh RH

PROVIDER: S-EPMC6481559 | biostudies-literature | 2018 Dec

REPOSITORIES: biostudies-literature

Publications

Multiple imputation of missing data in nested case-control and case-cohort studies.

Keogh RH, Seaman SR, Bartlett JW, Wood AM

Biometrics, 2018-06-05, issue 4

PMID: 29870056

Similar Datasets

Project description: Pooling biomarker data across multiple studies allows for examination of a wider exposure range than generally possible in individual studies, evaluation of population subgroups and disease subtypes with more statistical power, and more precise estimation of biomarker-disease associations. However, circulating biomarker measurements often require calibration to a single reference assay prior to pooling due to assay and laboratory variability across studies. We propose several methods for calibrating and combining biomarker data from nested case-control studies when reference assay data are obtained from a subset of controls in each contributing study. Specifically, we describe a two-stage calibration method and two aggregated calibration methods, named the internalized and full calibration methods, to evaluate the main effect of the biomarker exposure on disease risk and whether that association is modified by a potential covariate. The internalized method uses the reference laboratory measurement in the analysis when available and otherwise uses the estimated value derived from calibration models. The full calibration method uses calibrated biomarker measurements for all subjects, including those with reference laboratory measurements. Under the two-stage method, investigators complete study-specific analyses in the first stage followed by meta-analysis in the second stage. Our results demonstrate that the full calibration method is the preferred aggregated approach to minimize bias in point estimates. We also observe that the two-stage and full calibration methods provide similar effect and variance estimates but that their variance estimates are slightly larger than those from the internalized approach. As an illustrative example, we apply the three methods in a pooling project of nested case-control studies to evaluate (i) the association between circulating vitamin D levels and risk of stroke and (ii) how body mass index modifies the association between circulating vitamin D levels and risk of cardiovascular disease.
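The difference between the internalized and full calibration methods described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear form of the calibration model, the function names `fit_calibration` and `calibrate`, and the use of NaN to mark subjects without a reference measurement are all assumptions made for illustration, and the sketch covers a single contributing study.

```python
import numpy as np

def fit_calibration(local, reference):
    """Fit reference ~ a + b * local by least squares on the calibration subset.

    A straight-line calibration model is an assumption of this sketch.
    """
    b, a = np.polyfit(local, reference, deg=1)  # polyfit returns slope first
    return a, b

def calibrate(local, reference, method="full"):
    """Return the biomarker values that would enter the pooled analysis.

    local     : study-specific assay measurements for all subjects
    reference : reference-assay measurements, NaN where not re-assayed
    method    : "full" or "internalized"
    """
    local = np.asarray(local, dtype=float)
    reference = np.asarray(reference, dtype=float)
    has_ref = ~np.isnan(reference)

    # Calibration model is estimated only from subjects with both measurements.
    a, b = fit_calibration(local[has_ref], reference[has_ref])
    predicted = a + b * local

    if method == "full":
        # Calibrated value for everyone, even where a reference value exists.
        return predicted
    if method == "internalized":
        # Reference value where available, calibrated value otherwise.
        return np.where(has_ref, reference, predicted)
    raise ValueError("method must be 'full' or 'internalized'")
```

In a pooling project this step would be repeated within each contributing study before the combined analysis; the sketch omits the disease model and the variance estimation that the description compares.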

Project description: Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing-at-random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring the distance between each missing value and the non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from the two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
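The donor-selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the two working-model scores are taken as already fitted and reduced to one number per subject, the weighted Euclidean distance, the `weight` and `n_donors` parameters, the `-1` placeholder for missing outcomes, and the function names are all hypothetical.

```python
import numpy as np

def _standardise(score):
    """Centre and scale a score so the two working models are comparable."""
    score = np.asarray(score, dtype=float)
    sd = score.std()
    return (score - score.mean()) / sd if sd > 0 else score - score.mean()

def nn_impute_once(outcome, observed, score_outcome, score_missing,
                   weight=0.5, n_donors=5, rng=None):
    """Fill each missing outcome with the value of a randomly chosen near donor."""
    rng = np.random.default_rng() if rng is None else rng
    outcome = np.asarray(outcome)
    observed = np.asarray(observed, dtype=bool)
    s1 = _standardise(score_outcome)   # from the outcome model
    s2 = _standardise(score_missing)   # from the missingness model

    donors = np.flatnonzero(observed)
    completed = outcome.copy()
    for i in np.flatnonzero(~observed):
        # Weighted distance between subject i and every observed subject.
        dist = np.sqrt(weight * (s1[donors] - s1[i]) ** 2
                       + (1.0 - weight) * (s2[donors] - s2[i]) ** 2)
        k = min(n_donors, donors.size)
        nearest = donors[np.argsort(dist, kind="stable")[:k]]
        completed[i] = outcome[rng.choice(nearest)]
    return completed

def category_proportions(outcome, observed, score_outcome, score_missing,
                         n_imputations=20, seed=0, **kwargs):
    """Average the category proportions over several imputed data sets."""
    outcome = np.asarray(outcome)
    observed = np.asarray(observed, dtype=bool)
    rng = np.random.default_rng(seed)
    categories = np.unique(outcome[observed])
    props = np.zeros((n_imputations, categories.size))
    for m in range(n_imputations):
        completed = nn_impute_once(outcome, observed, score_outcome,
                                   score_missing, rng=rng, **kwargs)
        props[m] = [(completed == c).mean() for c in categories]
    return dict(zip(categories.tolist(), props.mean(axis=0)))
```

Setting `weight` closer to 1 lets the outcome model dominate the distance and closer to 0 lets the missingness model dominate, which is one simple way to express the balance between the two working models that the description says must be chosen with care. Variance estimation across imputations is not shown.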
