Advancing methodologies for applying machine learning and evaluating spatiotemporal models of fine particulate matter (PM2.5) using satellite data over large regions.
ABSTRACT: Reconstructing the distribution of fine particulate matter (PM2.5) in space and time, even far from ground monitoring sites, is an important exposure science contribution to epidemiologic analyses of PM2.5 health impacts. Flexible statistical methods for prediction have demonstrated the integration of satellite observations with other predictors, yet these algorithms are susceptible to overfitting the spatiotemporal structure of the training datasets. We present a new approach for predicting PM2.5 using machine-learning methods and for evaluating prediction models with the goal of making predictions where they were not previously available. We apply extreme gradient boosting (XGBoost) modeling to predict daily PM2.5 at a 1 km × 1 km resolution for a 13-state region in the Northeastern USA for the years 2000-2015 using satellite-derived aerosol optical depth, and we implement recursive feature selection to develop a parsimonious model. We demonstrate excellent predictions of withheld observations but also contrast an RMSE of 3.11 μg/m³ in our spatial cross-validation, which withholds nearby sites, with an overfit RMSE of 2.10 μg/m³ from a more conventional random ten-fold splitting of the dataset. As the field of exposure science moves forward with the use of advanced machine-learning approaches for spatiotemporal modeling of air pollutants, our results show the importance of addressing data leakage in training, overfitting to spatiotemporal structure, and the impact of the predominance of ground monitoring sites in dense urban sub-networks on model evaluation. The strengths of our resultant modeling approach for exposure in epidemiologic studies of PM2.5 include improved efficiency, parsimony, and interpretability with robust validation while still accommodating complex spatiotemporal relationships.
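The contrast between random ten-fold splitting and spatial cross-validation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 1-degree block size, and the synthetic monitor data are all assumptions made for illustration, and grid blocks are only one possible way to keep nearby sites in the same fold.

```python
import random
from collections import defaultdict


def random_folds(n_obs, k, seed=0):
    """Conventional random k-fold: observations are assigned to folds
    without regard to which monitoring site they came from."""
    rng = random.Random(seed)
    order = list(range(n_obs))
    rng.shuffle(order)
    folds = [0] * n_obs
    for position, obs_index in enumerate(order):
        folds[obs_index] = position % k
    return folds


def block_of(lon, lat, block_deg):
    """Index of the grid cell (hypothetical blocking scheme) containing a point."""
    return (int(lon // block_deg), int(lat // block_deg))


def spatial_block_folds(coords, k, block_deg=1.0, seed=0):
    """Spatial k-fold: whole grid blocks are assigned to folds, so a site
    and its neighbours in the same block are always withheld together."""
    blocks = sorted({block_of(lon, lat, block_deg) for lon, lat in coords})
    rng = random.Random(seed)
    rng.shuffle(blocks)
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}
    return [fold_of_block[block_of(lon, lat, block_deg)] for lon, lat in coords]


def sites_split_across_folds(site_ids, folds):
    """Count sites whose observations land in more than one fold,
    a simple indicator of leakage between training and test data."""
    folds_by_site = defaultdict(set)
    for site, fold in zip(site_ids, folds):
        folds_by_site[site].add(fold)
    return sum(1 for fs in folds_by_site.values() if len(fs) > 1)


# Synthetic example (assumed data): 20 monitors with 30 daily observations each,
# placed at random inside a rough bounding box.
rng = random.Random(42)
site_locations = {
    s: (rng.uniform(-80.0, -67.0), rng.uniform(37.0, 47.0)) for s in range(20)
}
site_ids = [s for s in range(20) for _ in range(30)]
coords = [site_locations[s] for s in site_ids]

k = 10
rand = random_folds(len(site_ids), k)
spat = spatial_block_folds(coords, k, block_deg=1.0)

print("sites split across folds (random): ", sites_split_across_folds(site_ids, rand))
print("sites split across folds (spatial):", sites_split_across_folds(site_ids, spat))
```

Under random splitting, observations from the same monitor appear in both training and test folds, so a flexible model can score well by learning site-specific structure. Block assignment removes that overlap, which is one plausible reason a spatially withheld RMSE would be higher than the random-split RMSE.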
SUBMITTER: Just AC
PROVIDER: S-EPMC7591135 | biostudies-literature |
REPOSITORIES: biostudies-literature