Dataset Information

Online cross-validation-based ensemble learning.


ABSTRACT: Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to identify the algorithm with the best performance. We show that by basing estimates on the cross-validation-selected algorithm, we are asymptotically guaranteed to perform as well as the true, unknown best-performing algorithm. We provide extensions of this approach, including online estimation of the optimal ensemble of candidate online estimators. We illustrate excellent performance of our methods using simulations and a real data example where we make streaming predictions of infectious disease incidence using data from a large database. Copyright © 2017 John Wiley & Sons, Ltd.
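
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two candidate learners (RunningMean, SGDLinear), the squared-error loss, the simulated data stream, and the helper names online_cv_select and simulate_batches are all assumptions made for illustration. The idea shown is that each incoming batch is first used to score every candidate's current fit, and only afterwards used to update the candidates, so no past data need to be revisited.

```python
import random


class RunningMean:
    """Hypothetical candidate 1: predicts the running mean of outcomes seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def update(self, batch):
        for _, y in batch:
            self.n += 1
            self.mean += (y - self.mean) / self.n


class SGDLinear:
    """Hypothetical candidate 2: one-feature linear regression fit by SGD."""

    def __init__(self, lr=0.05):
        self.lr = lr
        self.w = 0.0
        self.b = 0.0

    def predict(self, x):
        return self.w * x + self.b

    def update(self, batch):
        for x, y in batch:
            err = self.predict(x) - y
            self.w -= self.lr * err * x
            self.b -= self.lr * err


def online_cv_select(batches, candidates):
    """Score each candidate on every new batch BEFORE updating it with that batch.

    Returns the name of the candidate with the lowest online cross-validated
    risk (assumed here to be mean squared error) and all estimated risks.
    """
    cum_loss = {name: 0.0 for name in candidates}
    n_scored = 0
    for batch in batches:
        # Validation step: the batch has not yet been seen by any candidate.
        for name, est in candidates.items():
            cum_loss[name] += sum((est.predict(x) - y) ** 2 for x, y in batch)
        n_scored += len(batch)
        # Training step: each candidate absorbs the batch, which is then discarded.
        for est in candidates.values():
            est.update(batch)
    risks = {name: loss / n_scored for name, loss in cum_loss.items()}
    return min(risks, key=risks.get), risks


def simulate_batches(n_batches=200, batch_size=10, seed=0):
    """Assumed data stream: y = 2x + 1 plus small Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        yield [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


candidates = {"running_mean": RunningMean(), "sgd_linear": SGDLinear()}
best, risks = online_cv_select(simulate_batches(), candidates)
print(best, risks)
```

In this sketch the selector is reported once at the end of the stream; it could equally be recomputed after every batch, since the cumulative losses are themselves updated online.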

SUBMITTER: Benkeser D 

PROVIDER: S-EPMC5671383 | biostudies-literature | 2018 Jan

REPOSITORIES: biostudies-literature

Publications

Online cross-validation-based ensemble learning.

David Benkeser, Cheng Ju, Sam Lendle, Mark van der Laan

Statistics in Medicine, 2017-05-04, issue 2



Similar Datasets

| S-EPMC9386585 | biostudies-literature
| S-EPMC10464108 | biostudies-literature
| S-EPMC10798480 | biostudies-literature
| S-EPMC4439506 | biostudies-other
| S-EPMC9049841 | biostudies-literature
| S-EPMC9777370 | biostudies-literature
| S-EPMC8882731 | biostudies-literature
| S-EPMC8748946 | biostudies-literature
| S-EPMC7924479 | biostudies-literature
| S-EPMC10791756 | biostudies-literature