Dataset Information

A modern maximum-likelihood theory for high-dimensional logistic regression.

ABSTRACT: Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure.
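Claim (i) of the abstract can be probed with a small simulation. The following is a minimal sketch, not the authors' code: it assumes independent Gaussian features with variance 1/n, a ratio p/n = 0.2 (5 observations per parameter), and a coefficient vector scaled so that the linear predictor has variance 5. The function names `fit_logistic_mle` and `simulate_slope`, the default sizes, and the Newton solver with step halving are all illustrative choices. The sketch fits the unpenalized MLE and reports the least-squares slope of the fitted coefficients on the true nonzero coefficients; an unbiased estimator would give a slope near 1.

```python
import numpy as np


def fit_logistic_mle(X, y, n_iter=50, tol=1e-8):
    """Unpenalized logistic MLE (no intercept) via Newton's method with step halving."""
    beta = np.zeros(X.shape[1])

    def nll(b):
        # Negative log-likelihood: sum of log(1 + exp(eta)) - y * eta.
        eta = X @ b
        return np.sum(np.logaddexp(0.0, eta) - y * eta)

    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        grad = X.T @ (mu - y)                    # gradient of the NLL
        if np.linalg.norm(grad) < tol:
            break
        w = mu * (1.0 - mu)                      # Newton weights
        hess = X.T @ (X * w[:, None])            # Hessian of the NLL
        step = np.linalg.solve(hess, grad)
        # Halve the step until the NLL does not increase.
        t, f0 = 1.0, nll(beta)
        while nll(beta - t * step) > f0 and t > 1e-8:
            t *= 0.5
        beta = beta - t * step
    return beta


def simulate_slope(n=1000, p=200, gamma2=5.0, seed=0):
    """Slope of the MLE on the true nonzero coefficients (1.0 would mean no inflation)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))  # assumed design: N(0, 1/n) entries
    k = p // 8                                           # assumed: p/8 positive, p/8 negative, rest zero
    amp = np.sqrt(gamma2 * n / (2 * k))                  # makes Var(x' beta) equal gamma2
    beta = np.zeros(p)
    beta[:k] = amp
    beta[k:2 * k] = -amp
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, prob).astype(float)
    beta_hat = fit_logistic_mle(X, y)
    signal = beta != 0
    return float(beta_hat[signal] @ beta[signal] / (beta[signal] @ beta[signal]))


if __name__ == "__main__":
    print("slope of MLE on true coefficients:", simulate_slope())
```

A slope noticeably above 1 in this setting would be consistent with the abstract's statement that the MLE is biased away from zero when n and p grow in a fixed ratio; rerunning with a much smaller p/n should bring the slope back toward 1.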

SUBMITTER: Sur P

PROVIDER: S-EPMC6642380 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

A modern maximum-likelihood theory for high-dimensional logistic regression.

Pragya Sur, Emmanuel J. Candès

Proceedings of the National Academy of Sciences of the United States of America, 2019-07-01, issue 29

PMID: 31262828

Similar Datasets

On High-Dimensional Constrained Maximum Likelihood Inference.
| S-EPMC7418862 | biostudies-literature

Efficient posterior sampling for high-dimensional imbalanced logistic regression.
| S-EPMC7799181 | biostudies-literature

On the correspondence of deviances and maximum-likelihood and interval estimates from log-linear to logistic regression modelling.
| S-EPMC7029921 | biostudies-literature

Inference for the Case Probability in High-dimensional Logistic Regression.
| S-EPMC9354733 | biostudies-literature

Penalized logistic regression with low prevalence exposures beyond high dimensional settings.
| S-EPMC6527211 | biostudies-literature

Logistic regression error-in-covariate models for longitudinal high-dimensional covariates.
| S-EPMC7654973 | biostudies-literature

Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models.
| S-EPMC8375316 | biostudies-literature

Debiased inference for heterogeneous subpopulations in a high-dimensional logistic regression model.
| S-EPMC10713553 | biostudies-literature

Penalized logistic regression for high-dimensional DNA methylation data with case-control studies.
| S-EPMC3348559 | biostudies-literature

BayesAge: A Maximum Likelihood Algorithm To Predict Epigenetic Age
2024-03-20 | GSE261769 | GEO