Dataset Information

Hierarchical confounder discovery in the experiment-machine learning cycle.


ABSTRACT: The promise of machine learning (ML) to extract insights from high-dimensional datasets is tempered by confounding variables. It behooves scientists to determine if a model has extracted the desired information or instead fallen prey to bias. Due to features of natural phenomena and experimental design constraints, bioscience datasets are often organized in nested hierarchies that obfuscate the origins of confounding effects and render confounder amelioration methods ineffective. We propose a non-parametric statistical method called the rank-to-group (RTG) score that identifies hierarchical confounder effects in raw data and ML-derived embeddings. We show that RTG scores correctly assign the effects of hierarchical confounders when linear methods fail. In a public biomedical image dataset, we discover unreported effects of experimental design. We then use RTG scores to discover crossmodal correlated variability in a multi-phenotypic biological dataset. This approach should be generally useful in experiment-analysis cycles and to ensure confounder robustness in ML models.

SUBMITTER: Rogozhnikov A 

PROVIDER: S-EPMC9024009 | biostudies-literature | 2022 Apr

REPOSITORIES: biostudies-literature

Publications

Hierarchical confounder discovery in the experiment-machine learning cycle.

Alex Rogozhnikov, Pavan Ramkumar, Rishi Bedi, Saul Kato, G. Sean Escola

Patterns (New York, N.Y.), 2022-02-22, 4


Similar Datasets

2022-10-01 | GSE200096 | GEO
2020-09-01 | E-MTAB-9501 | biostudies-arrayexpress
| S-EPMC6137445 | biostudies-other
| S-EPMC7875251 | biostudies-literature
| S-EPMC7660369 | biostudies-literature
| PRJNA822943 | ENA
| S-EPMC10257182 | biostudies-literature
| S-EPMC7435601 | biostudies-literature
| S-EPMC9310517 | biostudies-literature
| S-EPMC6428806 | biostudies-literature