Dataset Information

Discriminative machine learning for maximal representative subsampling.

ABSTRACT: Biased population samples pose a prevalent problem in the social sciences. Therefore, we present two novel methods that are based on positive-unlabeled learning to mitigate bias. Both methods leverage auxiliary information from a representative data set and train machine learning classifiers to determine the sample weights. The first method, named maximum representative subsampling (MRS), uses a classifier to iteratively remove instances from the biased data set, by assigning them a sample weight of 0, until it aligns with the representative one. The second method, Soft-MRS, is a variant of MRS that iteratively adapts sample weights instead of removing samples completely. To assess the effectiveness of our approach, we induced artificial bias in a public census data set and examined the corrected estimates. We compare the performance of our methods against existing techniques, evaluating the ability of sample weights created with Soft-MRS or MRS to minimize differences between the two data sets and improve downstream classification tasks. Lastly, we demonstrate the applicability of the proposed methods in a real-world study of resilience research, exploring the influence of resilience on voting behavior. Through our work, we address the issue of bias in the social sciences and other fields, and provide a versatile methodology for bias reduction based on machine learning. Based on our experiments, we recommend using MRS for downstream classification tasks and Soft-MRS for downstream tasks where the relative bias of the dependent variable is relevant.
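The abstract describes MRS only at a high level, so the following is a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It trains a classifier to tell the biased sample from the representative one, then repeatedly assigns weight 0 to the biased instances the classifier most confidently recognizes, until the two samples can no longer be separated. The names `mrs_weights`, `drop_per_round`, and `tol` are hypothetical, as are the choice of a hand-written logistic regression as the classifier and of an in-sample AUC close to 0.5 as the stopping rule.

```python
import math
import random


def predict(w, b, x):
    """Probability that x comes from the biased sample (label 1)."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Logistic regression by batch gradient descent (assumed stand-in classifier).

    Assumes features on a roughly unit scale; rescale real data first.
    """
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """Pairwise AUC; a tie counts as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mrs_weights(biased, representative, drop_per_round=5, tol=0.05, max_rounds=100):
    """Return 0/1 weights for `biased`; 0 marks a removed instance."""
    keep = list(range(len(biased)))
    for _ in range(max_rounds):
        if len(keep) <= drop_per_round:
            break
        X = [biased[i] for i in keep] + list(representative)
        y = [1] * len(keep) + [0] * len(representative)
        w, b = fit_logistic(X, y)
        scores = [predict(w, b, x) for x in X]
        # Assumed stopping rule: in-sample AUC near 0.5 means the classifier
        # can no longer tell the two samples apart.
        if abs(auc(scores, y) - 0.5) <= tol:
            break
        # Drop the kept instances that look most typical of the biased sample.
        ranked = sorted(keep, key=lambda i: predict(w, b, biased[i]), reverse=True)
        dropped = set(ranked[:drop_per_round])
        keep = [i for i in keep if i not in dropped]
    kept = set(keep)
    return [1 if i in kept else 0 for i in range(len(biased))]


# Synthetic illustration: one feature, biased sample shifted upward by 1.
random.seed(0)
representative = [[random.gauss(0.0, 1.0)] for _ in range(100)]
biased = [[random.gauss(1.0, 1.0)] for _ in range(100)]
weights = mrs_weights(biased, representative)
```

The returned list can be passed as sample weights to a downstream estimator. A more careful version would score the classifier on held-out data rather than in-sample, and a Soft-MRS-style variant would shrink weights gradually instead of setting them to 0.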

SUBMITTER: Hauptmann T

PROVIDER: S-EPMC10684887 | biostudies-literature | 2023 Nov

REPOSITORIES: biostudies-literature

Publications

Discriminative machine learning for maximal representative subsampling.

Tony Hauptmann, Sophie Fellenz, Laksan Nathan, Oliver Tüscher, Stefan Kramer

Scientific Reports, 2023-11-27, issue 1

PMID: 38017053

Similar Datasets

Representative subsampling of sedimenting blood.
| S-EPMC6694302 | biostudies-literature

Machine Learning of Discriminative Gate Locations for Clinical Diagnosis.
| S-EPMC7079150 | biostudies-literature

Deep learning encodes robust discriminative neuroimaging representations to outperform standard machine learning.
| S-EPMC7806588 | biostudies-literature

Metaviromic identification of discriminative genomic features in SARS-CoV-2 using machine learning.
| S-EPMC8598947 | biostudies-literature

Discriminative feature analysis of dairy products based on machine learning algorithms and Raman spectroscopy.
| S-EPMC11208939 | biostudies-literature

Generative-Discriminative Complementary Learning.
| S-EPMC7494202 | biostudies-literature

Intelligent career planning via stochastic subsampling reinforcement learning.
| S-EPMC9117248 | biostudies-literature

Nonexercise machine learning models for maximal oxygen uptake prediction in national population surveys.
| S-EPMC10114129 | biostudies-literature

Joint Discriminative and Representative Feature Selection for Alzheimer's Disease Diagnosis.
| S-EPMC5612439 | biostudies-literature

An integrated machine learning framework for a discriminative analysis of schizophrenia using multi-biological data.
| S-EPMC8290033 | biostudies-literature