Dataset Information

Discriminative machine learning for maximal representative subsampling.


ABSTRACT: Biased population samples pose a prevalent problem in the social sciences. Therefore, we present two novel methods that are based on positive-unlabeled learning to mitigate bias. Both methods leverage auxiliary information from a representative data set and train machine learning classifiers to determine the sample weights. The first method, named maximum representative subsampling (MRS), uses a classifier to iteratively remove instances from the biased data set, by assigning them a sample weight of 0, until it aligns with the representative one. The second method, Soft-MRS, is a variant of MRS that iteratively adapts sample weights instead of removing samples completely. To assess the effectiveness of our approach, we induced artificial bias in a public census data set and examined the corrected estimates. We compare the performance of our methods against existing techniques, evaluating the ability of sample weights created with Soft-MRS or MRS to minimize differences from the representative data set and to improve downstream classification tasks. Lastly, we demonstrate the applicability of the proposed methods in a real-world study of resilience research, exploring the influence of resilience on voting behavior. Our work addresses the issue of bias in the social sciences, among other fields, and provides a versatile methodology for bias reduction based on machine learning. Based on our experiments, we recommend using MRS for downstream classification tasks and Soft-MRS for downstream tasks where the relative bias of the dependent variable is relevant.
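
The iterative removal loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional logistic classifier, the in-sample AUC stopping rule, and the names mrs_sketch, drop_per_round, and auc_tolerance are all assumptions made for illustration. The sketch trains a classifier to tell the biased sample apart from the representative one, assigns weight 0 to the biased instances it identifies most confidently, and stops once the two samples are roughly indistinguishable.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, steps=100, lr=0.5):
    """Fit p(y=1|x) = sigmoid(w*x + b) on 1-D data by gradient descent."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mrs_sketch(biased, representative, drop_per_round=10,
               auc_tolerance=0.05, max_rounds=100):
    """Return 0/1 weights for `biased` (hypothetical helper, illustration only)."""
    weights = [1] * len(biased)
    for _ in range(max_rounds):
        kept = [i for i, w in enumerate(weights) if w == 1]
        if len(kept) <= drop_per_round:
            break
        # Label 1 = still-kept biased instance, label 0 = representative instance.
        xs = [biased[i] for i in kept] + list(representative)
        ys = [1] * len(kept) + [0] * len(representative)
        w, b = fit_logistic(xs, ys)
        scores = [sigmoid(w * x + b) for x in xs]
        # Assumed stopping rule: the classifier is close to random guessing.
        if abs(auc(scores, ys) - 0.5) <= auc_tolerance:
            break
        # Drop the biased instances the classifier flags most confidently.
        ranked = sorted(kept, key=lambda i: sigmoid(w * biased[i] + b), reverse=True)
        for i in ranked[:drop_per_round]:
            weights[i] = 0
    return weights


# Synthetic demo data (assumption): half of the biased sample is shifted upward.
rng = random.Random(0)
representative = [rng.gauss(0, 1) for _ in range(200)]
biased = ([rng.gauss(0, 1) for _ in range(150)]
          + [rng.gauss(2, 1) for _ in range(150)])
weights = mrs_sketch(biased, representative)
kept = [x for x, w in zip(biased, weights) if w == 1]
print(len(biased), "instances,", len(kept), "kept")
print("mean before:", sum(biased) / len(biased), "mean after:", sum(kept) / len(kept))
```

For brevity the sketch scores the classifier on the data it was trained on. A more careful version would likely use held-out or cross-validated predictions, since in-sample scores can overstate how separable the two samples are.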

SUBMITTER: Hauptmann T 

PROVIDER: S-EPMC10684887 | biostudies-literature | 2023 Nov

REPOSITORIES: biostudies-literature

Publications

Discriminative machine learning for maximal representative subsampling.

Tony Hauptmann, Sophie Fellenz, Laksan Nathan, Oliver Tüscher, Stefan Kramer

Scientific Reports, 27 November 2023, issue 1



Similar Datasets

S-EPMC6694302 | biostudies-literature
S-EPMC7079150 | biostudies-literature
S-EPMC7806588 | biostudies-literature
S-EPMC8598947 | biostudies-literature
S-EPMC11208939 | biostudies-literature
S-EPMC7494202 | biostudies-literature
S-EPMC9117248 | biostudies-literature
S-EPMC10114129 | biostudies-literature
S-EPMC5612439 | biostudies-literature
S-EPMC8290033 | biostudies-literature