Dataset Information

Active label cleaning for improved dataset quality under resource constraints.


ABSTRACT: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
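
The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so the one below is an assumption: it uses the cross-entropy of the current label under a model's predicted class probabilities as a proxy for label incorrectness, and subtracts the predictive entropy as a proxy for labelling difficulty, so that likely-wrong but unambiguous samples come first. The function names `relabel_priority` and `rank_for_relabelling` and the toy inputs are hypothetical.

```python
import math


def relabel_priority(noisy_label, probs, eps=1e-12):
    """Return a priority score; higher means 'send to an expert sooner'.

    Assumed scoring rule (not stated in the abstract):
    cross-entropy of the current label minus predictive entropy.
    """
    # Proxy for label incorrectness: large when the model assigns
    # low probability to the label currently stored in the dataset.
    label_cross_entropy = -math.log(max(probs[noisy_label], eps))
    # Proxy for labelling difficulty: large when the model is unsure
    # which class the sample belongs to.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return label_cross_entropy - entropy


def rank_for_relabelling(noisy_labels, predicted_probs):
    """Return sample indices sorted from highest to lowest priority."""
    scores = [
        relabel_priority(label, probs)
        for label, probs in zip(noisy_labels, predicted_probs)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example with three samples and two classes (made-up numbers).
labels = [0, 1, 0]
probs = [[0.9, 0.1], [0.95, 0.05], [0.5, 0.5]]
order = rank_for_relabelling(labels, probs)
```

In this toy case, sample 1 ranks first because the model confidently disagrees with its label, the ambiguous sample 2 comes next, and sample 0, whose label the model confidently agrees with, comes last. A fixed relabelling budget would then be spent from the top of `order` downwards.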

SUBMITTER: Bernhardt M 

PROVIDER: S-EPMC8897392 | biostudies-literature | 2022 Mar

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC8169648 | biostudies-literature
| S-EPMC5516002 | biostudies-literature
| S-EPMC10496030 | biostudies-literature
| S-EPMC10170808 | biostudies-literature
| S-EPMC10217850 | biostudies-literature
| S-EPMC4324899 | biostudies-literature
| S-EPMC7923608 | biostudies-literature
| S-EPMC3985322 | biostudies-literature
| S-EPMC5764548 | biostudies-literature
| PRJNA930680 | ENA