Dataset Information

The effect of data cleaning on record linkage quality.


ABSTRACT:

Background

Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality.

Methods

A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was measured using the pairwise F-measure.
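
The pairwise F-measure mentioned above can be illustrated with a minimal sketch. It assumes each linkage result is given as a mapping from record identifier to group identifier; the function names and the toy identifiers are hypothetical and are not taken from the study. Every pair of records placed in the same group counts as a link, precision and recall are computed over those pairs, and the F-measure is their harmonic mean.

```python
from collections import defaultdict
from itertools import combinations


def linked_pairs(assignment):
    """Return the set of unordered record pairs that share a group id."""
    groups = defaultdict(list)
    for record_id, group_id in assignment.items():
        groups[group_id].append(record_id)
    pairs = set()
    for members in groups.values():
        # Every two records in the same group form one linked pair.
        pairs.update(frozenset(p) for p in combinations(members, 2))
    return pairs


def pairwise_f_measure(truth, predicted):
    """Return (precision, recall, F-measure) computed over linked pairs."""
    true_pairs = linked_pairs(truth)
    found_pairs = linked_pairs(predicted)
    correct = len(true_pairs & found_pairs)
    precision = correct / len(found_pairs) if found_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: five records, two true entities.
truth = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
print(pairwise_f_measure(truth, predicted))
```

In this toy case the true links are a-b, a-c, b-c and d-e, while the predicted links are a-b, c-d, c-e and d-e. Two of the four predicted links are correct and two of the four true links are found, so precision, recall and F-measure are each 0.5.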

Results

Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability of the data: although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall.
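
The mechanism described above, where cleaning reduces variability so that both correct and incorrect records agree more often, can be shown with a toy sketch. Everything in it is an assumption made for illustration: the six surnames, the person labels, the two cleaning functions and exact-match comparison are hypothetical and do not reproduce the study's data, cleaning techniques or linkage method. The numbers show only the direction of the effect, not its size.

```python
import re
from itertools import combinations

# Hypothetical records: (record id, true person, surname as recorded).
RECORDS = [
    (1, "A", "O'Brien"),
    (2, "A", "OBrien"),
    (3, "B", "Obrist"),
    (4, "C", "Smith"),
    (5, "C", "SMITH "),
    (6, "D", "Smithers"),
]


def no_clean(name):
    """Leave the value exactly as recorded."""
    return name


def light_clean(name):
    """Lowercase and drop every character that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", name.lower())


def heavy_clean(name):
    """Light cleaning followed by truncation to the first three letters."""
    return light_clean(name)[:3]


def count_matches(records, clean):
    """Count exact-match pairs after cleaning as (correct, incorrect)."""
    correct = incorrect = 0
    for (_, person1, name1), (_, person2, name2) in combinations(records, 2):
        if clean(name1) == clean(name2):
            if person1 == person2:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect


for label, clean in [("none", no_clean), ("light", light_clean), ("heavy", heavy_clean)]:
    distinct = len({clean(name) for _, _, name in RECORDS})
    print(label, distinct, count_matches(RECORDS, clean))
```

On this toy data the number of distinct surname values falls from 6 (no cleaning) to 4 (light) to 2 (heavy). Heavy cleaning finds the same 2 correct pairs as light cleaning but adds 4 incorrect pairs, because distinct people now share the truncated values "obr" and "smi".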

Conclusions

Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.

SUBMITTER: Randall SM 

PROVIDER: S-EPMC3688507 | biostudies-literature | 2013 Jun

REPOSITORIES: biostudies-literature

Publications

The effect of data cleaning on record linkage quality.

Randall SM, Ferrante AM, Boyd JH, Semmens JB

BMC Medical Informatics and Decision Making, 2013-06-05


Similar Datasets

| S-EPMC9403736 | biostudies-literature
| S-EPMC10935812 | biostudies-literature
| S-EPMC5005943 | biostudies-literature
| S-EPMC6759179 | biostudies-literature
| S-EPMC6200350 | biostudies-literature
| S-EPMC5808931 | biostudies-literature
| S-EPMC4267104 | biostudies-other
| S-EPMC5905951 | biostudies-literature
| S-EPMC4545338 | biostudies-literature
| S-EPMC4487251 | biostudies-other