Dataset Information

Developing a Standardization Algorithm for Categorical Laboratory Tests for Clinical Big Data Research: Retrospective Study.

ABSTRACT:

BACKGROUND: Data standardization is essential in electronic health records (EHRs) for both clinical practice and retrospective research. However, EHR data remain difficult to standardize because of nonidentical duplicates, typographical errors, and inconsistencies. To address this, standardization efforts have targeted both the collection of data in a standardized format and the curation of data already stored in EHRs. For clinical big data research, the stored EHR data should be standardized, starting with laboratory results, given their importance. However, most previous efforts have relied on labor-intensive manual methods.

OBJECTIVE: We aimed to develop an automatic standardization method that removes noise from categorical laboratory data, groups the cleaned data, and maps them to standard terminology.

METHODS: We developed a method called standardization algorithm for laboratory test-categorical result (SALT-C) that can process categorical laboratory data, such as pos +, 250 4+ (urinalysis results), and reddish (urinalysis color results). SALT-C consists of 5 steps. First, it applies data cleaning rules to categorical laboratory data. Second, it categorizes the cleaned data into 5 predefined groups (urine color, urine dipstick, blood type, presence-finding, and pathogenesis tests). Third, all data in each group are vectorized. Fourth, similarity is calculated between the vectors of data and those of each value in the predefined value sets. Finally, the value closest to the data is assigned.

RESULTS: The performance of SALT-C was validated using 59,213,696 data points (167,938 unique values) generated over 23 years at a tertiary hospital. Apart from the data whose original meaning could not be interpreted correctly (eg, ** and _^), SALT-C mapped unique raw data to the correct reference value for each group with an accuracy of 97.6% (123/126; urine color tests), 97.5% (198/203; urine dipstick tests), 95% (53/56; blood type tests), 99.68% (162,291/162,805; presence-finding tests), and 99.61% (4643/4661; pathogenesis tests).

CONCLUSIONS: The proposed SALT-C standardized the categorical laboratory test results with high reliability. SALT-C can benefit clinical big data research by reducing laborious manual standardization efforts.
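The clean, vectorize, compare, and assign steps described in the METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the cleaning rules, the vectorization scheme, or the similarity measure, so the sketch assumes character bigram counts and cosine similarity. The names `VALUE_SETS`, `clean`, `vectorize`, `cosine`, and `standardize` are hypothetical, as are the reference values listed. The grouping step is assumed to be done upstream, so the caller passes the group name directly.

```python
import math
import re
from collections import Counter
from typing import Optional

# Hypothetical reference value sets for two of the five groups.
# The actual SALT-C value sets are not given in the abstract.
VALUE_SETS = {
    "urine color": ["yellow", "red", "brown", "colorless"],
    "presence-finding": ["positive", "negative", "trace"],
}


def clean(raw: str) -> str:
    """Step 1 (assumed rules): lowercase, drop symbols except + and -, collapse spaces."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9+\- ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def vectorize(text: str, n: int = 2) -> Counter:
    """Step 3 (assumed representation): counts of character n-grams, space-padded."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Step 4 (assumed measure): cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def standardize(raw: str, group: str) -> Optional[str]:
    """Step 5: return the closest reference value in the group, or None if nothing matches."""
    cleaned = clean(raw)
    if not cleaned:
        return None  # e.g. "**" or "_^": nothing interpretable remains after cleaning
    vec = vectorize(cleaned)
    scored = [(cosine(vec, vectorize(ref)), ref) for ref in VALUE_SETS[group]]
    best_score, best_ref = max(scored)
    return best_ref if best_score > 0 else None
```

Under these assumptions, `standardize("Reddish", "urine color")` resolves to the hypothetical reference value `"red"`, and an input such as `"**"` yields `None`. A real pipeline would need a similarity threshold and curated value sets mapped to standard terminology.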

SUBMITTER: Kim M

PROVIDER: S-EPMC6740165 | biostudies-literature | 2019 Aug

REPOSITORIES: biostudies-literature

Publications

Developing a Standardization Algorithm for Categorical Laboratory Tests for Clinical Big Data Research: Retrospective Study.

Kim Mina, Shin Soo-Yong, Kang Mira, Yi Byoung-Kee, Chang Dong Kyung

JMIR Medical Informatics, 2019 Aug 29, issue 3

PMID: 31469075

Similar Datasets

Federated learning based futuristic biomedical big-data analysis and standardization.
| S-EPMC10550167 | biostudies-literature

LabRS: A Rosetta stone for retrospective standardization of clinical laboratory test results.
| S-EPMC6251547 | biostudies-literature

Variance estimation in tests of clustered categorical data with informative cluster size.
| S-EPMC11220780 | biostudies-literature

Including household effects in Big Data research: the experience of building a longitudinal residence algorithm using linked administrative data in Wales.
| S-EPMC7299488 | biostudies-literature

lab2clean: a novel algorithm for automated cleaning of retrospective clinical laboratory results data for secondary uses.
| S-EPMC11370074 | biostudies-literature

Genotype and phenotype data standardization, utilization and integration in the big data era for agricultural sciences.
| S-EPMC10712715 | biostudies-literature

A New Strategy for Evaluating the Quality of Laboratory Results for Big Data Research: Using External Quality Assessment Survey Data (2010-2020).
| S-EPMC10151270 | biostudies-literature

Is dementia research ready for big data approaches?
| S-EPMC4476175 | biostudies-literature

An undergraduate genome research course using "big data".
| S-EPMC10458672 | biostudies-literature

Tests on asymmetry for ordered categorical variables.
| S-EPMC9041824 | biostudies-literature