Dataset Information

Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study.

ABSTRACT: BACKGROUND: As medical research based on big data has become more common, the community's interest in and efforts toward analyzing large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, such large-scale text data are often not readily usable for analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE: In this paper, we propose an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluate its performance with medical examination text data. METHODS: The proposed data cleaning process consists of text clustering and value-converting. In the text clustering step, we suggest using the key collision and nearest neighbor methods in a complementary manner. The words (called values) in the same cluster are expected to be a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into their correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion. RESULTS: A total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and for typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the total number of words. CONCLUSIONS: Our data cleaning process, based on the combined use of the key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
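The two clustering steps and the value-converting step described in METHODS can be illustrated with a minimal Python sketch. This is not the authors' implementation and not OpenRefine's code: the function names (`fingerprint`, `levenshtein`, `cluster_values`), the `radius` parameter, and the rule of picking the most frequent value as the "correct" one are all assumptions made for illustration. In the study, a person chose the correct value with the help of OpenRefine's common-value suggestion.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Key-collision key: ignore case, accents, punctuation, and token order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_values(values, radius=1):
    """Return a mapping from each distinct value to its cluster's canonical value.

    Two values are linked when their keys collide (key collision) or when
    their keys are within `radius` edits (nearest neighbor). The canonical
    value is the most frequent member, which is an illustrative assumption.
    """
    counts = Counter(values)
    distinct = list(counts)
    keys = {v: fingerprint(v) for v in distinct}
    parent = {v: v for v in distinct}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, a in enumerate(distinct):
        for b in distinct[i + 1:]:
            if keys[a] == keys[b] or levenshtein(keys[a], keys[b]) <= radius:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for v in distinct:
        clusters[find(v)].append(v)

    mapping = {}
    for members in clusters.values():
        canonical = max(members, key=lambda m: counts[m])
        for m in members:
            mapping[m] = canonical
    return mapping


if __name__ == "__main__":
    # Made-up example values, not taken from the study's data.
    reports = ["Negative"] * 3 + ["Negtive", "negative."]
    mapping = cluster_values(reports)
    cleaned = [mapping[v] for v in reports]  # value-converting step
    print(cleaned)
```

The pairwise comparison is quadratic in the number of distinct values, which is acceptable only because distinct values are far fewer than total words. A small `radius` can also merge short, genuinely different words, so the resulting clusters should be reviewed before conversion.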

SUBMITTER: Woo H

PROVIDER: S-EPMC6329435 | biostudies-other | 2019 Jan

REPOSITORIES: biostudies-other

Publications

Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study.

Hyunki Woo, Kyunga Kim, KyeongMin Cha, Jin-Young Lee, Hansong Mun, Soo Jin Cho, Ji In Chung, Jeung Hui Pyo, Kun-Chul Lee, Mira Kang

Journal of Medical Internet Research, 2019-01-08, issue 1

PMID: 30622098

Similar Datasets

Project description: BACKGROUND: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping. METHODS: EasyCluster uses the well-known GMAP program in order to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts by building genomic and EST local databases and running GMAP. Subsequently, it parses the results, creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and grouping together ESTs sharing at least one splice site. RESULTS: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant, for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.

Project description: BACKGROUND: Some medications carry increased risk of patient harm when they are given in error. In incident reports, the names of the medications involved in errors may be written in a specific medication field, within the free text description of the incident, or both. Analysing only the names of the medications implicated in a specific unstructured medication field does not give information about the associated factors and risk areas, but when analysing unstructured free text descriptions, the information about the medication involved and associated risk factors may be buried within other non-relevant text. Thus, the aim of this study was to extract the medication names most commonly used in free text descriptions of medication administration incident reports and to identify the terms most frequently associated with risk for each of these medications using text mining. METHOD: Free text descriptions of medication administration incidents (n = 72,390) reported in 2016 to the National Reporting and Learning System for England and Wales were analysed using SAS® Text Miner. Analysis included text parsing and filtering of free text to identify the most commonly mentioned medications, followed by concept linking and clustering to identify terms associated with commonly mentioned medications and the associated risk areas. RESULTS: The following risk areas related to medications were identified: 1. Allergic reactions to antibacterial drugs, 2. Intravenous administration of antibacterial drugs, 3. Fentanyl patches, 4. Checking and documenting of analgesic doses, 5. Checking doses of anticoagulants, 6. Insulin doses and blood glucose, 7. Administration of intravenous infusions. CONCLUSIONS: Interventions to increase medication administration safety should focus on checking patient allergies and medication doses, especially for intravenous and transdermal medications. High-risk medications include insulin, analgesics, antibacterial drugs, anticoagulants, and potassium chloride. Text mining may be useful for analysing large free text datasets and should be developed further.
