Dataset Information

Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

ABSTRACT:

Objective

Feature engineering is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantics of medical concepts, thus are useful for retrieving relevant medical features in phenotyping tasks. We compared the effectiveness of MCEs learned from knowledge graphs and electronic healthcare records (EHR) data in retrieving relevant medical features for phenotyping tasks.

Materials and methods

We implemented 5 embedding methods including node2vec, singular value decomposition (SVD), LINE, skip-gram, and GloVe with 2 data sources: (1) knowledge graphs obtained from the observational medical outcomes partnership (OMOP) common data model; and (2) patient-level data obtained from the OMOP compatible electronic health records (EHR) from Columbia University Irving Medical Center (CUIMC). We used phenotypes with their relevant concepts developed and validated by the electronic medical records and genomics (eMERGE) network to evaluate the performance of learned MCEs in retrieving phenotype-relevant concepts. Hits@k% in retrieving phenotype-relevant concepts based on a single and multiple seed concept(s) was used to evaluate MCEs.

Results

Among all MCEs, MCEs learned by using node2vec with knowledge graphs showed the best performance. Of MCEs based on knowledge graphs and EHR data, MCEs learned by using node2vec with knowledge graphs and MCEs learned by using GloVe with EHR data outperforms other MCEs, respectively.

Conclusion

MCE enables scalable feature engineering tasks, thereby facilitating phenotyping. Based on current phenotyping practices, MCEs learned by using knowledge graphs constructed by hierarchical relationships among medical concepts outperformed MCEs learned by using EHR data.

SUBMITTER: Lee J

PROVIDER: S-EPMC8206403 | biostudies-literature | 2021 Apr

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

Lee Junghwan J Liu Cong C Kim Jae Hyun JH Butler Alex A Shang Ning N Pang Chao C Natarajan Karthik K Ryan Patrick P Ta Casey C Weng Chunhua C

JAMIA open 20210401 2

<h4>Objective</h4>Feature engineering is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantics of medical concepts, thus are useful for retrieving relevant medical features in phenotyping tasks. We compared the effectiveness of MCEs learned from knowledge graphs and electronic healthcare records (EHR) data in retrieving relevant medical features for phenotyping tasks.<h4>Materials and methods</h4>We implemented 5 embedding methods including n ...[more]

PMID: 34142015

Similar Datasets

Project description:BackgroundThe protection of private data is a key responsibility for research studies that collect identifiable information from study participants. Limiting the scope of data collection and preventing secondary use of the data are effective strategies for managing these risks. An ideal framework for data collection would incorporate feature engineering, a process where secondary features are derived from sensitive raw data in a secure environment without a trusted third party.ObjectiveThis study aimed to compare current approaches based on how they maintain data privacy and the practicality of their implementations. These approaches include traditional approaches that rely on trusted third parties, and cryptographic, secure hardware, and blockchain-based techniques.MethodsA set of properties were defined for evaluating each approach. A qualitative comparison was presented based on these properties. The evaluation of each approach was framed with a use case of sharing geolocation data for biomedical research.ResultsWe found that approaches that rely on a trusted third party for preserving participant privacy do not provide sufficiently strong guarantees that sensitive data will not be exposed in modern data ecosystems. Cryptographic techniques incorporate strong privacy-preserving paradigms but are appropriate only for select use cases or are currently limited because of computational complexity. Blockchain smart contracts alone are insufficient to provide data privacy because transactional data are public. Trusted execution environments (TEEs) may have hardware vulnerabilities and lack visibility into how data are processed. Hybrid approaches combining blockchain and cryptographic techniques or blockchain and TEEs provide promising frameworks for privacy preservation. For reference, we provide a software implementation where users can privately share features of their geolocation data using the hybrid approach combining blockchain with TEEs as a supplement.ConclusionsBlockchain technology and smart contracts enable the development of new privacy-preserving feature engineering methods by obviating dependence on trusted parties and providing immutable, auditable data processing workflows. The overlap between blockchain and cryptographic techniques or blockchain and secure hardware technologies are promising fields for addressing important data privacy needs. Hybrid blockchain and TEE frameworks currently provide practical tools for implementing experimental privacy-preserving applications.

Project description:ObjectivesEvidence-based medicine depends on the timely synthesis of research findings. An important source of synthesized evidence resides in systematic reviews. However, a bottleneck in review production involves dual screening of citations with titles and abstracts to find eligible studies. For this research, we tested the effect of various kinds of textual information (features) on performance of a machine learning classifier. Based on our findings, we propose an automated system to reduce screeing burden, as well as offer quality assurance.MethodsWe built a database of citations from 5 systematic reviews that varied with respect to domain, topic, and sponsor. Consensus judgments regarding eligibility were inferred from published reports. We extracted 5 feature sets from citations: alphabetic, alphanumeric(+), indexing, features mapped to concepts in systematic reviews, and topic models. To simulate a two-person team, we divided the data into random halves. We optimized the parameters of a Bayesian classifier, then trained and tested models on alternate data halves. Overall, we conducted 50 independent tests.ResultsAll tests of summary performance (mean F3) surpassed the corresponding baseline, P<0.0001. The ranks for mean F3, precision, and classification error were statistically different across feature sets averaged over reviews; P-values for Friedman's test were .045, .002, and .002, respectively. Differences in ranks for mean recall were not statistically significant. Alphanumeric(+) features were associated with best performance; mean reduction in screening burden for this feature type ranged from 88% to 98% for the second pass through citations and from 38% to 48% overall.ConclusionsA computer-assisted, decision support system based on our methods could substantially reduce the burden of screening citations for systematic review teams and solo reviewers. Additionally, such a system could deliver quality assurance both by confirming concordant decisions and by naming studies associated with discordant decisions for further consideration.

Dataset Information

Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

Objective

Materials and methods

Results

Conclusion

Publications

Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets