Dataset Information

Learning statistical models of phenotypes using noisy labeled training data.

ABSTRACT:

Objective

Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.

Methods

We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1-penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.
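
As a rough illustration of the two steps described above (keyword-based noisy labeling, then an L1-penalized logistic regression), the following minimal sketch may help. It is not the authors' pipeline: the toy notes, the keyword list, the two feature columns, the hyperparameters, and all function names are hypothetical, and the optimizer shown (proximal gradient descent with soft-thresholding) is simply one common way to fit an L1-penalized model.

```python
import math

def noisy_label(note, keywords):
    """Label a note 1 if it mentions any phenotype keyword, else 0 (a noisy label)."""
    text = note.lower()
    return int(any(keyword in text for keyword in keywords))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a weight toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def predict_proba(weights, bias, x):
    """Predicted probability of the phenotype for one feature vector."""
    return sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)

def fit_l1_logistic(X, y, lam=0.1, lr=0.5, epochs=300):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step followed by soft-thresholding, which yields exact zeros.
        weights = [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical keyword list and toy records: (note text, [feature_0, feature_1]).
# feature_0 stands in for an informative record feature, feature_1 for an unrelated one.
keywords = ["type 2 diabetes", "t2dm"]
records = [
    ("History of type 2 diabetes, on metformin.", [1, 0]),
    ("T2DM follow-up visit.", [1, 1]),
    ("Type 2 diabetes mellitus, stable.", [1, 0]),
    ("Known T2DM, diet controlled.", [1, 1]),
    ("Ankle sprain after fall.", [0, 0]),
    ("Seasonal allergies.", [0, 1]),
    ("Routine physical exam.", [0, 0]),
    ("Migraine, no aura.", [0, 1]),
]

labels = [noisy_label(note, keywords) for note, _ in records]
features = [x for _, x in records]
weights, bias = fit_l1_logistic(features, labels)
print("noisy labels:", labels)
print("weights:", weights, "bias:", bias)
```

In this toy setup the L1 penalty keeps the weight of the unrelated feature at zero while the informative feature receives a positive weight, which is the kind of sparsity that motivates an L1 penalty when the patient record is represented by many features.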

Results

Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90 and 0.89, and 0.86 and 0.89, respectively. Local implementations of the previously validated rule-based definitions for the same two phenotypes achieve precision and accuracy of 0.96 and 0.92, and 0.84 and 0.87, respectively. We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.
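
For readers less familiar with the two metrics reported above, the sketch below shows how precision and accuracy could be computed from model predictions and gold-standard labels. The function name and the ten example labels are made up for illustration and are not taken from the study.

```python
def precision_and_accuracy(predicted, gold):
    """Return (precision, accuracy) for binary predictions against gold-standard labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)  # true positives
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)  # false positives
    tn = sum(1 for p, g in pairs if p == 0 and g == 0)  # true negatives
    # Precision: share of predicted cases that are true cases.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    # Accuracy: share of all patients classified correctly.
    accuracy = (tp + tn) / len(pairs)
    return precision, accuracy

# Hypothetical gold-standard labels and model predictions for ten patients.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

precision, accuracy = precision_and_accuracy(predicted, gold)
print("precision:", precision, "accuracy:", accuracy)
```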

Conclusions

Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.

SUBMITTER: Agarwal V

PROVIDER: S-EPMC5070523 | biostudies-literature | 2016 Nov

REPOSITORIES: biostudies-literature

Publications

Learning statistical models of phenotypes using noisy labeled training data.

Agarwal Vibhu, Podchiyska Tanya, Banda Juan M, Goel Veena, Leung Tiffany I, Minty Evan P, Sweeney Timothy E, Gyang Elsie, Shah Nigam H

Journal of the American Medical Informatics Association : JAMIA, 2016-05-12, 6

PMID: 27174893

Similar Datasets

Deep kernel learning of dynamical models from high-dimensional noisy data | S-EPMC9747975 | biostudies-literature

Learning partial differential equations for biological transport models from noisy spatio-temporal data. | S-EPMC7069483 | biostudies-literature

Augmenting small tabular health data for training prognostic ensemble machine learning models using generative models. | S-EPMC12661835 | biostudies-literature

Evaluating deep learning models for classifying OCT images with limited data and noisy labels. | S-EPMC11621707 | biostudies-literature

Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. | S-EPMC2528956 | biostudies-literature

Machine Learning Classifier Models Can Identify Acute Respiratory Distress Syndrome Phenotypes Using Readily Available Clinical Data. | S-EPMC7528785 | biostudies-literature

Evaluation of statistical approaches for association testing in noisy drug screening data. | S-EPMC9118710 | biostudies-literature

Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding. | S-EPMC9063120 | biostudies-literature

Keyphrase Identification Using Minimal Labeled Data with Hierarchical Context and Transfer Learning. | S-EPMC10246160 | biostudies-literature

Rapid Synthesis of Cryo-ET Data for Training Deep Learning Models. | S-EPMC10168359 | biostudies-literature