Dataset Information

Detecting the impact of subject characteristics on machine learning-based diagnostic applications.

ABSTRACT: Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets ("record-wise" data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of "identity confounding." In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
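
The difference between a record-wise split and a split that keeps each individual's records together can be illustrated with a minimal sketch. This is not the authors' method for quantifying identity confounding; it only contrasts the two ways of splitting. The toy data, the `subject_id` field, and the function names are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def record_wise_split(records, test_fraction=0.25, seed=0):
    """Shuffle individual records, so one subject can land in both sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction=0.25, seed=0):
    """Assign every record of a subject to the same set (train or test)."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["subject_id"]].append(rec)
    subjects = sorted(by_subject)
    rng.shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    train = [r for s in subjects[n_test:] for r in by_subject[s]]
    test = [r for s in subjects[:n_test] for r in by_subject[s]]
    return train, test


def shared_subjects(train, test):
    """Subjects that contribute records to both the training and test sets."""
    return {r["subject_id"] for r in train} & {r["subject_id"] for r in test}


# Hypothetical toy data: 10 subjects with 20 repeated records each.
records = [{"subject_id": s, "record": i} for s in range(10) for i in range(20)]

rw_train, rw_test = record_wise_split(records)
sw_train, sw_test = subject_wise_split(records)

print("record-wise, subjects in both sets:", len(shared_subjects(rw_train, rw_test)))
print("subject-wise, subjects in both sets:", len(shared_subjects(sw_train, sw_test)))
```

In the record-wise case a classifier can score well on the test set partly by recognizing individuals it has already seen, which is the identity confounding the abstract describes. Keeping each subject's records in a single set removes that shortcut.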

SUBMITTER: Chaibub Neto E

PROVIDER: S-EPMC6789029 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

Detecting the impact of subject characteristics on machine learning-based diagnostic applications.

Elias Chaibub Neto, Abhishek Pratap, Thanneer M Perumal, Meghasyam Tummalacherla, Phil Snyder, Brian M Bot, Andrew D Trister, Stephen H Friend, Lara Mangravite, Larsson Omberg

NPJ Digital Medicine, 2019-10-11

PMID: 31633058

Similar Datasets

Project description: Background: To evaluate the association of multiparametric and multiregional MRI features with key molecular characteristics in patients with newly diagnosed glioblastoma. Methods: Retrospective data evaluation was approved by the local ethics committee of the University of Heidelberg (ethics approval number: S-320/2012) and informed consent was waived. Preoperative MRI features were correlated with key molecular characteristics within a single-institutional cohort of 152 patients with newly diagnosed glioblastoma. Preoperative MRI features (n=31) included multiparametric (anatomical, diffusion-, perfusion-, and susceptibility-weighted images) and multiregional (contrast-enhancing and non-enhancing FLAIR-hyperintense) information with (histogram) quantification of tumor volumes, volume ratios, apparent diffusion coefficients, cerebral blood flow / volume (CBF / CBV) and intratumoral susceptibility signals. Molecular characteristics determined with the Illumina Infinium HumanMethylation450 array included global DNA-methylation subgroups (e.g. mesenchymal (MES), RTK I “PDGFRA”, RTK II “classic”), MGMT-promoter methylation status and hallmark copy-number variations (EGFR-, PDGFRA-, MDM4- and CDK4-amplification; PTEN-, CDKN2A-, NF1- and RB1-loss). Univariate analyses (voxel-lesion-symptom mapping for tumor location, Wilcoxon test for all other MRI features) as well as machine-learning models were applied to study the strength of association and discriminative value of MRI features for predicting underlying molecular characteristics. Results: There was no tumor location predilection for any of the assessed molecular parameters (permutation-adjusted p>0.05 each). Univariate imaging parameter associations were noted for EGFR amplification and CDKN2A loss, both demonstrating increased nrCBV and nrCBF values (performance of these parameters, as assessed by the area under the ROC curve, ranged from 63 to 69%, FDR-adjusted p<0.05, respectively). Subjecting all MRI features to machine-learning-based classification allowed prediction of EGFR amplification status and the RTK II “classic” GB subgroup with a moderate, yet significantly greater accuracy (63% for EGFR [p<0.01] and 61% for RTK II [p=0.01]) than prediction by chance, whereas prediction accuracy for all other molecular parameters was non-significant (p>0.05, all models). Conclusions: In summary, we found univariate associations between established MRI features and molecular characteristics; however, these were not of sufficient strength to allow the generation of machine-learning classification models for reliable and clinically meaningful prediction of the assessed molecular characteristics in patients with newly diagnosed glioblastoma.
