Dataset Information

Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications.

ABSTRACT:

Objective

With advances in data availability and computing capabilities, artificial intelligence and machine learning technologies have evolved rapidly in recent years. Researchers have taken advantage of these developments in healthcare informatics and created reliable tools to predict or classify diseases using machine learning-based algorithms. To correctly quantify the performance of those algorithms, the standard approach is to use cross-validation, where the algorithm is trained on a training set, and its performance is measured on a validation set. Both datasets should be subject-independent to simulate the expected behavior of a clinical study. This study compares two cross-validation strategies, the subject-wise and the record-wise techniques; the subject-wise strategy correctly mimics the process of a clinical study, while the record-wise strategy does not.
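The difference between the two strategies can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy subject identifiers, and the fold count are assumptions made for illustration. Record-wise folds shuffle individual recordings, so one subject's recordings can land in both the training and validation folds. Subject-wise folds assign whole subjects to a fold.

```python
import random
from collections import defaultdict


def record_wise_folds(n_records, k, seed=0):
    """Shuffle record indices and deal them into k folds, ignoring subjects."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def subject_wise_folds(subject_ids, k, seed=0):
    """Shuffle subjects, deal them into k folds, then expand to record indices."""
    by_subject = defaultdict(list)
    for i, s in enumerate(subject_ids):
        by_subject[s].append(i)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    return [[i for s in subjects[f::k] for i in by_subject[s]] for f in range(k)]


def shared_subjects(folds, subject_ids):
    """For each validation fold, list subjects that also occur in its training folds."""
    out = []
    for f, val in enumerate(folds):
        val_subj = {subject_ids[i] for i in val}
        train_subj = {
            subject_ids[i]
            for g, fold in enumerate(folds) if g != f
            for i in fold
        }
        out.append(val_subj & train_subj)
    return out


# Hypothetical toy data: 6 subjects with 4 recordings each.
subject_ids = [f"S{n}" for n in range(6) for _ in range(4)]
rw = record_wise_folds(len(subject_ids), k=3)
sw = subject_wise_folds(subject_ids, k=3)

print("record-wise overlap per fold:", shared_subjects(rw, subject_ids))
print("subject-wise overlap per fold:", shared_subjects(sw, subject_ids))
```

The subject-wise overlap is empty for every fold by construction. In practice, a library splitter that respects groups, such as scikit-learn's GroupKFold, serves the same purpose.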

Methods

We started by creating a dataset of smartphone audio recordings of subjects diagnosed with and without Parkinson's disease. This dataset was then divided into training and holdout sets using both subject-wise and record-wise divisions. On the training set, we measured the performance of two classifiers (support vector machine and random forest) under six cross-validation techniques, each of which simulated either the subject-wise or the record-wise process. The holdout set was used to calculate the true error of the classifiers.
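A subject-wise training/holdout division of the kind described above might be sketched as follows. The function name, the 20% holdout fraction, and the toy data are assumptions for illustration; the abstract does not state the split proportions used in the study.

```python
import random


def subject_wise_holdout(subject_ids, holdout_fraction=0.2, seed=0):
    """Reserve whole subjects for the holdout set; return (train, holdout) record indices."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Assumed rule: at least one subject is always held out.
    n_holdout = max(1, round(len(subjects) * holdout_fraction))
    holdout_subjects = set(subjects[:n_holdout])
    train = [i for i, s in enumerate(subject_ids) if s not in holdout_subjects]
    holdout = [i for i, s in enumerate(subject_ids) if s in holdout_subjects]
    return train, holdout


# Hypothetical toy data: 10 subjects with 3 recordings each.
subject_ids = [f"S{n}" for n in range(10) for _ in range(3)]
train_idx, holdout_idx = subject_wise_holdout(subject_ids)

train_subjects = {subject_ids[i] for i in train_idx}
holdout_subjects = {subject_ids[i] for i in holdout_idx}
print(len(train_idx), len(holdout_idx), train_subjects & holdout_subjects)
```

Because no subject contributes recordings to both sets, the error measured on the holdout set reflects performance on unseen subjects, which is what a cross-validation estimate computed on the training set is meant to approximate.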

Results

The record-wise division and the record-wise cross-validation techniques overestimated the performance of the classifiers and underestimated the classification error.

Conclusions

In a diagnostic scenario, the subject-wise technique is the proper way of estimating a model's performance, and record-wise techniques should be avoided.

SUBMITTER: Tougui I

PROVIDER: S-EPMC8369053 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Background: Dental plaque microbes play a key role in the development of periodontal disease. Numerous high-throughput sequencing studies have generated understanding of the bacterial species associated with both canine periodontal health and disease. Opportunities therefore exist to utilise these bacterial biomarkers to improve disease diagnosis in conscious-based veterinary oral health checks. Here, we demonstrate that molecular techniques, specifically quantitative polymerase chain reaction (qPCR), can be utilised for the detection of microbial biomarkers associated with canine periodontal health and disease. Results: Over 40 qPCR assays targeting single microbial species associated with canine periodontal health, gingivitis and early periodontitis were developed and validated. These were used to quantify levels of the respective taxa in canine subgingival plaque samples collected across periodontal health (PD0), gingivitis (PD1) and early periodontitis (PD2). When qPCR outputs were compared to the corresponding high-throughput sequencing data, there were strong correlations, including for one periodontal health-associated taxon, Capnocytophaga sp. COT-339 (rs = 0.805), and two periodontal disease-associated taxa, Peptostreptococcaceae XI [G-4] sp. COT-019 (rs = 0.902) and Clostridiales sp. COT-028 (rs = 0.802). The best performing models, from five machine learning approaches applied to the qPCR data for these taxa, estimated 85.7% sensitivity and 27.5% specificity for Capnocytophaga sp. COT-339, 74.3% sensitivity and 67.5% specificity for Peptostreptococcaceae XI [G-4] sp. COT-019, and 60.0% sensitivity and 80.0% specificity for Clostridiales sp. COT-028. Conclusions: A qPCR-based approach is an accurate, sensitive, and cost-effective method for detection of microbial biomarkers associated with periodontal health and disease. Taken together, the correlation between qPCR and high-throughput sequencing outputs, and early accuracy insights, indicate the strategy offers a prospective route to the development of diagnostic tools for canine periodontal disease.

Project description: Background: Traditional food allergy assessment of anaphylaxis remains limited in accuracy and accessibility. Current methods of anaphylaxis risk assessment are costly with low predictive accuracy. The Tolerance Induction Program (TIP) for anaphylactic patients undergoing TIP immunotherapy produced large-scale diagnostic data across biosimilar proteins, which was used to develop a machine learning model for patient-specific and allergen-specific anaphylaxis assessment. In explanation of construct, this work describes the algorithm design for assignment of peanut allergen score as a quantitative measure of anaphylaxis risk. Secondarily, it confirms the accuracy of the machine learning model for a specific cohort of food anaphylactic children. Methods and results: Machine learning model design for allergen score prediction utilized 241 individual allergy assays per patient. Accumulation of data across total IgE subdivision served as the basis of data organization. Two regression based Generalized Linear Models (GLM) were utilized to position allergy assessment on a linear scale. The initial model was further tested with sequential patient data over time. A Bayesian method was then used to improve outcomes by calculating the adaptive weights for the results of the two GLMs of peanut allergy score prediction. A linear combination of both provided the final hybrid machine learning prediction algorithm. Specific analysis of peanut anaphylaxis within one endotype model is estimated to predict the severity of possible anaphylactic reaction to peanut with a recall of 95.2% on a dataset of 530 juvenile patients with various food allergies, including but not limited to peanut allergy. Receiver Operating Characteristic analysis yielded over 99% AUC (area under curve) results within peanut allergy prediction. Conclusions: Machine learning algorithm design established from comprehensive molecular allergy data produces high accuracy and recall in anaphylaxis risk assessment. Subsequent design of additional food protein anaphylaxis algorithms is needed to improve the precision and efficiency of clinical food allergy assessment and immunotherapy treatment.