Dataset Information

Challenges of machine learning model validation using correlated behaviour data: Evaluation of cross-validation strategies and accuracy measures.

ABSTRACT: Automated monitoring of the movements and behaviour of animals is a valuable research tool. Recently, machine learning tools have been applied to many species to classify units of behaviour. For the monitoring of wild species, collecting enough data for training models might be problematic, thus we examine how machine learning models trained on one species can be applied to another closely related species with similar behavioural conformation. We contrast two ways to calculate accuracies, termed here overall and threshold accuracy, because the field has yet to define solid standards for reporting and measuring classification performance. We measure 21 dogs and 7 wolves, and find that overall accuracies are between 51 and 60% for classifying 8 behaviours (lay, sit, stand, walk, trot, run, eat, drink) when training and testing data are from the same species, and between 41 and 51% when training and testing are cross-species. We show that using data from dogs to predict the behaviour of wolves is feasible. We also show that optimising the model for overall accuracy leads to similar overall and threshold accuracies, while optimising for threshold accuracy leads to threshold accuracies well above 80% but yields very low overall accuracies, often below the chance level. Moreover, we show that the most common method for dividing the data between training and testing data (random selection of test data) overestimates the accuracy of models when applied to data of new specimens. Consequently, we argue that for the most common goals of animal behaviour recognition, overall accuracy should be the preferred metric. Considering that the goal is often to collect movement data without other methods of observation, we argue that training data and testing data should be divided by individual and not randomly.
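The difference between the two ways of dividing data described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `split_by_individual`, the 25% test fraction, and the toy dog identifiers are all assumptions made for the example.

```python
import random


def split_by_individual(individual_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no individual occurs in both sets.

    individual_ids holds one entry per sample, naming the animal it came from.
    """
    individuals = sorted(set(individual_ids))
    rng = random.Random(seed)
    rng.shuffle(individuals)
    # Hold out whole animals, not single samples.
    n_test = max(1, round(len(individuals) * test_fraction))
    held_out = set(individuals[:n_test])
    train_idx = [i for i, ind in enumerate(individual_ids) if ind not in held_out]
    test_idx = [i for i, ind in enumerate(individual_ids) if ind in held_out]
    return train_idx, test_idx


# Toy data (hypothetical): 4 dogs with 5 samples each.
ids = [f"dog{k}" for k in range(4) for _ in range(5)]
train_idx, test_idx = split_by_individual(ids)
train_dogs = {ids[i] for i in train_idx}
test_dogs = {ids[i] for i in test_idx}
```

With a random split of samples, the same animal would usually contribute correlated samples to both sets, which is the situation the abstract reports as overestimating accuracy for new specimens. Holding out whole individuals avoids that overlap by construction.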

SUBMITTER: Ferdinandy B

PROVIDER: S-EPMC7371169 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

Challenges of machine learning model validation using correlated behaviour data: Evaluation of cross-validation strategies and accuracy measures.

Ferdinandy B, Gerencsér L, Corrieri L, Perez P, Újváry D, Csizmadia G, Miklósi Á

PLoS One, 2020-07-20, issue 7

PMID: 32687528

Similar Datasets

Project description: Background: Ebola virus disease (EVD) plagues low-resource and difficult-to-access settings. Machine learning prognostic models and mHealth tools could improve the understanding and use of evidence-based care guidelines in such settings. However, data incompleteness and lack of interoperability limit model generalizability. This study harmonizes diverse datasets from the 2014-16 EVD epidemic and generates several prognostic models incorporated into the novel Ebola Care Guidelines app that provides informed access to recommended evidence-based guidelines. Methods: Multivariate logistic regression was applied to investigate survival outcomes in 470 patients admitted to five Ebola treatment units in Liberia and Sierra Leone at various timepoints during 2014-16. We generated a parsimonious model (viral load, age, temperature, bleeding, jaundice, dyspnea, dysphagia, and time-to-presentation) and several fallback models for when these variables are unavailable. All were externally validated against two independent datasets and compared to further models including expert observational wellness assessments. Models were incorporated into an app highlighting the signs/symptoms with the largest contribution to prognosis. Findings: The parsimonious model approached the predictive power of observational assessments by experienced clinicians (Area-Under-the-Curve, AUC = 0.70-0.79, accuracy = 0.64-0.74) and maintained its performance across subcohorts with different healthcare seeking behaviors. Age and viral load contributed > 5-fold the weighting of other features and including them in a minimal model had a similar AUC, albeit at the cost of specificity. Interpretation: Clinically guided prognostic models can recapitulate clinical expertise and be useful when such expertise is unavailable. Incorporating these models into mHealth tools may facilitate their interpretation and provide informed access to comprehensive clinical guidelines. Funding: Howard Hughes Medical Institute, US National Institutes of Health, Bill & Melinda Gates Foundation, International Medical Corps, UK Department for International Development, and GOAL Global.

Project description: In recent years machine learning has transformed many aspects of the drug discovery process, including small molecule design, for which the prediction of bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches, but is fundamentally limited by the accuracy with which protein:ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase:inhibitor complex geometries for downstream machine learning scoring approaches, we present a kinase-centric docking benchmark assessing how well different classes of docking and pose selection strategies recapitulate experimentally observed binding modes in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures co-crystallized with 423 ATP-competitive ligands. We find that the docking methods biased by the co-crystallized ligand (utilizing shape overlap with or without maximum common substructure matching) are more successful in recovering binding poses than standard physics-based docking alone. Also, docking into multiple structures significantly increases the chance of generating a low RMSD docking pose. Docking with an approach that combines all three methods (Posit) into structures with the most similar co-crystallized ligands according to shape and electrostatics proved to be the most efficient way to reproduce binding poses, achieving a success rate of 66.9% across all included systems. The studied docking and pose selection strategies, which utilize the OpenEye Toolkit, were implemented into pipelines of the KinoML framework, allowing automated and reliable protein:ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe the general findings can also be transferred to other protein families.

Project description:With the rise in whole slide scanner technology, large numbers of tissue slides are being scanned and represented and archived digitally. While digital pathology has substantial implications for telepathology, second opinions, and education there are also huge research opportunities in image computing with this new source of "big data". It is well known that there is fundamental prognostic data embedded in pathology images. The ability to mine "sub-visual" image features from digital pathology slide images, features that may not be visually discernible by a pathologist, offers the opportunity for better quantitative modeling of disease appearance and hence possibly improved prediction of disease aggressiveness and patient outcome. However the compelling opportunities in precision medicine offered by big digital pathology data come with their own set of computational challenges. Image analysis and computer assisted detection and diagnosis tools previously developed in the context of radiographic images are woefully inadequate to deal with the data density in high resolution digitized whole slide images. Additionally there has been recent substantial interest in combining and fusing radiologic imaging and proteomics and genomics based measurements with features extracted from digital pathology images for better prognostic prediction of disease aggressiveness and patient outcome. Again there is a paucity of powerful tools for combining disease specific features that manifest across multiple different length scales. The purpose of this review is to discuss developments in computational image analysis tools for predictive modeling of digital pathology images from a detection, segmentation, feature extraction, and tissue classification perspective. 
We discuss the emergence of new handcrafted feature approaches for improved predictive modeling of tissue appearance and also review the emergence of deep learning schemes for both object detection and tissue classification. We also briefly review some of the state of the art in fusion of radiology and pathology images and also combining digital pathology derived image measurements with molecular "omics" features for better predictive modeling. The review ends with a brief discussion of some of the technical and computational challenges to be overcome and reflects on future opportunities for the quantitation of histopathology.

Project description: Background: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions. Results: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results. Conclusions: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
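The idea of preserving a variable's distribution across partitions can be illustrated with a generic stratified fold assignment. This is only a sketch of the general principle and should not be read as the PICV algorithm itself, whose details are not given in the description above; the function name `stratified_folds` and the toy genotype counts are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_folds(values, n_folds=5, seed=0):
    """Assign each sample a fold index so that every fold receives
    roughly the same share of each distinct value in `values`."""
    by_value = defaultdict(list)
    for idx, value in enumerate(values):
        by_value[value].append(idx)
    rng = random.Random(seed)
    fold_of = [None] * len(values)
    for value in sorted(by_value):
        idxs = by_value[value]
        rng.shuffle(idxs)
        # Deal the shuffled samples of this value round-robin over the folds.
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % n_folds
    return fold_of


# Toy, highly imbalanced genotype column (hypothetical counts).
genotypes = ["AA"] * 80 + ["Aa"] * 15 + ["aa"] * 5
folds = stratified_folds(genotypes, n_folds=5)
rare_per_fold = [
    sum(1 for g, f in zip(genotypes, folds) if g == "aa" and f == k)
    for k in range(5)
]
```

Under purely random allocation, a rare genotype such as the five `aa` samples here could easily end up missing from some folds; dealing each value out separately guarantees that every fold sees it.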

Project description: The Corona Virus Disease 2019 (COVID-19) pandemic has increased the importance of Virtual Learning Environments (VLEs), prompting students to study from their homes. Every day a tremendous amount of data is generated when students interact with VLEs to perform different activities and access learning material. To make the generated data useful, it must be processed and managed by an appropriate machine learning (ML) algorithm. The applications of ML algorithms are manifold, with Education Data Mining (EDM) and Learning Analytics (LA) as their major fields. ML algorithms are commonly used to process raw data to discover hidden patterns and construct a model to make future predictions, such as predicting students' performance, dropouts, engagement, etc. However, in a VLE, it is important to select the right and most applicable ML algorithm to give the best performance results. In this study, we aim to improve the ML and deep learning (DL) algorithms that perform poorly in terms of accuracy, precision, recall, and F1 score. Several ML algorithms were applied to the Open University Learning Analytics (OULA) dataset to reveal which one offers the best results in terms of performance, accuracy, precision, recall, and F1 score. Two popular ML algorithms, Decision Tree (DT) and Feed-Forward Neural Network (FFNN), provided unsatisfactory results. They were selected and subjected to various techniques such as grid search cross-validation, adaptive boosting, extreme gradient boosting, early stopping, feature engineering, and dropping inactive neurons to improve their performance scores. Moreover, we also determined the feature weights/importance in predicting the students' study performance, leading to the design and development of the adaptive learning system. The ML techniques and the methods used in this research study can be used by instructors/administrators to optimize learning content and provide informed guidance to students, thus improving their learning experience and making it exciting and adaptive.

Project description: Background: Clinical parameter-based nomograms and staging systems provide limited information for the prediction of survival in intrahepatic cholangiocarcinoma (ICC) patients. In this study, we developed a methylation signature that precisely predicts overall survival (OS) after surgery. Methods: An epigenome-wide study of DNA methylation based on whole-genome bisulfite sequencing (WGBS) was conducted for two independent cohorts (discovery cohort, n=164; validation cohort, n=170) from three hepatobiliary centers in China. By referring to differentially methylated regions (DMRs), we proposed the concept of prognostically methylated regions (PMRs), which were composed of consecutive prognostically methylated CpGs (PMCs). Using machine learning strategies (Random Forest and least absolute shrinkage and selection operator regression), a prognostic methylation score (PMS) was constructed based on 14 PMRs in the discovery cohort and confirmed in the validation cohort. Results: The C-indices of the PMS for predicting OS in the discovery and validation cohorts were 0.79 and 0.74, respectively. In the whole cohort, the PMS was an independent predictor of OS [hazard ratio (HR) =8.12; 95% confidence interval (CI): 5.48-12.04; P<0.001], and the C-index (0.78) of the PMS was significantly higher than that of the Johns Hopkins University School of Medicine (JHUSM) nomogram (0.69, P<0.001), the Eastern Hepatobiliary Surgery Hospital (EHBSH) nomogram (0.67, P<0.001), the American Joint Committee on Cancer (AJCC) tumor-node-metastasis (TNM) staging system (0.61, P<0.001), and the MEGNA prognostic score (0.60, P<0.001). The patients in quartile 4 of PMS could benefit from adjuvant therapy (AT) (HR =0.54; 95% CI: 0.32-0.91; log-rank P=0.043), whereas those in quartiles 1-3 could not. The other nomograms and staging systems failed to identify such patients. Further analyses of potential mechanisms showed that the PMS was associated with tumor biological behaviors, pathway activation, and immune microenvironment. Conclusions: The PMS could improve the prognostic accuracy and identify patients who would benefit from AT for ICC patients, and might facilitate decisions in treatment of ICC patients.
