Dataset Information

ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data.

ABSTRACT: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages, and dependency conflicts may arise during installation. To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner. This software is a resource to researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.
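
The nested cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not ShinyLearner's code or interface: the synthetic data, the k-nearest-neighbours classifier, the candidate values of k, and every function name below are assumptions chosen only to show the idea. The inner loop chooses a hyperparameter using the outer-training rows alone, and the outer loop scores that choice on rows that played no part in tuning.

```python
import random
from collections import Counter

def make_data(n=60, seed=0):
    """Two Gaussian classes in 2-D (synthetic, for illustration only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    rng.shuffle(rows)
    return rows

def knn_predict(train, x, k):
    """Majority vote among the k nearest training rows (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def folds(rows, n_folds):
    """Yield (train, test) pairs; fold i tests on every n_folds-th row."""
    for i in range(n_folds):
        test = rows[i::n_folds]
        train = [r for j, r in enumerate(rows) if j % n_folds != i]
        yield train, test

def nested_cv(rows, k_grid=(1, 3, 5, 7), outer=5, inner=3):
    """Return one (chosen_k, outer_accuracy) pair per outer fold."""
    results = []
    for outer_train, outer_test in folds(rows, outer):
        # Inner loop: choose k using only the outer-training rows.
        def inner_score(k):
            scores = [accuracy(tr, te, k) for tr, te in folds(outer_train, inner)]
            return sum(scores) / len(scores)
        best_k = max(k_grid, key=inner_score)
        # Outer loop: score the chosen k on rows never used for tuning.
        results.append((best_k, accuracy(outer_train, outer_test, best_k)))
    return results

if __name__ == "__main__":
    for fold, (k, acc) in enumerate(nested_cv(make_data()), start=1):
        print(f"outer fold {fold}: chosen k={k}, accuracy={acc:.2f}")
```

Recording the chosen hyperparameter for each outer fold, as the returned pairs do here, is one simple way to keep the nested steps visible after the run.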

SUBMITTER: Piccolo SR

PROVIDER: S-EPMC7131989 | biostudies-literature | 2020 Apr

REPOSITORIES: biostudies-literature

Publications

ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data.

Piccolo SR, Lee TJ, Suh E, Hill K

GigaScience, 2020 Apr 1, issue 4

PMID: 32249316

Similar Datasets

Project description: Background: While efforts to establish best practices with functional near infrared spectroscopy (fNIRS) signal processing have been published, there are still no community standards for applying machine learning to fNIRS data. Moreover, the lack of open source benchmarks and standard expectations for reporting means that published works often claim high generalisation capabilities, but with poor practices or missing details in the paper. These issues make it hard to evaluate the performance of models when it comes to choosing them for brain-computer interfaces. Methods: We present an open-source benchmarking framework, BenchNIRS, to establish a best practice machine learning methodology to evaluate models applied to fNIRS data, using five open access datasets for brain-computer interface (BCI) applications. The BenchNIRS framework, using a robust methodology with nested cross-validation, enables researchers to optimise models and evaluate them without bias. The framework also enables us to produce useful metrics and figures to detail the performance of new models for comparison. To demonstrate the utility of the framework, we present a benchmarking of six baseline models [linear discriminant analysis (LDA), support-vector machine (SVM), k-nearest neighbours (kNN), artificial neural network (ANN), convolutional neural network (CNN), and long short-term memory (LSTM)] on the five datasets and investigate the influence of different factors on the classification performance, including the number of training examples and the size of the time window of each fNIRS sample used for classification. We also present results with a sliding window as opposed to simple classification of epochs, and with a personalised approach (within subject data classification) as opposed to a generalised approach (unseen subject data classification). Results and discussion: Results show that the performance is typically lower than the scores often reported in the literature, and without great differences between models, highlighting that predicting unseen data remains a difficult task. Our benchmarking framework provides future authors who achieve significantly high classification scores with a tool to demonstrate the advances in a comparable way. To complement our framework, we contribute a set of recommendations for methodology decisions and writing papers when applying machine learning to fNIRS data.

Project description: Background: Retinal vein occlusion (RVO) is a leading cause of vision loss globally. Routine health check-up data (including demographic information, medical history, and laboratory test results) are commonly utilized in clinical settings for disease risk assessment. This study aimed to develop a machine learning model to predict RVO risk in the general population using such tabular health data, without requiring coding expertise or retinal imaging. Methods: We utilized data from the Korea National Health and Nutrition Examination Surveys (KNHANES) collected between 2017 and 2020 to develop the RVO prediction model, with external validation performed using independent data from KNHANES 2021. Model construction was conducted using Orange Data Mining, an open-source, code-free, component-based tool with a user-friendly interface, and Google Vertex AI. An easy-to-use oversampling function was employed to address class imbalance, enhancing the usability of the workflow. Various machine learning algorithms were trained by incorporating all features from the health check-up data in the development set. The primary outcome was the area under the receiver operating characteristic curve (AUC) for identifying RVO. Results: All machine learning training was completed without the need for coding experience. An artificial neural network (ANN) with a ReLU activation function, developed using Orange Data Mining, demonstrated superior performance, achieving an AUC of 0.856 (95% confidence interval [CI], 0.835-0.875) in internal validation and 0.784 (95% CI, 0.763-0.803) in external validation. The ANN outperformed logistic regression and Google Vertex AI models, though differences were not statistically significant in internal validation. In external validation, the ANN showed a marginally significant improvement over logistic regression (P = 0.044), with no significant difference compared to Google Vertex AI. Key predictive variables included age, household income, and blood pressure-related factors. Conclusion: This study demonstrates the feasibility of developing an accessible, cost-effective RVO risk prediction tool using health check-up data and no-code machine learning platforms. Such a tool has the potential to enhance early detection and preventive strategies in general healthcare settings, thereby improving patient outcomes.
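
For readers unfamiliar with the AUC metric reported above, the following minimal sketch computes it with the pairwise-ranking (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The function name and the toy labels and scores are assumptions for illustration; the description does not state how the study's tools computed AUC, and none of the numbers below come from the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = positive case).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Toy data (hypothetical): two negatives and two positives.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the internal and external validation values quoted in the description.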

Project description: Objective: Classification tasks are an open challenge in the field of biomedicine. While several machine-learning techniques exist to accomplish this objective, several peculiarities associated with biomedical data, especially omics measurements, prevent their use or limit their performance. Omics approaches aim to understand a complex biological system through systematic analysis of its content at the molecular level. However, omics data are heterogeneous, sparse and affected by the classical "curse of dimensionality" problem, i.e. having far fewer observations, or samples (n), than omics features (p). Furthermore, a major problem with multi-omics data is imbalance at either the class or the feature level. The objective of this work is to study whether feature extraction and/or feature selection techniques can improve the performance of classification machine-learning algorithms on omics measurements. Methods: Among all omics, metabolomics has emerged as a powerful tool in cancer research, facilitating a deeper understanding of the complex metabolic landscape associated with tumorigenesis and tumor progression. Thus, we selected three publicly available metabolomics datasets, applied several feature extraction techniques, both linear and non-linear, with or without feature selection methods, and evaluated patient-classification performance in the different configurations for the three datasets. Results: We provide a general workflow and guidelines on when to use those techniques depending on the characteristics of the data available. To test whether our approach extends to other omics data, we also included a transcriptomics dataset and a proteomics dataset. Overall, for all datasets, we showed that applying supervised feature selection improves the performance of feature extraction methods for classification purposes. Scripts used to perform all analyses are available at: https://github.com/Plant-Net/Metabolomic_project/.
