Dataset Information

Win percentage: a novel measure for assessing the suitability of machine classifiers for biological problems.

ABSTRACT:

Background

Selecting an appropriate classifier for a particular biological application poses a difficult problem for researchers and practitioners alike. In particular, the choice of classifier depends heavily on the features selected. For high-throughput biomedical datasets, feature selection is often a preprocessing step that gives an unfair advantage to classifiers built on the same modeling assumptions. In this paper, we seek classifiers that are suitable for a particular problem independent of feature selection. We propose a novel measure, called "win percentage", for assessing the suitability of machine classifiers for a particular problem. We define win percentage as the probability that a classifier will perform better than its peers on a finite random sample of feature sets, giving each classifier equal opportunity to find suitable features.
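The definition of win percentage above can be illustrated with a small Monte Carlo sketch. This is one possible reading of the abstract, not the authors' implementation: it assumes a precomputed score table with one row per feature set and one column per classifier (higher is better), lets each classifier keep its best score within a random sample of feature sets, and credits the classifier with the highest best score as the winner of that draw. The function name `win_percentage`, its parameters, the equal splitting of ties, and the toy scores are all hypothetical.

```python
import numpy as np


def win_percentage(scores, sample_size, n_trials=2000, seed=0):
    """Monte Carlo estimate of each classifier's win percentage.

    scores: array of shape (n_feature_sets, n_classifiers), higher is better.
    sample_size: number of feature sets drawn (without replacement) per trial.
    Returns an array of length n_classifiers whose entries sum to 1.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n_sets, n_clf = scores.shape
    wins = np.zeros(n_clf)
    for _ in range(n_trials):
        # Every classifier sees the same random sample of feature sets.
        idx = rng.choice(n_sets, size=sample_size, replace=False)
        # Each classifier keeps its own best score within the sample.
        best = scores[idx].max(axis=0)
        # Assumption: ties share the win equally.
        winners = np.flatnonzero(best == best.max())
        wins[winners] += 1.0 / winners.size
    return wins / n_trials


# Hypothetical scores: 3 feature sets (rows) by 2 classifiers (columns).
demo = np.array([[0.9, 0.5],
                 [0.6, 0.8],
                 [0.7, 0.7]])
print(win_percentage(demo, sample_size=1))
print(win_percentage(demo, sample_size=3))
```

In this toy table, drawing all three feature sets makes the outcome deterministic: the first classifier always wins because it holds the single highest score. With only one feature set per draw, each classifier wins some of the time, which mirrors the point that the preferred classifier depends on how thoroughly the feature space is searched.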

Results

First, we illustrate the difficulty in evaluating classifiers after feature selection. We show that several classifiers can each perform statistically significantly better than their peers given the right feature set among the top 0.001% of all feature sets. We illustrate the utility of win percentage using synthetic data, and evaluate six classifiers in analyzing eight microarray datasets representing three diseases: breast cancer, multiple myeloma, and neuroblastoma. After initially using all Gaussian gene-pairs, we show that precise estimates of win percentage (within 1%) can be achieved using a smaller random sample of all feature pairs. We show that for these data no single classifier can be considered the best without knowing the feature set. Instead, win percentage captures the non-zero probability that each classifier will outperform its peers based on an empirical estimate of performance.

Conclusions

Fundamentally, we illustrate that the selection of the most suitable classifier (i.e., one that is more likely to perform better than its peers) depends not only on the dataset and application but also on the thoroughness of feature selection. In particular, win percentage provides a single measurement that could assist users in eliminating or selecting classifiers for their particular application.

SUBMITTER: Parry RM

PROVIDER: S-EPMC3485616 | biostudies-literature | 2012 Mar

REPOSITORIES: biostudies-literature


Publications

Win percentage: a novel measure for assessing the suitability of machine classifiers for biological problems.

Parry RM, Phan JH, Wang MD

BMC Bioinformatics, 2012 Mar 21


PMID: 22536905

Similar Datasets

Project description: Background: Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. Objective: This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. Methods: We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. Results: LSTM-CNN performed the best for relevance, with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98); all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers for distinguishing commercial from noncommercial tweets, with an AUC of 0.99 (95% CI 0.98-0.99); and BiLSTM performed the best for provape sentiment, with an AUC of 0.83 (95% CI 0.78-0.89). Overall, LSTM-CNN performed the best across all 3 classification tasks. Conclusions: We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.

Project description: We developed a detailed, whole-body physiologically based pharmacokinetic (PBPK) modeling tool for calculating the distribution of pharmaceutical agents in the various tissues and organs of a human or animal as a function of time. Ordinary differential equations (ODEs) represent the circulation of body fluids through organs and tissues at the macroscopic level, and the biological transport mechanisms and biotransformations within cells and their organelles at the molecular scale. Each major organ in the body is modeled as composed of one or more tissues. Tissues are made up of cells and fluid spaces. The model accounts for the circulation of arterial and venous blood as well as lymph. Since its development was fueled by the need to accurately predict the pharmacokinetic properties of imaging agents, BioDMET is more complex than most PBPK models. The anatomical details of the model are important for the imaging simulation endpoints. Model complexity has also been crucial for quickly adapting the tool to different problems without the need to generate a new model for every problem. When simpler models are preferred, the non-critical compartments can be dynamically collapsed to reduce unnecessary complexity. BioDMET has been used for imaging feasibility calculations in oncology, neurology, cardiology, and diabetes. For this purpose, the time-concentration data generated by the model are fed into a physics-based image simulator to establish imageability criteria. These are then used to define the agent and physiology property ranges required for successful imaging. BioDMET has lately been adapted to aid the development of antimicrobial therapeutics. Given a range of built-in features and its inherent flexibility for customization, the model can be used to study a variety of pharmacokinetic and pharmacodynamic problems, such as the effects of inter-individual differences and disease states on drug pharmacokinetics and pharmacodynamics, dosing optimization, and inter-species scaling. While developing a tool to aid imaging agent and drug development, we aimed to accelerate the acceptance and broad use of PBPK modeling by providing free mechanistic PBPK software that is user-friendly, easy to adapt to a wide range of problems even by non-programmers, and provided with ready-to-use parameterized models and benchmarking data collected from the peer-reviewed literature.
