Dataset Information

Variable selection for binary classification using error rate p-values applied to metabolomics data.

ABSTRACT:

Background

Metabolomics datasets are often high-dimensional, yet only a limited number of variables are expected to be informative for a specific research question. Selecting the informative variables is therefore an important and potentially complex task. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp.
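The threshold search behind a minimum classification error rate can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `min_error_rate`, the use of midpoints between distinct values as candidate thresholds, and the 0/1 label coding are all assumptions made for the example.

```python
import numpy as np

def min_error_rate(x, y):
    """Hypothetical helper: smallest single-threshold error rate for one variable.

    x: 1-D array of measurements; y: 0/1 group labels.
    Returns (error_rate, threshold, direction), where direction = +1 means
    "classify as group 1 when x > threshold" and -1 means "when x < threshold".
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    xs = np.unique(x)  # sorted distinct values
    # Candidate thresholds (an assumption): midpoints between distinct values,
    # plus one point below the minimum and one above the maximum.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best = (np.inf, None, 0)
    for t in candidates:
        for direction in (1, -1):  # try both shift directions
            pred = (x > t) if direction == 1 else (x < t)
            err = float(np.mean(pred.astype(int) != y))
            if err < best[0]:
                best = (err, float(t), direction)
    return best

# Toy data: group 1 is shifted upwards relative to group 0.
x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
err, threshold, direction = min_error_rate(x, y)
```

A new subject would then be assigned to group 1 when its value lies on the `direction` side of `threshold`.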

Results

We show that non-parametric hypothesis testing, based on minimum classification error rates as test statistics, can find statistically significantly shifted variables. The discriminatory ability of variables becomes more apparent when error rates are evaluated based on their corresponding p-values, as relatively high error rates can still be statistically significant. ERp can handle unequal and small group sizes, as well as account for the cost of misclassification. ERp retains (if known) or reveals (if unknown) the shift direction, aiding in biological interpretation. The threshold resulting in the minimum error rate can immediately be used to classify new subjects. We use NMR-generated metabolomics data to illustrate how ERp is able to discriminate subjects diagnosed with Mycobacterium tuberculosis infected meningitis from a control group. The list of discriminatory variables produced by ERp contains all biologically relevant variables, with appropriate shift directions, discussed in the original paper from which these data are taken.
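One way to turn a minimum error rate into a p-value is a label-permutation test, sketched below. The abstract does not say how ERp obtains its null distribution, so the permutation scheme here is a generic non-parametric stand-in rather than the published procedure. The cost weighting (`cost_fp`, `cost_fn`), the function names, and the toy data are likewise assumptions for illustration.

```python
import numpy as np

def weighted_min_error(x, y, cost_fp=1.0, cost_fn=1.0):
    """Hypothetical helper: minimum cost-weighted error over single thresholds.

    Returns (error, threshold, direction); direction = +1 means group 1 is
    predicted when x > threshold, -1 when x < threshold.
    """
    xs = np.unique(x)
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best = (np.inf, None, 0)
    for t in candidates:
        for direction in (1, -1):
            pred = (x > t) if direction == 1 else (x < t)
            fp = np.sum(pred & (y == 0))   # group 0 subjects called group 1
            fn = np.sum(~pred & (y == 1))  # group 1 subjects called group 0
            err = (cost_fp * fp + cost_fn * fn) / len(y)
            if err < best[0]:
                best = (float(err), float(t), direction)
    return best

def erp_permutation_pvalue(x, y, n_perm=999, seed=0, **costs):
    """Permutation p-value for the observed minimum error (illustrative only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    observed, threshold, direction = weighted_min_error(x, y, **costs)
    # Count permutations whose minimum error is at least as small as observed.
    hits = sum(
        weighted_min_error(x, rng.permutation(y), **costs)[0] <= observed
        for _ in range(n_perm)
    )
    p_value = (1 + hits) / (1 + n_perm)
    return p_value, observed, threshold, direction

# Toy example with unequal group sizes (8 controls, 5 cases).
x = np.concatenate((np.arange(1, 9), np.arange(20, 25))).astype(float)
y = np.concatenate((np.zeros(8, dtype=int), np.ones(5, dtype=int)))
p_value, error, threshold, direction = erp_permutation_pvalue(x, y)
```

Because the labels are permuted as a whole, the group sizes are preserved in every permutation, which is why unequal groups need no special handling in this sketch. The returned `direction` corresponds to the shift direction that the abstract describes as retained or revealed.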

Conclusions

ERp performs variable selection and classification, is non-parametric and aids biological interpretation while handling unequal group sizes and misclassification costs. All this is achieved by a single approach which is easy to perform and interpret. ERp has the potential to address many other characteristics of metabolomics data. Future research aims to extend ERp to account for a large proportion of observations below the detection limit, as well as expand on interactions between variables.

SUBMITTER: van Reenen M

PROVIDER: S-EPMC4712617 | biostudies-literature | 2016 Jan

REPOSITORIES: biostudies-literature

Publications

Variable selection for binary classification using error rate p-values applied to metabolomics data.

van Reenen M, Reinecke CJ, Westerhuis JA, Venter JH

BMC Bioinformatics, 2016 Jan 14

PMID: 26763892

Similar Datasets

Structured variable selection with q-values.
| S-EPMC3841382 | biostudies-literature

High Dimensional Variable Selection with Error Control.
| S-EPMC5002494 | biostudies-literature

Variable selection methods for identifying predictor interactions in data with repeatedly measured binary outcomes.
| S-EPMC8057419 | biostudies-literature

Integration of Survival and Binary Data for Variable Selection and Prediction: A Bayesian Approach.
| S-EPMC7729996 | biostudies-literature

Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data
| S-EPMC9844919 | biostudies-literature

On latent-variable model misspecification in structural measurement error models for binary response.
| S-EPMC3229040 | biostudies-literature

High-dimensional variable selection for ordinal outcomes with error control.
| S-EPMC7820886 | biostudies-literature

Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data.
| S-EPMC9995454 | biostudies-literature

SEB genotyping: SmartAmp-Eprimer binary code genotyping for complex, highly variable targets applied to HBV.
| S-EPMC9164387 | biostudies-literature

Variable Selection in Untargeted Metabolomics and the Danger of Sparsity.
| S-EPMC7698561 | biostudies-literature