Dataset Information

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

ABSTRACT: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
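The study design in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the abstract gives the counts (five feature-selection methods, eight classifiers, three endpoints) but not their names, so the labels below are placeholders. The fold count, the number of bootstrap replicates, the sample count of 100, and the use of out-of-bag samples as the bootstrap test set are likewise assumptions.

```python
import random
from itertools import product

# Placeholder labels: the abstract states only the counts (5, 8, 3), not the names.
FEATURE_SELECTORS = [f"fs_{i}" for i in range(1, 6)]
CLASSIFIERS = [f"clf_{i}" for i in range(1, 9)]
ENDPOINTS = [f"endpoint_{i}" for i in range(1, 4)]


def predictor_grid():
    """Every (feature selector, classifier) pair: 5 x 8 = 40 predictors."""
    return list(product(FEATURE_SELECTORS, CLASSIFIERS))


def model_grid():
    """Every predictor applied to every endpoint: 40 x 3 = 120 models."""
    return [(e, fs, clf) for e in ENDPOINTS for fs, clf in predictor_grid()]


def kfold_splits(n, k, rng):
    """Yield (train, test) index lists for one pass of k-fold cross-validation."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def bootstrap_splits(n, replicates, rng):
    """Yield (train, test) pairs: train is drawn with replacement,
    test is the out-of-bag remainder (an assumed evaluation scheme)."""
    for _ in range(replicates):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        test = [j for j in range(n) if j not in in_bag]
        yield train, test


if __name__ == "__main__":
    rng = random.Random(0)
    print(len(predictor_grid()), len(model_grid()))  # 40 120
    # Hypothetical training-set size of 100 samples, 5 folds, 3 replicates.
    for train, test in kfold_splits(100, 5, rng):
        print("cv fold:", len(train), "train,", len(test), "test")
    for train, test in bootstrap_splits(100, 3, rng):
        print("bootstrap:", len(set(train)), "unique train,", len(test), "out-of-bag")
```

Each of the 120 models would be fitted and scored inside these splits, and the resampled estimates then compared with the accuracy on the independent validation set.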

SUBMITTER: Popovici V

PROVIDER: S-EPMC2880423 | biostudies-literature | 2010

REPOSITORIES: biostudies-literature


Publications

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

Vlad Popovici, Weijie Chen, Brandon G. Gallas, Christos Hatzis, Weiwei Shi, Frank W. Samuelson, Yuri Nikolsky, Marina Tsyganova, Alex Ishkin, Tatiana Nikolskaya, Kenneth R. Hess, Vicente Valero, Daniel Booser, Mauro Delorenzi, Gabriel N. Hortobagyi, Leming Shi, W. Fraser Symmans, Lajos Pusztai

Breast Cancer Research : BCR, 2010-01-11, issue 1


PMID: 20064235

Similar Datasets

Project description: Background: Inter-observer variability in stroke aetiological classification may have an effect on trial power and estimation of treatment effect. We modelled the effect of misclassification on required sample size in a hypothetical cardioembolic (CE) stroke trial. Methods: We performed a systematic review to quantify the reliability (inter-observer variability) of various stroke aetiological classification systems. We then modelled the effect of this misclassification in a hypothetical trial of anticoagulant in CE stroke contaminated by patients with non-cardioembolic (non-CE) stroke aetiology. Rates of misclassification were based on the summary reliability estimates from our systematic review. We randomly sampled data from previous acute trials in CE and non-CE participants, using the Virtual International Stroke Trials Archive. We used bootstrapping to model the effect of varying misclassification rates on sample size required to detect a between-group treatment effect across 5000 permutations. We described outcomes in terms of survival and stroke recurrence censored at 90 days. Results: From 4655 titles, we found 14 articles describing three stroke classification systems. The inter-observer reliability of the classification systems varied from 'fair' to 'very good' and suggested misclassification rates of 5% and 20% for our modelling. The hypothetical trial, with 80% power and alpha 0.05, was able to show a difference in survival between anticoagulant and antiplatelet in CE with a sample size of 198 in both trial arms. Contamination of both arms with 5% misclassified participants inflated the required sample size to 237 and with 20% misclassification inflated the required sample size to 352, for equivalent trial power. For an outcome of stroke recurrence using the same data, base-case estimated sample size for 80% power and alpha 0.05 was n = 502 in each arm, increasing to 605 at 5% contamination and 973 at 20% contamination. Conclusions: Stroke aetiological classification systems suffer from inter-observer variability, and the resulting misclassification may limit trial power. Trial registration: Protocol available at reviewregistry540.

