Dataset Information

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction.

ABSTRACT:

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

SUBMITTER: Shi P

PROVIDER: S-EPMC3223741 | biostudies-literature | 2011 Sep

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction.

Shi Ping P Ray Surajit S Zhu Qifu Q Kon Mark A MA

BMC bioinformatics 20110923

<h4>Background</h4>The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believ ...[more]

PMID: 21939564

Similar Datasets

Project description:IntroductionCardiotocography (CTG) consists of two biophysical signals that are fetal heart rate (FHR) and uterine contraction (UC). In this research area, the computerized systems are usually utilized to provide more objective and repeatable results.Materials and methodsFeature selection algorithms are of great importance regarding the computerized systems to not only reduce the dimension of feature set but also to reveal the most relevant features without losing too much information. In this paper, three filters and two wrappers feature selection methods and machine learning models, which are artificial neural network (ANN), k-nearest neighbor (kNN), decision tree (DT), and support vector machine (SVM), are evaluated on a high dimensional feature set obtained from an open-access CTU-UHB intrapartum CTG database. The signals are divided into two classes as normal and hypoxic considering umbilical artery pH value (pH?<?7.20) measured after delivery. A comprehensive diagnostic feature set forming the features obtained from morphological, linear, nonlinear, time-frequency and image-based time-frequency domains is generated first. Then, combinations of the feature selection algorithms and machine learning models are evaluated to achieve the most effective features as well as high classification performance.ResultsThe experimental results show that it is possible to achieve better classification performance using lower dimensional feature set that comprises of more related features, instead of the high-dimensional feature set. The most informative feature subset was generated by considering the frequency of selection of the features by feature selection algorithms. As a result, the most efficient results were produced by selected only 12 relevant features instead of a full feature set consisting of 30 diagnostic indices and SVM model. Sensitivity and specificity were achieved as 77.40% and 93.86%, respectively.ConclusionConsequently, the evaluation of multiple feature selection algorithms resulted in achieving the best results.

Project description:BACKGROUND:Prognostication is an essential tool for risk adjustment and decision making in the intensive care unit (ICU). Research into prognostication in ICU has so far been limited to data from admission or the first 24 hours. Most ICU admissions last longer than this, decisions are made throughout an admission, and some admissions are explicitly intended as time-limited prognostic trials. Despite this, temporal changes in prognostic ability during ICU admission has received little attention to date. Current predictive models, in the form of prognostic clinical tools, are typically derived from linear models and do not explicitly handle incremental information from trends. Machine learning (ML) allows predictive models to be developed which use non-linear predictors and complex interactions between variables, thus allowing incorporation of trends in measured variables over time; this has made it possible to investigate prognosis throughout an admission. METHODS AND FINDINGS:This study uses ML to assess the predictability of ICU mortality as a function of time. Logistic regression against physiological data alone outperformed APACHE-II and demonstrated several important interactions including between lactate & noradrenaline dose, between lactate & MAP, and between age & MAP consistent with the current sepsis definitions. ML models consistently outperformed logistic regression with Deep Learning giving the best results. Predictive power was maximal on the second day and was further improved by incorporating trend data. Using a limited range of physiological and demographic variables, the best machine learning model on the first day showed an area under the receiver-operator characteristic curve (AUC) of 0.883 (σ = 0.008), compared to 0.846 (σ = 0.010) for a logistic regression from the same predictors and 0.836 (σ = 0.007) for a logistic regression based on the APACHE-II score. Adding information gathered on the second day of admission improved the maximum AUC to 0.895 (σ = 0.008). Beyond the second day, predictive ability declined. CONCLUSION:This has implications for decision making in intensive care and provides a justification for time-limited trials of ICU therapy; the assessment of prognosis over more than one day may be a valuable strategy as new information on the second day helps to differentiate outcomes. New ML models based on trend data beyond the first day could greatly improve upon current risk stratification tools.

Project description:Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal's own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000-1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50-250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.

Dataset Information

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction.

Background

Results

Conclusions

Publications

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets