Dataset Information

Classification of microarrays; synergistic effects between normalization, gene selection and machine learning.

ABSTRACT: BACKGROUND: Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is the result of a series of analysis steps, of which the most important are data normalization, gene selection and machine learning. RESULTS: In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods. CONCLUSION: Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the T-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
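
The double cross validation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the fold counts or software used, so the 5 outer and 3 inner folds, the synthetic data, and all function names below are assumptions made for illustration. A nearest-centroid rule stands in for the Support Vector Machines so that the example needs only the Python standard library. The point it shows is that t-test gene selection and the choice of the number of genes happen inside the training folds only, so the outer test samples never influence either step.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic; returns 0.0 when a group is too small or has no spread."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def top_genes(X, y, k):
    """Rank genes by |t| between class 0 and class 1 and return the top k indices."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def centroid_predict(X_tr, y_tr, x, genes):
    """Nearest-centroid classifier (a stand-in for an SVM) on the selected genes."""
    best, best_d = None, float("inf")
    for lab in (0, 1):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        if not rows:
            continue
        d = sum((x[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if d < best_d:
            best, best_d = lab, d
    return best

def folds(indices, n_folds):
    """Split a list of sample indices into n_folds interleaved parts."""
    return [indices[i::n_folds] for i in range(n_folds)]

def fold_errors(X, y, train, test, k):
    """Select k genes on the training samples, then count test misclassifications."""
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    genes = top_genes(X_tr, y_tr, k)
    return sum(centroid_predict(X_tr, y_tr, X[i], genes) != y[i] for i in test)

def cv_error(X, y, idx, k, n_folds):
    """Plain cross-validated error rate over the samples in idx for a fixed k."""
    errors = 0
    for test in folds(idx, n_folds):
        train = [i for i in idx if i not in test]
        errors += fold_errors(X, y, train, test, k)
    return errors / len(idx)

def double_cv_error(X, y, k_grid, n_outer=5, n_inner=3, seed=0):
    """Outer loop estimates error; inner loop picks the number of genes."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for test in folds(idx, n_outer):
        train = [i for i in idx if i not in test]
        # The inner cross validation sees the outer training samples only.
        best_k = min(k_grid, key=lambda k: cv_error(X, y, train, k, n_inner))
        errors += fold_errors(X, y, train, test, best_k)
    return errors / len(idx)

def make_demo_data(n_per_class=20, n_genes=30, n_informative=3, shift=4.0, seed=1):
    """Synthetic expression matrix: the first n_informative genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += shift * lab
            X.append(row)
            y.append(lab)
    return X, y

X, y = make_demo_data()
print("estimated error rate:", double_cv_error(X, y, k_grid=[2, 5, 10]))
```

Selecting genes on the full data set before cross validation would leak information from the test samples into the classifier and typically gives an optimistic error estimate, which is why the selection is repeated inside every fold here.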

SUBMITTER: Onskog J

PROVIDER: S-EPMC3229535 | biostudies-literature | 2011

REPOSITORIES: biostudies-literature

Publications

Classification of microarrays; synergistic effects between normalization, gene selection and machine learning.

Önskog J, Freyhult E, Landfors M, Rydén P, Hvidsten TR

BMC Bioinformatics, 2011-10-07

PMID: 21982277

Similar Datasets

Project description: In recent years, functional brain network topological features have been widely used as classification features. Previous studies have found that network node scale differences caused by different network parcellation definitions significantly affect the structure of the constructed network and its topological properties. However, we still do not know how network scale differences affect the classification accuracy, performance of classification features, and effectiveness of the feature selection strategy using P values in terms of the machine learning method. This study used five scale parcellations, involving 90, 256, 497, 1003, and 1501 nodes. Three local properties of resting-state functional brain networks were selected (degree, betweenness centrality, and nodal efficiency), and the support vector machine method was used to construct classifiers to identify patients with major depressive disorder. We analyzed the impact of the five scales on classification accuracy. In addition, the effectiveness and redundancy of features obtained by the different scale parcellations were compared. Finally, traditional statistical significance (P value) was verified as a feature selection criterion. The results showed that the feature effectiveness of different scales was similar; in other words, parcellation with more regions did not provide more effective discriminative features. Nevertheless, parcellation with more regions did provide a greater quantity of discriminative features, which led to an improvement in the accuracy of the classification. However, due to the close distance between brain regions, the redundancy of parcellation with more regions was also greater. The traditional P value feature selection strategy is feasible with different scales, but our analysis showed that the traditional P < 0.05 threshold was too strict for feature selection. This study provides an important reference for the selection of network scales when applying topological properties of brain networks to machine learning methods.

Project description: MOTIVATION: Cross-(multi)platform normalization of gene-expression microarray data remains an unresolved issue. Despite the existence of several algorithms, they are either constrained by the need to normalize all samples of all platforms together, compromising scalability and reuse, by adherence to the platforms of a specific provider, or simply by poor performance. In addition, many of the methods presented in the literature have not been specifically tested against multi-platform data and/or other methods applicable in this context. Thus, we set out to develop a normalization algorithm appropriate for gene-expression studies based on multiple, potentially large microarray sets collected along multiple platforms and at different times, applicable in systematic studies aimed at extracting knowledge from the wealth of microarray data available in public repositories; for example, for the extraction of Real-World Data to complement data from Randomized Controlled Trials. Our main focus or criterion for performance was on the capacity of the algorithm to properly separate samples from different biological groups. RESULTS: We present CuBlock, an algorithm addressing this objective, together with a strategy to validate cross-platform normalization methods. To validate the algorithm and benchmark it against existing methods, we used two distinct datasets, one specifically generated for testing and standardization purposes and one from an actual experimental study. Using these datasets, we benchmarked CuBlock against ComBat (Johnson et al., 2007), UPC (Piccolo et al., 2013), YuGene (Lê Cao et al., 2014), DBNorm (Meng et al., 2017), Shambhala (Borisov et al., 2019) and a simple log2 transform as reference. We note that many other popular normalization methods are not applicable in this context. CuBlock was the only algorithm in this group that could always and clearly differentiate the underlying biological groups after mixing the data, from up to six different platforms in this study. AVAILABILITY AND IMPLEMENTATION: CuBlock can be downloaded from https://www.mathworks.com/matlabcentral/fileexchange/77882-cublock. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
