Dataset Information

An evolutionary decomposition-based multi-objective feature selection for multi-label classification.

ABSTRACT: Data classification is a fundamental task in data mining. Within this field, the classification of multi-labeled data has received considerable attention in recent years. In such problems, each data entity can simultaneously belong to several categories. Multi-label classification is important because of many recent real-world applications in which each entity has more than one label. To improve the performance of multi-label classification, feature selection plays an important role. It involves identifying and removing irrelevant and redundant features that unnecessarily increase the dimensions of the search space for the classification problem. However, classification may fail if the number of relevant features is reduced too far. Thus, minimizing the number of features and maximizing the classification accuracy are two desirable but conflicting objectives in multi-label feature selection. In this article, we introduce a multi-objective optimization algorithm customized for selecting the features of multi-label data. The proposed algorithm is an enhanced variant of a decomposition-based multi-objective optimization approach, in which the multi-label feature selection problem is divided into single-objective subproblems that can be solved simultaneously using an evolutionary algorithm. This approach accelerates the optimization process and finds more diverse feature subsets. The proposed method benefits from a local search operator to find better solutions for each subproblem. We also define a pool of genetic operators to generate new feature subsets from the previous generation. To evaluate the performance of the proposed algorithm, we compare it with two other multi-objective feature selection approaches on eight real-world benchmark datasets that are commonly used for multi-label classification. The reported results for multi-objective evaluation measures, such as the hypervolume indicator and set coverage, illustrate an improvement in the results obtained by the proposed method. Moreover, the proposed method achieved better results in terms of classification accuracy with fewer features compared with state-of-the-art methods.
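
The decomposition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a Tchebycheff scalarization (one common choice in decomposition-based methods), evenly spaced weight vectors, an ideal point at the origin, and hypothetical objective values of the form (fraction of features kept, classification error), both minimized. All function names are illustrative.

```python
def weight_vectors(n_subproblems):
    """Evenly spaced weight vectors (w1, w2) with w1 + w2 = 1, one per subproblem."""
    if n_subproblems < 2:
        raise ValueError("need at least two subproblems")
    step = 1.0 / (n_subproblems - 1)
    return [(i * step, 1.0 - i * step) for i in range(n_subproblems)]


def tchebycheff(objectives, weights, ideal):
    """Scalarize a vector of minimized objectives into a single value.

    Assumed scalarization: max_i w_i * |f_i - z_i|, where z is the ideal point.
    """
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))


def best_per_subproblem(candidates, weights_list, ideal):
    """For each weight vector, pick the candidate with the lowest scalar value."""
    return [
        min(candidates, key=lambda c: tchebycheff(c, w, ideal))
        for w in weights_list
    ]


# Hypothetical candidates: (fraction of features kept, classification error).
candidates = [(0.1, 0.40), (0.5, 0.20), (0.9, 0.15)]
weights_list = weight_vectors(3)
winners = best_per_subproblem(candidates, weights_list, ideal=(0.0, 0.0))
```

Each weight vector defines one single-objective subproblem, so different subproblems can prefer different trade-offs between subset size and error. An evolutionary algorithm would additionally generate new candidates; here the candidate list is fixed for brevity.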

SUBMITTER: Asilian Bidgoli A

PROVIDER: S-EPMC7924502 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Background: In the field of computational biology, analyzing complex data helps to extract relevant biological information. Sample classification of gene expression data is one such popular bio-data analysis technique. However, the presence of a large number of irrelevant or redundant genes in expression data makes sample classification algorithms work inefficiently. Feature selection is a dimensionality reduction technique that helps to maximize the effectiveness of any sample classification algorithm. Recent advances in biotechnology have extended biological data to include multiple modalities or views. Different 'omics' resources capture various equally important biological properties of entities. However, most existing feature selection methodologies are biased towards considering only one of the multiple biological resources. Consequently, some crucial aspects of the available biological knowledge, which could further improve feature selection efficiency, may be ignored. Results: In the present work, we propose a Consensus Multi-View Multi-objective Clustering-based feature selection algorithm called CMVMC. Three controlled genomic and proteomic resources, namely gene expression, Gene Ontology (GO), and the protein-protein interaction network (PPIN), are utilized to build two independent views. The concept of multi-objective consensus clustering is applied within our proposed gene selection method to satisfy both incorporated views. Gene expression data sets of Multiple Tissues and Yeast from two different organisms (Homo sapiens and Saccharomyces cerevisiae, respectively) are chosen for experimental purposes. As the end product of CMVMC, a reduced set of relevant and non-redundant genes is found for each chosen data set. These genes finally participate in an effective sample classification. Conclusions: The experimental study on the chosen data sets shows that our proposed feature selection method improves the sample classification accuracy and significantly reduces the gene space. In the case of the Multiple Tissues data set, CMVMC reduces the number of genes (features) from 5565 to 41, with 92.73% sample classification accuracy. For the Yeast data set, the number of genes is reduced from 2884 to 10, with 95.84% sample classification accuracy. Two internal cluster validity indices, Silhouette and Davies-Bouldin (DB), and one external validity index, Classification Accuracy (CA), are chosen for the comparative study. The reported results are further validated through a well-known biological significance test and a visualization tool.

Project description: Epidemiological time series forecasting plays an important role in public health systems, because it allows managers to develop strategic plans to avoid possible epidemics. In this paper, a hybrid learning framework is developed to forecast multi-step-ahead (one-, two- and three-month-ahead) meningitis cases in four states of Brazil. First, the proposed approach applies ensemble empirical mode decomposition (EEMD) to decompose the data into intrinsic mode functions and residual components. Then, each component is used as the input of five different forecasting models, from which forecasted results are obtained. Finally, all combinations of models and components are developed, and for each case, the forecasted results are weighted integrated (WI) to formulate a heterogeneous ensemble forecaster for the monthly meningitis cases. In the final stage, a multi-objective optimization (MOO) using the Non-Dominated Sorting Genetic Algorithm, version II, is employed to find a set of candidate weights, and then the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is applied to choose an adequate set of weights. The most adequate model is then the one with the best out-of-sample generalization capacity in terms of performance criteria including mean absolute error (MAE), relative root mean squared error (RRMSE) and symmetric mean absolute percentage error (sMAPE). By using MOO, the intention is to enhance the performance of the forecasting models by simultaneously improving their accuracy and stability measures. To assess the model's performance, comparisons based on metrics are conducted with: (i) EEMD, heterogeneous ensemble integrated by direct strategy, or simple sum; (ii) EEMD, homogeneous ensemble of components WI; (iii) models without signal decomposition. At this stage, the MAE, RRMSE and sMAPE criteria and the Diebold-Mariano statistical test are adopted. In all twelve scenarios, the proposed framework produced more accurate and stable forecasts, and in 89.17% of the cases the errors of the proposed approach were statistically lower than those of the other approaches. These results show that combining EEMD, a heterogeneous ensemble and WI with weights obtained by optimization can produce precise and stable forecasts. The modelling developed in this paper is promising and can be used by managers to support decision making.
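
The three performance criteria named in this description can be sketched as follows. The description does not give the formulas, so the definitions here are assumptions: RRMSE is taken as the root mean squared error divided by the mean of the observed values, and sMAPE uses the variant with denominator |actual| + |predicted|. Other variants exist, and the sample numbers are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rrmse(actual, predicted):
    """Relative RMSE: RMSE divided by the mean of the observed values (assumed form)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (sum(actual) / len(actual))


def smape(actual, predicted):
    """Symmetric MAPE as a fraction: mean of 2|a - p| / (|a| + |p|) (assumed form)."""
    terms = [
        2 * abs(a - p) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
    ]
    return sum(terms) / len(terms)


# Hypothetical monthly case counts and forecasts, for illustration only.
actual = [10, 20, 30]
predicted = [12, 18, 33]
scores = (mae(actual, predicted), rrmse(actual, predicted), smape(actual, predicted))
```

Lower values are better for all three criteria. This sMAPE variant is undefined when an observed value and its forecast are both zero, so such months would need separate handling.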

