Dataset Information

Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

ABSTRACT: Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.

SUBMITTER: Racz A

PROVIDER: S-EPMC7922354 | biostudies-literature | 2021 Feb

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

Rácz Anita A Bajusz Dávid D Héberger Károly K

Molecules (Basel, Switzerland) 20210219 4

Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the ...[more]

PMID: 33669834

Similar Datasets

Project description:Motor imagery (MI) allows the design of self-paced brain-computer interfaces (BCIs), which can potentially afford an intuitive and continuous interaction. However, the implementation of non-invasive MI-based BCIs with more than three commands is still a difficult task. First, the number of MIs for decoding different actions is limited by the constraint of maintaining an adequate spacing among the corresponding sources, since the electroencephalography (EEG) activity from near regions may add up. Second, EEG generates a rather noisy image of brain activity, which results in a poor classification performance. Here, we propose a solution to address the limitation of identifiable motor activities by using combined MIs (i.e., MIs involving 2 or more body parts at the same time). And we propose two new multilabel uses of the Common Spatial Pattern (CSP) algorithm to optimize the signal-to-noise ratio, namely MC2CMI and MC2SMI approaches. We recorded EEG signals from seven healthy subjects during an 8-class EEG experiment including the rest condition and all possible combinations using the left hand, right hand, and feet. The proposed multilabel approaches convert the original 8-class problem into a set of three binary problems to facilitate the use of the CSP algorithm. In the case of the MC2CMI method, each binary problem groups together in one class all the MIs engaging one of the three selected body parts, while the rest of MIs that do not engage the same body part are grouped together in the second class. In this way, for each binary problem, the CSP algorithm produces features to determine if the specific body part is engaged in the task or not. Finally, three sets of features are merged together to predict the user intention by applying an 8-class linear discriminant analysis. The MC2SMI method is quite similar, the only difference is that any of the combined MIs is considered during the training phase, which drastically accelerates the calibration time. For all subjects, both the MC2CMI and the MC2SMI approaches reached a higher accuracy than the classic pair-wise (PW) and one-vs.-all (OVA) methods. Our results show that, when brain activity is properly modulated, multilabel approaches represent a very interesting solution to increase the number of commands, and thus to provide a better interaction.

Dataset Information

Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

Publications

Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets