Dataset Information

An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.

ABSTRACT: High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also applied to the same datasets. Our results show that the former (GLMBoost+SMOTE) not only achieves higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF+SMOTE). We therefore hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem.
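The two ideas the abstract relies on, SMOTE over-sampling and the Sensitivity/Gmean metrics, can be illustrated with a short sketch. This is not the authors' implementation and does not reproduce GLMBoost. It assumes the usual SMOTE formulation (interpolating between a minority sample and one of its nearest minority neighbours) and the common definition of Gmean as the square root of sensitivity times specificity. The function names, the `k` and `seed` parameters, and the toy labels are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority points (assumed SMOTE formulation).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, Gmean) for binary labels.

    Assumes both classes occur in y_true; otherwise a division by zero
    is raised.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, math.sqrt(sens * spec)


# Toy data: 2 actives (label 1) among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A classifier that always predicts "inactive" is right on 8 of 10
# compounds, yet its sensitivity and Gmean are both zero.
always_inactive = [0] * 10
sens, spec, gmean = sensitivity_specificity_gmean(y_true, always_inactive)

# Over-sample a tiny 2-D minority class with four synthetic points.
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(actives, n_new=4, k=2)
```

The always-inactive case shows why plain accuracy is misleading on imbalanced screening data: Gmean drops to zero as soon as one class is never predicted correctly.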

SUBMITTER: Hao M

PROVIDER: S-EPMC3884825 | biostudies-literature | 2014 Jan

REPOSITORIES: biostudies-literature

Publications

An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.

Hao M, Wang Y, Bryant SH

Analytica chimica acta, 2013 Nov 6

PMID: 24331047

Similar Datasets

PubChem BioAssay: 2017 update. | S-EPMC5210581 | biostudies-literature

STB: synthetic minority oversampling technique for tree-boosting models for imbalanced datasets of intrusion detection systems. | S-EPMC10703015 | biostudies-literature

Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment. | S-EPMC7219047 | biostudies-literature

A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data. | S-EPMC10908853 | biostudies-literature

Predicting adverse drug reactions using publicly available PubChem BioAssay data. | S-EPMC3464971 | biostudies-literature

QSAR modeling of imbalanced high-throughput screening data in PubChem. | S-EPMC3985743 | biostudies-other

Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis. | S-EPMC7248318 | biostudies-literature

PubChem BioAssay: A Decade's Development toward Open High-Throughput Screening Data Sharing. | S-EPMC5480605 | biostudies-literature

Benchmarking Data Sets from PubChem BioAssay Data: Current Scenario and Room for Improvement. | S-EPMC7352161 | biostudies-literature

Tuning gradient boosting for imbalanced bioassay modelling with custom loss functions. | S-EPMC9650867 | biostudies-literature