
Dataset Information


Boosting Tree-Assisted Multitask Deep Learning for Small Scientific Datasets.


ABSTRACT: Machine learning approaches have had tremendous success in various disciplines. However, such success depends heavily on the size and quality of datasets. Scientific datasets are often small and difficult to collect. Improving machine learning performance on small scientific datasets remains a major challenge in many academic fields, such as bioinformatics and medical science. Gradient boosting decision trees (GBDT) are typically optimal for small datasets, while deep learning often performs better for large datasets. This work reports a boosting tree-assisted multitask deep learning (BTAMDL) architecture that integrates GBDT and multitask deep learning (MDL) to achieve near-optimal predictions for small datasets when a large dataset exists that is well correlated with them. Two BTAMDL models are constructed: one uses purely MDL output as GBDT input, while the other admits additional features in the GBDT input. The proposed BTAMDL models are validated on four categories of datasets: toxicity, partition coefficient, solubility, and solvation. The proposed BTAMDL models are found to outperform the current state-of-the-art methods in various applications involving small datasets.
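
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic data, the network size, the helper names (`make_task`, `train_mdl`, `fit_stump`, `fit_gbdt`, `predict_gbdt`), and the choice to treat the shared hidden layer as the "MDL output" are all assumptions made for illustration. Boosted depth-1 trees written in NumPy stand in for a full GBDT library.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, shift):
    # Hypothetical synthetic data; the two tasks share most of their structure.
    X = rng.normal(size=(n, 5))
    return X, np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + shift * X[:, 2]

def train_mdl(tasks, n_hidden=16, epochs=500, lr=0.02):
    """Multitask net: one shared tanh layer, one linear head per task."""
    d = tasks[0][0].shape[1]
    W = rng.normal(scale=0.5, size=(d, n_hidden))
    b = np.zeros(n_hidden)
    heads = [[rng.normal(scale=0.1, size=n_hidden), 0.0] for _ in tasks]
    for _ in range(epochs):
        gW, gb = np.zeros_like(W), np.zeros_like(b)
        for (X, y), head in zip(tasks, heads):
            H = np.tanh(X @ W + b)
            err = (H @ head[0] + head[1] - y) / len(y)  # per-task mean loss gradient
            dH = np.outer(err, head[0]) * (1 - H ** 2)  # backprop through tanh
            gW += X.T @ dH
            gb += dH.sum(axis=0)
            head[0] -= lr * (H.T @ err)
            head[1] -= lr * err.sum()
        W -= lr * gW  # shared layer sees gradients from every task
        b -= lr * gb
    return W, b, heads

def fit_stump(X, r):
    """Best single-split regression tree (stump) for residuals r."""
    best = (np.inf, 0, 0.0, 0.0, 0.0)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r - np.where(left, lv, rv)) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]

def fit_gbdt(X, y, n_rounds=100, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        j, t, lv, rv = fit_stump(X, y - pred)
        pred = pred + lr * np.where(X[:, j] <= t, lv, rv)
        stumps.append((j, t, lv, rv))
    return base, lr, stumps

def predict_gbdt(model, X):
    base, lr, stumps = model
    pred = np.full(X.shape[0], base)
    for j, t, lv, rv in stumps:
        pred = pred + lr * np.where(X[:, j] <= t, lv, rv)
    return pred

X_large, y_large = make_task(400, 0.0)  # large, well-correlated dataset
X_small, y_small = make_task(40, 0.3)   # small target dataset
W, b, _ = train_mdl([(X_large, y_large), (X_small, y_small)])
H_small = np.tanh(X_small @ W + b)      # "MDL output" (assumed: shared hidden layer)

model_1 = fit_gbdt(H_small, y_small)                        # MDL output only
model_2 = fit_gbdt(np.hstack([H_small, X_small]), y_small)  # plus additional features
mse_1 = np.mean((predict_gbdt(model_1, H_small) - y_small) ** 2)
mse_2 = np.mean((predict_gbdt(model_2, np.hstack([H_small, X_small])) - y_small) ** 2)
print("training MSE, model 1:", mse_1, "model 2:", mse_2)
```

The sketch reports training error only. A real comparison would hold out part of the small dataset and would likely use an established GBDT implementation with deeper trees.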

SUBMITTER: Jiang J 

PROVIDER: S-EPMC7350172 | biostudies-literature | 2020 Mar

REPOSITORIES: biostudies-literature


Publications

Boosting Tree-Assisted Multitask Deep Learning for Small Scientific Datasets.

Jiang J, Wang R, Wang M, Gao K, Nguyen DD, Wei GW

Journal of Chemical Information and Modeling, 2020-02-03, issue 3



Similar Datasets

S-EPMC9045985 | biostudies-literature
S-EPMC8956542 | biostudies-literature
S-EPMC8154128 | biostudies-literature
S-EPMC9278518 | biostudies-literature
S-EPMC8515573 | biostudies-literature
S-EPMC7957953 | biostudies-literature
S-EPMC7093010 | biostudies-literature
S-EPMC8369530 | biostudies-literature
S-EPMC10703015 | biostudies-literature
S-EPMC5480319 | biostudies-other