Dataset Information

Dataset's chemical diversity limits the generalizability of machine learning predictions.


ABSTRACT: The QM9 dataset has become the gold standard for machine learning (ML) predictions of various chemical properties. QM9 is based on GDB, a combinatorial exploration of chemical space. ML predictions of molecular properties have recently been published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. This article presents PC9, a new QM9-equivalent dataset drawn from the PubChemQC project (only H, C, N, O and F, and up to 9 "heavy" atoms). A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the neural network model provided by SchNet were applied to both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset.

SUBMITTER: Glavatskikh M 

PROVIDER: S-EPMC6852905 | biostudies-literature | 2019 Nov

REPOSITORIES: biostudies-literature

Publications

Dataset's chemical diversity limits the generalizability of machine learning predictions.

Marta Glavatskikh, Jules Leguy, Gilles Hunault, Thomas Cauchy, Benoit Da Mota

Journal of Cheminformatics, 2019-11-12 (issue 1)



Similar Datasets

| S-EPMC10915406 | biostudies-literature
| S-EPMC5541539 | biostudies-other
| S-EPMC4476293 | biostudies-literature
| S-EPMC4114329 | biostudies-literature
| S-EPMC3786293 | biostudies-literature
2023-04-06 | GSE205588 | GEO
| S-EPMC8322325 | biostudies-literature
| S-EPMC9174159 | biostudies-literature
| S-EPMC5042084 | biostudies-literature
| S-EPMC6369796 | biostudies-literature