Dataset Information

Transfer learning for a foundational chemistry model.


ABSTRACT: Data-driven chemistry has garnered much interest concurrent with improvements in hardware and the development of new machine learning models. However, obtaining sufficiently large, accurate datasets of a desired chemical outcome for data-driven chemistry remains a challenge. The community has made significant efforts to democratize and curate available information for more facile machine learning applications, but the limiting factor is usually the laborious nature of generating large-scale data. Transfer learning has been noted in certain applications to alleviate some of the data burden, but this protocol is typically carried out on a case-by-case basis, with the transfer learning task expertly chosen to fit the finetuning. Herein, I develop a machine learning framework capable of accurate chemistry-relevant prediction amid general sources of low data. First, a chemical "foundational model" is trained using a dataset of ∼1 million experimental organic crystal structures. A task-specific module is then stacked atop this foundational model and subjected to finetuning. This approach achieves state-of-the-art performance on a diverse set of tasks: toxicity prediction, yield prediction, and odor prediction.
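
The two-stage pattern described in the abstract (a pretrained base model with a task-specific module stacked on top and finetuned on a small dataset) can be illustrated with a minimal sketch. This is not the author's architecture or code: the fixed random projection `W_BASE` is a hypothetical stand-in for pretrained weights, the 16-dimensional input features and the synthetic labels are invented for illustration, and only the head is trained here, which is a simplification of finetuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained foundational model: a fixed
# projection followed by tanh. In a real setting these weights would come
# from pretraining on a large dataset; here they are random (assumption).
W_BASE = rng.normal(size=(16, 8)) / 4.0


def foundation_embed(x):
    """Map raw feature vectors of shape (n, 16) to embeddings of shape (n, 8)."""
    return np.tanh(x @ W_BASE)


def finetune_head(x, y, lr=0.5, steps=500):
    """Fit a logistic-regression head on top of the frozen embeddings."""
    z = foundation_embed(x)          # base model is frozen: no weight updates
    w = np.zeros(z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        grad = p - y                 # gradient of the log loss w.r.t. the logits
        w -= lr * (z.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_proba(x, w, b):
    """Predicted probability of the positive class for each row of x."""
    return 1.0 / (1.0 + np.exp(-(foundation_embed(x) @ w + b)))


# Small synthetic "low data" task: 40 labelled examples (illustrative only).
x_small = rng.normal(size=(40, 16))
y_small = (x_small[:, 0] > 0).astype(float)
w_head, b_head = finetune_head(x_small, y_small)
```

The same base embedding could be reused with a different head for each downstream task, which is the point of the pattern: only the small task module has to be fit to the scarce labelled data.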

SUBMITTER: King-Smith E 

PROVIDER: S-EPMC10988575 | biostudies-literature | 2024 Apr

REPOSITORIES: biostudies-literature

Publications

Transfer learning for a foundational chemistry model.

King-Smith Emma E  

Chemical Science, 2023-11-24, 14


Similar Datasets

| S-EPMC11233511 | biostudies-literature
| S-EPMC11558088 | biostudies-literature
| S-EPMC8124298 | biostudies-literature
| S-EPMC10926267 | biostudies-literature
| S-EPMC11784875 | biostudies-literature
| S-EPMC6713560 | biostudies-literature
| S-EPMC10403461 | biostudies-literature
| S-EPMC11194083 | biostudies-literature
| S-EPMC10720540 | biostudies-literature
| S-EPMC4349213 | biostudies-literature