Dataset Information

Prediction of future customer needs using machine learning across multiple product categories

ABSTRACT:

SUBMITTER: Kilroy D

PROVIDER: S-EPMC11346667 | biostudies-literature | 2024 Jan

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Background: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction. Methods: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure-activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN). Results: The three methods delivered comparable performances on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and performance of our models compared favorably to the commercial products. Conclusions: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
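The two metrics quoted in the Results (RMSE and R²) can be computed as in the minimal sketch below. It is only an illustration of the standard definitions, not the authors' evaluation code; the function names and the sample pKa values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental vs. predicted pKa values (illustration only).
experimental = [4.8, 9.2, 3.1, 10.5]
predicted = [5.0, 8.7, 3.9, 10.1]

print(f"RMSE = {rmse(experimental, predicted):.3f}")
print(f"R2   = {r_squared(experimental, predicted):.3f}")
```

RMSE is in the same units as the target (here, pKa units), so a value around 1.5 means typical errors of roughly one and a half pKa units; R² is unitless and reports the fraction of variance in the experimental values that the predictions explain.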

Project description: BACKGROUND: Prognostication is an essential tool for risk adjustment and decision making in the intensive care unit (ICU). Research into prognostication in ICU has so far been limited to data from admission or the first 24 hours. Most ICU admissions last longer than this, decisions are made throughout an admission, and some admissions are explicitly intended as time-limited prognostic trials. Despite this, temporal changes in prognostic ability during ICU admission have received little attention to date. Current predictive models, in the form of prognostic clinical tools, are typically derived from linear models and do not explicitly handle incremental information from trends. Machine learning (ML) allows predictive models to be developed which use non-linear predictors and complex interactions between variables, thus allowing incorporation of trends in measured variables over time; this has made it possible to investigate prognosis throughout an admission. METHODS AND FINDINGS: This study uses ML to assess the predictability of ICU mortality as a function of time. Logistic regression against physiological data alone outperformed APACHE-II and demonstrated several important interactions, including between lactate and noradrenaline dose, between lactate and MAP, and between age and MAP, consistent with the current sepsis definitions. ML models consistently outperformed logistic regression, with Deep Learning giving the best results. Predictive power was maximal on the second day and was further improved by incorporating trend data. Using a limited range of physiological and demographic variables, the best machine learning model on the first day showed an area under the receiver-operator characteristic curve (AUC) of 0.883 (σ = 0.008), compared to 0.846 (σ = 0.010) for a logistic regression from the same predictors and 0.836 (σ = 0.007) for a logistic regression based on the APACHE-II score. Adding information gathered on the second day of admission improved the maximum AUC to 0.895 (σ = 0.008). Beyond the second day, predictive ability declined. CONCLUSION: This has implications for decision making in intensive care and provides a justification for time-limited trials of ICU therapy; the assessment of prognosis over more than one day may be a valuable strategy, as new information on the second day helps to differentiate outcomes. New ML models based on trend data beyond the first day could greatly improve upon current risk stratification tools.
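The AUC figures above can be read as the probability that a randomly chosen non-survivor receives a higher predicted risk than a randomly chosen survivor. The sketch below computes AUC directly from that pairwise definition; it is an illustration under that assumption, not the study's code, and the function name, labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. died), 0 for the negative class.
    scores: predicted risk; higher means more likely positive.
    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical outcomes and predicted mortality risks (illustration only).
outcomes = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(outcomes, risks))
```

The pairwise loop is quadratic in the number of cases, which is fine for a small illustration; rank-based formulations give the same value more efficiently on large cohorts.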

