Dataset Information

Implicitly perturbed Hamiltonian as a class of versatile and general-purpose molecular representations for machine learning.

ABSTRACT: Unraveling challenging problems by machine learning has recently become a hot topic in many scientific disciplines. In developing rigorous machine-learning models to study problems of interest in the molecular sciences, translating molecular structures into quantitative representations suitable as machine-learning inputs plays a central role. Many molecular representations, including state-of-the-art ones, although efficient in studying numerous molecular features, remain suboptimal in many challenging cases, as discussed in the context of the present research. The main aim of the present study is to introduce the Implicitly Perturbed Hamiltonian (ImPerHam) as a class of versatile representations for more efficient machine learning of challenging problems in the molecular sciences. ImPerHam representations are defined as energy attributes of the molecular Hamiltonian, implicitly perturbed by a number of hypothetical or real arbitrary solvents based on continuum solvation models. We demonstrate the outstanding performance of machine-learning models based on ImPerHam representations for three diverse and challenging cases: predicting inhibition of the CYP450 enzyme, high-precision and transferable evaluation of non-covalent interaction energies of molecular systems, and accurate reproduction of solvation free energies for large benchmark sets.

SUBMITTER: Alibakhshi A

PROVIDER: S-EPMC8913769 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

A general-purpose machine-learning force field for bulk and nanostructured phosphorus.

Project description:Elemental phosphorus is attracting growing interest across fundamental and applied fields of research. However, atomistic simulations of phosphorus have remained an outstanding challenge. Here, we show that a universally applicable force field for phosphorus can be created by machine learning (ML) from a suitably chosen ensemble of quantum-mechanical results. Our model is fitted to density-functional theory plus many-body dispersion (DFT + MBD) data; its accuracy is demonstrated for the exfoliation of black and violet phosphorus (yielding monolayers of "phosphorene" and "hittorfene"); its transferability is shown for the transition between the molecular and network liquid phases. An application to a phosphorene nanoribbon on an experimentally relevant length scale exemplifies the power of accurate and flexible ML-driven force fields for next-generation materials modelling. The methodology promises new insights into phosphorus as well as other structurally complex, e.g., layered solids that are relevant in diverse areas of chemistry, physics, and materials science.

| S-EPMC7596484 | biostudies-literature

Learning cortical representations through perturbed and adversarial dreaming.

Project description:Humans and other animals learn to extract general concepts from sensory experience without extensive teaching. This ability is thought to be facilitated by offline states like sleep where previous experiences are systemically replayed. However, the characteristic creative nature of dreams suggests that learning semantic representations may go beyond merely replaying previous experiences. We support this hypothesis by implementing a cortical architecture inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, non-rapid eye movement (NREM), and REM sleep, optimizing different, but complementary, objective functions. We train the model on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that generating new, virtual sensory inputs via adversarial dreaming during REM sleep is essential for extracting semantic concepts, while replaying episodic memories via perturbed dreaming during NREM sleep improves the robustness of latent representations. The model provides a new computational perspective on sleep states, memory replay, and dreams, and suggests a cortical implementation of GANs.

| S-EPMC9071267 | biostudies-literature

Machine Learning C-N Couplings: Obstacles for a General-Purpose Reaction Yield Prediction.

Project description:Pd-catalyzed C-N couplings are commonplace in academia and industry. Despite their significance, finding suitable reaction conditions leading to a high yield, for instance, remains a challenging and time-consuming task which usually requires screening over many sets of conditions. To help select promising reaction conditions in the vast space of reagent combinations, machine learning is an emerging technique with a lot of promise. In this work, we assess whether the reaction yield of C-N couplings can be predicted from databases of chemical reactions. We test the generalizability of models both on challenging data splits and on a dedicated experimental test set. We find that, provided the chemical space represented by the training set is not left, the models perform well. However, the applicability domain is quickly left even for simple reactions of the same type, as, for instance, present in our plate test set. The results show that yield prediction for new reactions is possible from the algorithmic side but in practice is hindered by the available data. Most importantly, more data that cover the diversity in reagents are needed for a general-purpose prediction of reaction yields. Our findings also expose a challenge to this field in that it appears to be extremely deceiving to judge models based on literature data with test sets which are split off the same literature data, even when challenging splits are considered.

| S-EPMC9878668 | biostudies-literature

SPAHM(a,b): Encoding the Density Information from Guess Hamiltonian in Quantum Machine Learning Representations.

Project description:Recently, we introduced a class of molecular representations for kernel-based regression methods, the spectrum of approximated Hamiltonian matrices (SPAHM), that takes advantage of lightweight one-electron Hamiltonians traditionally used as a self-consistent field initial guess. The original SPAHM variant is built from occupied-orbital energies (i.e., eigenvalues) and naturally contains all of the information about nuclear charges, atomic positions, and symmetry requirements. Its advantages were demonstrated on data sets featuring a wide variation of charge and spin, for which traditional structure-based representations commonly fail. The SPAHM(a,b) representations introduced here expand the eigenvalue SPAHM into local and transferable representations. They rely upon one-electron density matrices to build fingerprints from atomic and bond density overlap contributions inspired by preceding state-of-the-art representations. The performance and efficiency of SPAHM(a,b) are assessed on predictions for data sets of prototypical organic molecules (QM7) of different charges and azoheteroarene dyes in an excited state. Overall, both SPAHM(a) and SPAHM(b) outperform state-of-the-art representations on difficult prediction tasks such as the atomic properties of charged open-shell species and of π-conjugated systems.

| S-EPMC10867806 | biostudies-literature

Does sleep facilitate the consolidation of allocentric or egocentric representations of implicitly learned visual-motor sequence learning?

Project description:Sleep facilitates the consolidation (i.e., enhancement) of simple, explicit (i.e., conscious) motor sequence learning (MSL). MSL can be dissociated into egocentric (i.e., motor) or allocentric (i.e., spatial) frames of reference. The consolidation of the allocentric memory representation is sleep-dependent, whereas the egocentric consolidation process is independent of sleep or wake for explicit MSL. However, the extent to which sleep contributes to the consolidation of implicit (i.e., unconscious) MSL remains unclear, nor is it known what aspects of the memory representation (egocentric, allocentric) are consolidated by sleep. Here, we investigated the extent to which sleep is involved in consolidating implicit MSL, specifically, whether the egocentric or the allocentric cognitive representations of a learned sequence are enhanced by sleep, and whether these changes support the development of explicit sequence knowledge across sleep but not wake. Our results indicate that egocentric and allocentric representations can be behaviorally dissociated for implicit MSL. Neither representation was preferentially enhanced across sleep, nor were developments of explicit awareness observed. However, after a 1-wk interval, performance enhancement was observed in the egocentric representation. Taken together, these results suggest that, like explicit MSL, implicit MSL has dissociable allocentric and egocentric representations, but unlike explicit sequence learning, implicit egocentric and allocentric memory consolidation is independent of sleep, and the time-course of consolidation differs significantly.

| S-EPMC5772393 | biostudies-literature

A General-Purpose Machine Learning R Library for Sparse Kernels Methods With an Application for Genome-Based Prediction

Project description:The adoption of machine learning frameworks in areas beyond computer science has been facilitated by the development of user-friendly software tools that do not require an advanced understanding of computer programming. In this paper, we present a new software package (sparse kernel methods, SKM) developed in the R language for implementing six of the most popular supervised machine learning algorithms (generalized boosted machines, generalized linear models, support vector machines, random forest, Bayesian regression models and deep neural networks) with the optional use of sparse kernels. The SKM focuses on user simplicity, as it does not try to include all the available machine learning algorithms, but rather the most important aspects of these six algorithms in an easy-to-understand format. Another relevant contribution of this package is a function for the computation of seven different kernels. These are Linear, Polynomial, Sigmoid, Gaussian, Exponential, Arc-Cosine 1 and Arc-Cosine L (with L = 2, 3, … ) and their sparse versions, which allow users to create kernel machines without modifying the statistical machine learning algorithm. It is important to point out that the main contribution of our package resides in the functionality for the computation of the sparse version of seven basic kernels, which is indispensable for reducing the computational resources needed to implement kernel machine learning methods without a significant loss in prediction performance. Performance of the SKM is evaluated in a genome-based prediction framework using both a maize and a wheat data set. The use of this package is, however, not restricted to genome prediction problems, and it can be used in many different applications.

| S-EPMC9205295 | biostudies-literature

Hamiltonian-Reservoir Replica Exchange and Machine Learning Potentials for Computational Organic Chemistry.

Project description:This work combines a machine learning potential energy function with a modular enhanced sampling scheme to obtain statistically converged thermodynamical properties of flexible medium-size organic molecules at high ab initio level. We offer a modular environment in the Python package MORESIM that allows custom design of replica exchange simulations with any level of theory, including ML-based potentials. Our specific combination of Hamiltonian and reservoir replica exchange is shown to be a powerful technique to accelerate enhanced sampling simulations and explore free energy landscapes with a quantum chemical accuracy unattainable otherwise (e.g., DLPNO-CCSD(T)/CBS quality). This engine is used to demonstrate the relevance of accessing the ab initio free energy landscapes of molecules whose stability is determined by a subtle interplay between variations in the underlying potential energy and conformational entropy (i.e., a bridged asymmetrically polarized dithiacyclophane and a widely used organocatalyst) both in the gas phase and in solution (implicit solvent).

| S-EPMC7704029 | biostudies-literature

Deep learning encodes robust discriminative neuroimaging representations to outperform standard machine learning.

Project description:Recent critical commentaries unfavorably compare deep learning (DL) with standard machine learning (SML) approaches for brain imaging data analysis. However, their conclusions are often based on pre-engineered features depriving DL of its main advantage - representation learning. We conduct a large-scale systematic comparison profiled in multiple classification and regression tasks on structural MRI images and show the importance of representation learning for DL. Results show that if trained following prevalent DL practices, DL methods have the potential to scale particularly well and substantially improve compared to SML methods, while also presenting a lower asymptotic complexity in relative computational time, despite being more complex. We also demonstrate that DL embeddings span comprehensible task-specific projection spectra and that DL consistently localizes task-discriminative brain biomarkers. Our findings highlight the presence of nonlinearities in neuroimaging data that DL can exploit to generate superior task-discriminative representations for characterizing the human brain.

| S-EPMC7806588 | biostudies-literature

Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting.

Project description:Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, or (3) splitting those observations into two observations: one where the event occurs and one where the event does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data, including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.

| S-EPMC4893987 | biostudies-literature

Reaction-based machine learning representations for predicting the enantioselectivity of organocatalysts.

Project description:Hundreds of catalytic methods are developed each year to meet the demand for high-purity chiral compounds. The computational design of enantioselective organocatalysts remains a significant challenge, as catalysts are typically discovered through experimental screening. Recent advances in combining quantum chemical computations and machine learning (ML) hold great potential to propel the next leap forward in asymmetric catalysis. Within the context of quantum chemical machine learning (QML, or atomistic ML), the ML representations used to encode the three-dimensional structure of molecules and evaluate their similarity cannot easily capture the subtle energy differences that govern enantioselectivity. Here, we present a general strategy for improving molecular representations within an atomistic machine learning model to predict the DFT-computed enantiomeric excess of asymmetric propargylation organocatalysts solely from the structure of catalytic cycle intermediates. Mean absolute errors as low as 0.25 kcal mol⁻¹ were achieved in predictions of the activation energy with respect to DFT computations. By virtue of its design, this strategy is generalisable to other ML models, to experimental data and to any catalytic asymmetric reaction, enabling the rapid screening of structurally diverse organocatalysts from available structural information.

| S-EPMC8153079 | biostudies-literature
