Dataset Information

A semi-supervised learning framework for quantitative structure-activity regression modelling.

ABSTRACT:

Motivation

Quantitative structure-activity relationship (QSAR) methods are increasingly used in assisting the process of preclinical, small molecule drug discovery. Regression models are trained on data consisting of a finite-dimensional representation of molecular structures and their corresponding target-specific activities. These supervised learning models can then be used to predict the activity of previously unmeasured novel compounds.

Results

This work provides methods that address three problems in QSAR modelling: (i) a method for comparing the information content of finite-dimensional representations of molecular structures (fingerprints) with respect to the target of interest; (ii) a method that quantifies how the accuracy of the model prediction degrades as a function of the distance between the testing and training data; and (iii) a method to adjust for the screening-dependent selection bias inherent in many training datasets. For example, in the most extreme cases, only compounds that pass an activity-dependent screening threshold are reported. A semi-supervised learning framework combines (ii) and (iii) to make predictions that take into account the similarity of the testing compounds to those in the training data and adjust for the reporting selection bias. We illustrate the three methods using publicly available structure-activity data for a large set of compounds reported by GlaxoSmithKline (the Tres Cantos AntiMalarial Set, TCAMS) to inhibit asexual in vitro Plasmodium falciparum growth.

Availability and implementation

https://github.com/owatson/PenalizedPrediction.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Watson O

PROVIDER: S-EPMC8058768 | biostudies-literature | 2021 Apr

REPOSITORIES: biostudies-literature

Publications

A semi-supervised learning framework for quantitative structure-activity regression modelling.

Watson O, Cortes-Ciriano I, Watson JA

Bioinformatics (Oxford, England), 2021-04-01, issue 3

PMID: 32777821

Similar Datasets

Learning to propagate labels on graphs: An iterative multitask regression framework for semi-supervised hyperspectral dimensionality reduction.
| S-EPMC6894308 | biostudies-literature

Solo: doublet identification via semi-supervised deep learning
2019-11-13 | GSE140262 | GEO

Semi-supervised learning framework for oil and gas pipeline failure detection.
| S-EPMC9374783 | biostudies-literature

Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework.
| S-EPMC10802372 | biostudies-literature

Semi-supervised empirical Bayes group-regularized factor regression.
| S-EPMC9796498 | biostudies-literature

Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework.
| S-EPMC8570780 | biostudies-literature

meth-SemiCancer: a cancer subtype classification framework via semi-supervised learning utilizing DNA methylation profiles
| S-EPMC10131478 | biostudies-literature

Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes.
| S-EPMC8520253 | biostudies-literature

Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics.
| S-EPMC7613899 | biostudies-literature

Solo: doublet identification via semi-supervised deep learning
| PRJNA589061 | ENA