Dataset Information

Improved cytokine-receptor interaction prediction by exploiting the negative sample space.

ABSTRACT:

Background

Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine-receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases (notably autoimmune, inflammatory and infectious diseases) and for identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. However, "gold standard" negative datasets are still lacking, and strong biases in negative datasets can significantly affect both the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the selection of negative samples (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection.

Results

We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and performance of machine learning classifiers. Using the anomaly detection capabilities of deep autoencoders, we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling of non-interacting pairs results in either over- or under-representation of hard-to-classify or easy-to-classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross-validation (LOOCV), taken over the ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+5.1%), specificity (+13%), MCC (+0.1) and g-means value (+5.1). Evaluations using tenfold cross-validation and training/testing splits confirm the competitive performance.
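The four measures reported above can be computed from a confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics` is hypothetical, and the g-means value is assumed to be the commonly used geometric mean of sensitivity and specificity, which the abstract does not define.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, MCC and g-mean
    from confusion-matrix counts (interacting pairs = positive class)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of interacting pairs predicted as interacting.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-interacting pairs predicted as such.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "g_mean": g_mean,
    }


# Illustrative counts only; these are not results from the study.
scores = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because specificity and MCC both depend on the true-negative and false-positive counts, they are the measures most directly affected by how the negative set is chosen.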

Conclusions

A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections, with RF seemingly benefiting most in our particular setting. Our findings on sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
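To illustrate the idea of K-means based negative sampling, the sketch below clusters candidate negative feature vectors and keeps one representative per cluster. It is an illustration under stated assumptions, not the authors' procedure: the function name `kmeans_negative_sample`, the deterministic farthest-first initialisation, and the rule of keeping the member closest to each centroid are all choices made here for the example.

```python
import math


def kmeans_negative_sample(features, k, iterations=25):
    """Cluster candidate negatives with K-means and return the index of
    one representative (closest to the centroid) per non-empty cluster."""
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of samples")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    # Assumption: farthest-first initialisation, chosen for determinism.
    centroids = [list(features[0])]
    while len(centroids) < k:
        far = max(features,
                  key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    dim = len(features[0])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each sample to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, point in enumerate(features):
            nearest = min(range(k),
                          key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(idx)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(features[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]

    # Assumption: the member nearest to each centroid represents the cluster.
    picked = [
        min(members, key=lambda i: math.dist(features[i], centroids[c]))
        for c, members in enumerate(clusters) if members
    ]
    return sorted(picked)


# Toy feature vectors forming two well-separated groups (hypothetical data).
candidates = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
selected = kmeans_negative_sample(candidates, k=2)
```

In contrast to random selection, every region of the negative feature space that forms a cluster contributes a sample, which is the property the comparison above attributes to K-means sampling.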

SUBMITTER: Nath A

PROVIDER: S-EPMC7603689 | biostudies-literature | 2020 Oct

REPOSITORIES: biostudies-literature

Publications

Improved cytokine-receptor interaction prediction by exploiting the negative sample space.

Abhigyan Nath, André Leier

BMC Bioinformatics, 31 October 2020, issue 1

PMID: 33129275

Similar Datasets

Project description:

Background: In silico analyses are increasingly being used to support mode-of-action investigations; however many such approaches do not utilise the large amounts of inactive data held in chemogenomic repositories. The objective of this work is concerned with the integration of such bioactivity data in the target prediction of orphan compounds to produce the probability of activity and inactivity for a range of targets. To this end, a novel human bioactivity data set was constructed through the assimilation of over 195 million bioactivity data points deposited in the ChEMBL and PubChem repositories, and the subsequent application of a sphere-exclusion selection algorithm to oversample presumed inactive compounds.

Results: A Bernoulli Naïve Bayes algorithm was trained using the data and evaluated using fivefold cross-validation, achieving a mean recall and precision of 67.7% and 63.8% for active compounds and 99.6% and 99.7% for inactive compounds, respectively. We show the performances of the models are considerably influenced by the underlying intraclass training similarity, the size of a given class of compounds, and the degree of additional oversampling. The method was also validated using compounds extracted from WOMBAT producing average precision-recall AUC and BEDROC scores of 0.56 and 0.85, respectively. Inactive data points used for this test are based on presumed inactivity, producing an approximated indication of the true extrapolative ability of the models. A distance-based applicability domain analysis was also conducted; indicating an average Tanimoto Coefficient distance of 0.3 or greater between a test and training set can be used to give a global measure of confidence in model predictions. A final comparison to a method trained solely on active data from ChEMBL performed with precision-recall AUC and BEDROC scores of 0.45 and 0.76.

Conclusions: The inclusion of inactive data for model training produces models with superior AUC and improved early recognition capabilities, although the results from internal and external validation of the models show differing performance between the breadth of models. The realised target prediction protocol is available at https://github.com/lhm30/PIDGIN.

Graphical abstract: The inclusion of large scale negative training data for in silico target prediction improves the precision and recall AUC and BEDROC scores for target models.

Project description:

Introduction: An emerging hypothesis suggests that cytokines could play an important role in cancer as potential modulators of angiogenesis and leucocyte infiltration.

Methods: A novel multiplexed flow cytometry technology was used to measure the expression of 17 cytokines (IL-1beta, IL-2, IL-4, IL-5, IL-6, IL-7, IL-8, IL-10, IL-12 [p70], IL-13, IL-17, granulocyte colony-stimulating factor [CSF], granulocyte-macrophage CSF, IFN-gamma, monocyte chemoattractant protein [MCP]-1, macrophage inflammatory protein [MIP]-1beta, tumour necrosis factor [TNF]-alpha) at the protein level in 105 breast carcinomas. B lymphocyte, T lymphocyte and macrophage levels were determined by immunohistochemistry.

Results: Fourteen of the 17 cytokines were expressed in breast carcinoma, whereas only nine cytokines could be detected in normal breast. Most cytokines were more abundant in breast carcinoma than in normal breast, with IL-6, IL-8, granulocyte CSF, IFN-gamma, MCP-1 and MIP-1beta being very abundant. IL-2, IL-6, IL-8, IL-10, IFN-gamma, MCP-1, MIP-1beta and TNF-alpha, and to a lesser extent IL-1beta and IL-13, exhibited levels of expression that were inversely correlated to oestrogen receptor and progesterone receptor status. Most cytokines were not correlated with age at cancer diagnosis, tumour size, histological type, or lymph node status. However, IL-1beta, IL-6, IL-8, IL-10, IL-12, MCP-1 and MIP-1beta were more abundant in high-grade tumours than in low-grade tumours. In addition, IL-8 and MIP-1beta were expressed to a greater degree in HER2-positive than in HER2-negative patients. The expression of most of the studied cytokines was correlated to levels of activator protein-1, which is known to regulate numerous cytokines. Overexpression of MCP-1 and MIP-1beta was linked to B lymphocyte, T lymphocyte and macrophage infiltration, whereas high levels of IL-8 were correlated with high macrophage content in the tumour. Moreover, IL-8 positive tumours exhibited increased vascularization.

Conclusion: We found that multiple cytokines were overexpressed in oestrogen receptor negative breast carcinoma, and that the three major cytokines (MCP-1, MIP-1beta and IL-8) were correlated with inflammatory cell component, which could account for the aggressiveness of these tumours.
