Dataset Information

SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models.

ABSTRACT:

BACKGROUND: S-sulphenylation is a ubiquitous protein post-translational modification (PTM) in which an S-hydroxyl (-SOH) bond is formed via the reversible oxidation of the sulfhydryl group of cysteine (C). Recent experimental studies have revealed that S-sulphenylation plays critical roles in many biological functions, such as protein regulation and cell signaling. State-of-the-art bioinformatic advances have facilitated high-throughput in silico screening of protein S-sulphenylation sites, thereby significantly reducing the time and labour costs traditionally required for the experimental investigation of S-sulphenylation.

RESULTS: In this study, we have proposed a novel hybrid computational framework, termed SIMLIN, for accurate prediction of protein S-sulphenylation sites using a multi-stage neural-network-based ensemble-learning model that integrates both protein sequence-derived and protein structural features. Benchmarking experiments against the current state-of-the-art predictors for S-sulphenylation demonstrated that SIMLIN delivered competitive prediction performance. The empirical studies on the independent testing dataset demonstrated that SIMLIN achieved 88.0% prediction accuracy and an AUC score of 0.82, which outperforms currently existing methods.

CONCLUSIONS: In summary, SIMLIN predicts human S-sulphenylation sites with high accuracy, thereby facilitating biological hypothesis generation and experimental validation. The web server, datasets, and online instructions are freely available at http://simlin.erc.monash.edu/ for academic purposes.
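The abstract reports two metrics on the independent test set, accuracy and AUC. As a minimal sketch of how such figures are typically computed for a binary site predictor, the following pure-Python example may help. It is not SIMLIN's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of sites whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = S-sulphenylated cysteine, 0 = unmodified cysteine.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```

Accuracy depends on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is one reason the two numbers are usually reported together.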

SUBMITTER: Wang X

PROVIDER: S-EPMC6868744 | biostudies-literature | 2019 Nov

REPOSITORIES: biostudies-literature


Publications

SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models.

Wang X, Li C, Li F, Sharma VS, Song J, Webb GI

BMC Bioinformatics, 21 November 2019, issue 1


PMID: 31752668

Similar Datasets

Project description: Background and objective: Diabetes is a life-threatening chronic disease with a growing global prevalence, necessitating early diagnosis and treatment to prevent severe complications. Machine learning has emerged as a promising approach for diabetes diagnosis, but challenges such as limited labeled data, frequent missing values, and dataset imbalance hinder the development of accurate prediction models. Therefore, a novel framework is required to address these challenges and improve performance. Methods: In this study, we propose an innovative pipeline-based multi-classification framework to predict diabetes in three classes: diabetic, non-diabetic, and prediabetes, using the imbalanced Iraqi Patient Dataset of Diabetes. Our framework incorporates various pre-processing techniques, including duplicate sample removal, attribute conversion, missing value imputation, data normalization and standardization, feature selection, and k-fold cross-validation. Furthermore, we implement multiple machine learning models, such as k-NN, SVM, DT, RF, AdaBoost, and GNB, and introduce a weighted ensemble approach based on the Area Under the Receiver Operating Characteristic Curve (AUC) to address dataset imbalance. Performance optimization is achieved through grid search and Bayesian optimization for hyper-parameter tuning. Results: Our proposed model outperforms other machine learning models, including k-NN, SVM, DT, RF, AdaBoost, and GNB, in predicting diabetes. The model achieves high average accuracy, precision, recall, F1-score, and AUC values of 0.9887, 0.9861, 0.9792, 0.9851, and 0.999, respectively. Conclusion: Our pipeline-based multi-classification framework demonstrates promising results in accurately predicting diabetes using an imbalanced dataset of Iraqi diabetic patients. The proposed framework addresses the challenges associated with limited labeled data, missing values, and dataset imbalance, leading to improved prediction performance. 
This study highlights the potential of machine learning techniques in diabetes diagnosis and management, and the proposed framework can serve as a valuable tool for accurate prediction and improved patient care. Further research can build upon our work to refine and optimize the framework and explore its applicability in diverse datasets and populations.
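The description above mentions a weighted ensemble based on AUC but does not spell out the weighting scheme. One plausible reading, sketched here under that assumption, is soft voting in which each model's class probabilities are weighted in proportion to its validation AUC. The function name, the model keys, and all numbers below are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def auc_weighted_ensemble(
    model_probs: Dict[str, List[float]],
    model_aucs: Dict[str, float],
) -> List[float]:
    """Average per-class probabilities, weighting each model by its validation AUC."""
    total = sum(model_aucs[name] for name in model_probs)
    if total <= 0:
        raise ValueError("AUC weights must sum to a positive value")
    n_classes = len(next(iter(model_probs.values())))
    combined = [0.0] * n_classes
    for name, probs in model_probs.items():
        weight = model_aucs[name] / total  # normalised so the weights sum to 1
        for k, p in enumerate(probs):
            combined[k] += weight * p
    return combined


CLASSES = ["non-diabetic", "prediabetes", "diabetic"]

# Hypothetical class probabilities for one patient and hypothetical validation AUCs.
probs = {
    "knn": [0.6, 0.3, 0.1],
    "svm": [0.2, 0.5, 0.3],
    "rf": [0.1, 0.2, 0.7],
}
aucs = {"knn": 0.5, "svm": 0.5, "rf": 1.0}

ensemble_probs = auc_weighted_ensemble(probs, aucs)
predicted_class = CLASSES[max(range(len(CLASSES)), key=lambda k: ensemble_probs[k])]
```

Because the weights are normalised, the combined vector remains a probability distribution whenever each model's output is one; a model with a higher AUC simply pulls the final prediction further toward its own vote.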

Project description:Accurate prognostic prediction is crucial for treatment decision-making in lung papillary adenocarcinoma (LPADC). The aim of this study was to predict cancer-specific survival in LPADC using ensemble machine learning and classical Cox regression models. Moreover, models were evaluated to provide recommendations based on quantitative data for personalized treatment of LPADC. Data of patients diagnosed with LPADC (2004-2018) were extracted from the Surveillance, Epidemiology, and End Results database. The set of samples was randomly divided into the training and validation sets at a ratio of 7:3. Three ensemble models were selected, namely gradient boosting survival (GBS), random survival forest (RSF), and extra survival trees (EST). In addition, Cox proportional hazards (CoxPH) regression was used to construct the prognostic models. The Harrell's concordance index (C-index), integrated Brier score (IBS), and area under the time-dependent receiver operating characteristic curve (time-dependent AUC) were used to evaluate the performance of the predictive models. A user-friendly web access panel was provided to easily evaluate the model for the prediction of survival and treatment recommendations. A total of 3615 patients were randomly divided into the training and validation cohorts (n = 2530 and 1085, respectively). The extra survival trees, RSF, GBS, and CoxPH models showed good discriminative ability and calibration in both the training and validation cohorts (mean of time-dependent AUC: > 0.84 and > 0.82; C-index: > 0.79 and > 0.77; IBS: < 0.16 and < 0.17, respectively). The RSF and GBS models were more consistent than the CoxPH model in predicting long-term survival. We implemented the developed models as web applications for deployment into clinical practice (accessible through https://shinyshine-820-lpaprediction-model-z3ubbu.streamlit.app/ ). All four prognostic models showed good discriminative ability and calibration. 
The RSF and GBS models exhibited the highest effectiveness among all models in predicting the long-term cancer-specific survival of patients with LPADC. This approach may facilitate the development of personalized treatment plans and prediction of prognosis for LPADC.
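Harrell's concordance index, one of the evaluation metrics named above, can be illustrated with a short pure-Python sketch. This is a simplified textbook form, not the study's code: it ignores tied survival times, and the function name and toy cohort are assumptions made for illustration.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Share of comparable patient pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up in months, 1 = cancer-specific death, 0 = censored.
toy_times = [5, 8, 12, 20]
toy_events = [1, 0, 1, 0]
toy_risk = [0.9, 0.3, 0.2, 0.4]
toy_c_index = harrell_c_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported C-index values above 0.77 indicate that the models rank most comparable patient pairs correctly.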

Project description: Visceral leishmaniasis (VL) is a neglected tropical disease of public health importance in the Indian subcontinent. Despite consistent elimination initiatives, the disease has not yet been eliminated and there is an increased risk of resurgence from active VL reservoirs, including asymptomatic, post-kala-azar dermal leishmaniasis (PKDL) and HIV-VL co-infected individuals. To achieve complete elimination and sustain it in the long term, a prophylactic vaccine that can elicit long-lasting immunity is desirable. In this study, we employed immunoinformatic tools to design a multi-subunit epitope vaccine for the Indian population by targeting antigenic secretory proteins screened from the Leishmania donovani proteome. Out of 8014 proteins, 277 secretory proteins were screened for their cellular location and proteomic evidence. Through NCBI BlastP, unique fragments of the proteins were cropped, and their antigenicity was evaluated. B-cell, HTL and CTL epitopes as well as IFN-γ, IL-17, and IL-10 inducers were predicted and manually mapped to the fragments, and common regions were tabulated, forming a peptide ensemble. The ensemble was evaluated for Class I MHC immunogenicity and toxicity. Further, immunogenic peptides were randomly selected and used to design vaccine constructs. Eight vaccine constructs were generated by linking random peptides with GS linkers. The synthetic TLR-4 agonist RS09 was used as an adjuvant and linked with the constructs using EAAK linkers. The predicted population coverage of the constructs was ∼99.8% in the Indian as well as South Asian populations. The most antigenic, nontoxic, non-allergic construct was chosen for the prediction of secondary and tertiary structures. The 3D structures were refined and analyzed using Ramachandran plots and Z-scores. The construct was docked with the TLR-4 receptor. Molecular dynamics simulation was performed to check the stability of the docked complex. 
Comparative in silico immune simulation studies showed that the predicted construct elicited humoral and cell-mediated immunity in human host comparable to that elicited by Leish-F3, which is a promising vaccine candidate for human VL.

Project description: Drought is a natural hazard that results from a prolonged shortage of precipitation, high temperature and changes in the weather pattern. Drought harms society, the economy and the natural environment, but it is difficult to identify and characterize. Many areas of Pakistan have suffered severe droughts during the last three decades due to changes in the weather pattern. A drought analysis incorporating climate information has not yet been undertaken in this study region. Here, we propose an ensemble approach for monthly drought prediction and to define and examine wet/dry events. Initially, the drought events were identified by the short-term Standardized Precipitation Index (SPI-3). Drought is predicted based on three ensemble models, i.e., the Equal Ensemble Drought Prediction (EEDP), Weighted Ensemble Drought Prediction (WEDP) and Conditional Ensemble Drought Prediction (CEDP) models. In addition, two weighting procedures, Traditional Weighting (TW) and Weighted Bootstrap Resampling (WBR), are used to distribute weights in the WEDP model. Four copula families (i.e., Frank, Clayton, Gumbel and Joe) are used to explain the dependency relation between climate indices and precipitation in the CEDP model. Among the four copula families, the Joe copula was found suitable in most cases. The CEDP model provides better results in terms of accuracy and uncertainty than the other ensemble models for all meteorological stations. The performance of the CEDP model indicates that the climate indices are correlated with the weather pattern of the four meteorological stations. Moreover, the percentage occurrence of extreme drought events at the Multan, Bahawalpur, Barkhan and Khanpur stations is 1.44%, 0.57%, 2.59% and 1.71%, respectively, whereas the percentage occurrence of extremely wet events is 2.3%, 1.72%, 0.86% and 2.86%, respectively. 
The understanding of drought pattern by including climate information can contribute to the knowledge of future agriculture and water resource management.

Project description: Background: The recent use of artificial intelligence (AI) in medical research is noteworthy. However, most research has focused on medical imaging. Although the importance of laboratory tests in the clinical field is acknowledged by clinicians, they are undervalued in medical AI research. Our study aims to develop an early prediction AI model for pneumonia mortality, primarily using laboratory test results. Materials and methods: We developed a mortality prediction model using initial laboratory results and basic clinical information of patients with pneumonia. Several machine learning (ML) models and a deep learning method, the multilayer perceptron (MLP), were selected for model development. The area under the receiver operating characteristic curve (AUROC) and F1-score were optimized to improve model performance. In addition, an ensemble model was developed by blending several models to improve the prediction performance. We used 80,940 data instances for model development. Results: Among the ML models, XGBoost exhibited the best performance (AUROC = 0.8989, accuracy = 0.88, F1-score = 0.80). MLP achieved an AUROC of 0.8498, accuracy of 0.86, and F1-score of 0.75. The performance of the ensemble model was the best among the developed models, with an AUROC of 0.9006, accuracy of 0.90, and F1-score of 0.81. Several laboratory tests were conducted to identify risk factors that affect pneumonia mortality using the "Feature importance" technique and SHapley Additive exPlanations. We identified several laboratory results, including systolic blood pressure, serum glucose level, age, aspartate aminotransferase-to-alanine aminotransferase ratio, and monocyte-to-lymphocyte ratio, as significant predictors of mortality in patients with pneumonia. Conclusions: Our study demonstrates that the ensemble model, incorporating XGBoost, CatBoost, and LGBM techniques, outperforms individual ML and deep learning models in predicting pneumonia mortality. 
Our findings emphasize the importance of integrating AI techniques to leverage laboratory test data effectively, offering a promising direction for advancing AI applications in medical research and clinical decision-making.

