Project description:Using a public reference data set of 82 unique entities, 382 nanopore-sequenced brain tumor samples were classified based on their methylation status through an ad hoc random forest algorithm. As a measure of confidence, score recalibration was performed and platform-specific thresholds were defined.
Project description:A Random Forest model is developed to incorporate tumor mutation data within the context of the biological process known as leukocyte proliferation regulation. This model aims to predict a patient's response to anti-PD1 treatment.
The authors conducted experiments using four different types of classifiers: Random Forest, Gradient Boosting, Feed Forward Neural Network, and Long Short-Term Memory (LSTM) recurrent neural network. Among these classifiers, the Random Forest algorithm yielded the best predictive performance when modeling gene mutation data associated with the 'leukocyte proliferation regulation' biological process. Hence, this curated version of the model focuses on the Random Forest model trained specifically on the 'Leukocyte Proliferation Regulation' process.
In this model, a value of '0' is assigned to NonResponders, while a value of '1' is assigned to Responders. Please note that to obtain predictions, users should provide mutation data containing only the genes corresponding to the 'GO_REGULATION_OF_LEUKOCYTE_PROLIFERATION' process keyword, as specified in the 'GO_test_genes_dict_intersection' dictionary.
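The input convention described above (a fixed gene list and labels 0/1) can be sketched as follows. This is only an illustration, not the curated model's code: the gene names, the `build_feature_vector` helper, the sample data, and the binary mutated/not-mutated encoding are all assumptions. The real gene list is the one stored under 'GO_REGULATION_OF_LEUKOCYTE_PROLIFERATION' in the 'GO_test_genes_dict_intersection' dictionary.

```python
# Hypothetical stand-in for the gene list stored under
# 'GO_REGULATION_OF_LEUKOCYTE_PROLIFERATION' in GO_test_genes_dict_intersection.
PROCESS_GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Label convention stated in the description: 0 = NonResponder, 1 = Responder.
LABELS = {0: "NonResponder", 1: "Responder"}


def build_feature_vector(mutated_genes, process_genes=PROCESS_GENES):
    """Return a 0/1 vector ordered like process_genes.

    Genes that are not part of the process gene list are dropped, so the
    vector always has one entry per process gene (assumed binary encoding).
    """
    mutated = set(mutated_genes)
    return [1 if gene in mutated else 0 for gene in process_genes]


# A made-up patient with two mutated process genes and one unrelated gene.
patient_mutations = ["GENE_B", "GENE_D", "UNRELATED_GENE"]
features = build_feature_vector(patient_mutations)
print(features)
```

A vector built this way has a fixed length and gene order, which is what a trained classifier would need regardless of how many other genes are mutated in a given sample.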
Project description:We examined published microarray data from 104 acute lymphoblastic leukaemia patient specimens, that represent six different subgroups defined by cytogenetic features and immunophenotypes. Using the decision-tree based supervised learning algorithm Random Forest (RF), we determined a small set of genes for optimal subgroup distinction and subsequently validated their predictive power in an independent cohort of 68 specimens that were assessed using Affymetrix HG-U133A arrays.
Project description:Transcriptomic and proteomic data from human cells infected with Dengue virus were used to infer a number of networks and to determine which network inference methods were best for linking proteins and transcripts in the same network. GENIE3, a random forest method, was found to be the best; networks inferred with this method were then interrogated to gain knowledge about host-pathogen interactions in Dengue infection.
Project description:Objectives: Our goal was to evaluate the diagnostic value of DNA methylation analysis in combination with machine learning to differentiate pleural mesothelioma (PM) from important histopathological mimics. Material and methods: DNA methylation data of PM, lung adenocarcinomas, lung squamous cell carcinomas and chronic pleuritis were used to train a random forest as well as a support vector machine. These classifiers were validated using an independent validation cohort including pleural carcinosis and pleomorphic variants of lung adeno- and squamous cell carcinomas. Furthermore, we used a deconvolution method to estimate the composition of the tumor microenvironment. Results: T-distributed stochastic neighbor embedding clearly separated PM from lung adenocarcinomas and squamous cell carcinomas, but there was a considerable overlap between chronic pleuritis specimens and PM with low tumor cell content. While both machine learning algorithms achieved comparable accuracies in a nested cross-validation on the training cohort (random forest: 94.9%; support vector machine: 95.5%), the support vector machine outperformed the random forest in distinguishing PM from chronic pleuritis. Differential methylation analysis revealed promoter hypermethylation in PM specimens, including the tumor suppressor genes BCL11B, EBF1, FOXA1, and WNK2. Furthermore, we observed comparable accuracies for the support vector machine on the validation cohort (97.1%) while the random forest performed considerably worse (89.9%). Deconvolution of the stromal and immune cell composition revealed higher rates of regulatory T-cells and endothelial cells in tumor specimens and a heterogeneous inflammation including macrophages, B-cells and natural killer cells in chronic pleuritis. Conclusion: DNA methylation in combination with machine learning is a promising tool to reliably differentiate PM from chronic pleuritis and lung cancer, including pleomorphic carcinomas. Furthermore, our study highlights new candidate genes for PM carcinogenesis and shows that deconvolution of DNA methylation data can provide reasonable insights into the composition of the tumor microenvironment.
Project description:Immunotherapy has improved the prognosis of patients with advanced non-small cell lung cancer (NSCLC), but only a small subset of patients achieve clinical benefit. The purpose of our study was to integrate multidimensional data using a machine learning method to predict the therapeutic efficacy of immune checkpoint inhibitor (ICI) monotherapy in patients with advanced NSCLC. The authors retrospectively enrolled 112 patients with stage IIIB-IV NSCLC receiving ICI monotherapy. The random forest (RF) algorithm was used to establish efficacy prediction models based on five different input datasets: precontrast computed tomography (CT) radiomic data, postcontrast CT radiomic data, a combination of the two CT radiomic datasets, clinical data, and a combination of radiomic and clinical data. Five-fold cross-validation was used to train and test the random forest classifier. The performance of the models was assessed by the area under the receiver operating characteristic (ROC) curve (AUC). Among these models (RF, MLP, LR, XGBoost), our reproduced ONNX models perform better, especially the random forest. The response variable takes the value 1 for efficacy and 0 for inefficacy of PD-1/PD-L1 monotherapy in patients with advanced NSCLC.
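The evaluation scheme named in the description above (5-fold cross-validation scored by ROC AUC) can be illustrated with a dependency-free sketch. It is not the study's pipeline: the helper names `k_fold_indices` and `roc_auc`, the shuffling seed, and the toy labels and scores are assumptions made for illustration, and no classifier is trained here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k assigns every index to exactly one fold.
    return [indices[i::k] for i in range(k)]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 112 patients split into 5 test folds; each fold is held out once.
folds = k_fold_indices(112, k=5)
fold_sizes = sorted(len(fold) for fold in folds)

# Toy labels (1 = efficacy, 0 = inefficacy) and made-up classifier scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(fold_sizes, round(toy_auc, 3))
```

In a full run, a model would be fitted on the four remaining folds and scored on the held-out fold, and the per-fold AUC values would then be summarized.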
Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification has remained a challenge. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of histone modifications have previously been investigated for this purpose, leaving unanswered the question of whether there exists an optimal set of histone modifications that could improve enhancer prediction. Here, we address this issue by exploring a rich dataset produced by the human Epigenome Roadmap Project. Specifically, we examined genome-wide profiles of 24 histone modifications in human embryonic stem cells and fibroblasts, and developed a Random-Forest based algorithm to integrate histone modification profiles for identification of enhancers. As a training set, we used histone modification profiles at genome-wide binding sites of p300 in the two cell types identified using ChIP-seq. We show that this algorithm not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify an optimal set of three chromatin marks for enhancer prediction.
Project description:Top-down mass spectrometry (MS) is a powerful tool for identification and comprehensive characterization of proteoforms arising from alternative splicing, sequence variation, and post-translational modifications. While the technique is powerful, it suffers from the complexity of the datasets generated by top-down MS experiments, which require sequential data processing steps for interpretation. Deconvolution of the complex isotopic distribution that arises from naturally occurring isotopes is a critical step in data processing. Multiple algorithms are currently available to deconvolute top-down mass spectra; however, each algorithm generates different deconvoluted peak lists with varying accuracy compared to true positive annotations. In this study, we have designed a machine learning strategy that can process and combine the peak lists from different deconvolution results. By optimizing clustering results, deconvolution results from the THRASH, TopFD, MS-Deconv, and SNAP algorithms were combined into consensus peak lists at various thresholds using either a simple voting ensemble method or a random forest machine learning algorithm. The random forest model outperformed the single best algorithm. This machine learning strategy could enhance the accuracy and confidence in protein identification during database search by accelerating detection of true positive peaks while filtering out false positive peaks. Thus, this method shows promise in enhancing proteoform identification and characterization for high-throughput data analysis in top-down proteomics.
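The simple voting ensemble mentioned above can be sketched as follows. This is a guess at the general idea, not the study's implementation: the greedy clustering rule, the `consensus_peaks` helper, the 0.01 Da tolerance, and the toy masses are all assumptions, and the random forest variant is not shown.

```python
def consensus_peaks(peak_lists, tolerance=0.01, min_votes=2):
    """Cluster masses across algorithms and keep clusters with enough votes.

    peak_lists maps an algorithm name to a list of masses (Da). Masses are
    visited in sorted order and merged into the current cluster when they
    lie within `tolerance` of its running mean. A cluster's vote count is
    the number of distinct algorithms that contributed a peak to it.
    """
    tagged = sorted(
        (mass, name) for name, masses in peak_lists.items() for mass in masses
    )
    clusters = []  # each cluster: {"masses": [...], "voters": set of names}
    for mass, name in tagged:
        if clusters:
            last = clusters[-1]
            mean = sum(last["masses"]) / len(last["masses"])
            if abs(mass - mean) <= tolerance:
                last["masses"].append(mass)
                last["voters"].add(name)
                continue
        clusters.append({"masses": [mass], "voters": {name}})
    return [
        (sum(c["masses"]) / len(c["masses"]), len(c["voters"]))
        for c in clusters
        if len(c["voters"]) >= min_votes
    ]


# Made-up deconvolution outputs, one list of masses per algorithm.
toy_peak_lists = {
    "THRASH": [1000.000, 1500.500],
    "TopFD": [1000.004, 2000.250],
    "MS-Deconv": [1000.002, 1500.503],
    "SNAP": [2500.750],
}
consensus = consensus_peaks(toy_peak_lists, tolerance=0.01, min_votes=2)
strict = consensus_peaks(toy_peak_lists, tolerance=0.01, min_votes=3)
print(consensus)
```

Raising `min_votes` corresponds to a stricter consensus threshold: fewer peaks survive, but each is supported by more algorithms.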