Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:Background: Chemicals may lead to acute liver injuries, posing a serious threat to human health. Achieving the precise safety profile of a compound is challenging due to the complex and expensive testing procedures. In silico approaches can help identify the potential risk of drug candidates in the initial stage of drug development, thus reducing development cost. Methods: In the current study, QSAR models were developed for hepatotoxicity prediction using an ensemble strategy that integrates machine learning (ML) and deep learning (DL) algorithms across various molecular features. A large dataset of 2588 chemicals and drugs was randomly divided into training (80%) and test (20%) sets, followed by the training of individual base models using diverse machine learning or deep learning algorithms based on three different kinds of descriptors and fingerprints. Feature selection approaches were employed to optimize the models based on their performance. Hybrid ensemble approaches were then used to determine the best-performing method. Results: The voting ensemble classifier emerged as the optimal model, achieving a prediction accuracy of 80.26%, an AUC of 82.84%, and a recall of over 93%, followed by the bagging and stacking ensemble classifiers. The model was further verified by an external test set, internal 10-fold cross-validation, and rigorous benchmark training, exhibiting much better reliability than the published models. Conclusion: The proposed ensemble model offers a dependable assessment with good performance for predicting the risk that chemicals and drugs induce liver damage.
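The voting ensemble named in the results above can be illustrated with a minimal sketch. The abstract does not say whether hard or soft voting was used, so hard (majority) voting is assumed here, and the three base-model outputs are hypothetical placeholders rather than data from the study.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model class labels by hard (majority) voting.

    predictions_per_model: list of equal-length label lists, one per base model.
    If several labels tie, the one seen first among the votes is returned.
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for sample_votes in zip(*predictions_per_model):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base models (1 = hepatotoxic, 0 = non-toxic).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

With an odd number of binary classifiers no ties can occur, which is one reason three or five base models are a common choice for this kind of ensemble.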
Project description:Objective: Drug-drug interactions (DDIs) are an important consideration in both drug development and clinical application, especially for co-administered medications. While it is necessary to identify all possible DDIs during clinical trials, DDIs are frequently reported after the drugs are approved for clinical use, and they are a common cause of adverse drug reactions (ADRs) and increased healthcare costs. Computational prediction may assist in identifying potential DDIs during clinical trials. Methods: Here we propose a heterogeneous network-assisted inference (HNAI) framework to assist with the prediction of DDIs. First, we constructed a comprehensive DDI network that contained 6946 unique DDI pairs connecting 721 approved drugs based on DrugBank data. Next, we calculated drug-drug pair similarities using four features: phenotypic similarity based on a comprehensive drug-ADR network, therapeutic similarity based on the drug Anatomical Therapeutic Chemical classification system, chemical structural similarity from SMILES data, and genomic similarity based on a large drug-target interaction network built using the DrugBank and Therapeutic Target Database. Finally, we applied five predictive models in the HNAI framework: naive Bayes, decision tree, k-nearest neighbor, logistic regression, and support vector machine. Results: The area under the receiver operating characteristic curve of the HNAI models is 0.67 as evaluated using fivefold cross-validation. Using antipsychotic drugs as an example, several HNAI-predicted DDIs that involve weight gain and cytochrome P450 inhibition were supported by literature resources. Conclusions: Through machine learning-based integration of drug phenotypic, therapeutic, structural, and genomic similarities, we demonstrated that HNAI is promising for uncovering DDIs in drug development and postmarketing surveillance.
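The area under the ROC curve reported above has a simple pairwise interpretation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it that way. The labels and scores are hypothetical, and in a fivefold cross-validation the function would be applied to each held-out fold and the results averaged.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (positive) or 0 (negative).
    scores: iterable of model scores, higher meaning "more likely positive".
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six candidate drug pairs (1 = known interaction).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based formulation would be preferable for large networks such as the one described here.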
Project description:Background: Osteoarthritis (OA) is a common cause of disability among the elderly, profoundly affecting quality of life. This study aims to leverage bioinformatics and machine learning to develop an artificial neural network (ANN) model for diagnosing OA, providing new avenues for early diagnosis and treatment. Methods: We first obtained OA synovial tissue microarray datasets from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) associated with OA were identified using the Limma package and weighted gene co-expression network analysis (WGCNA). Subsequently, protein-protein interaction (PPI) network analysis and machine learning were employed to identify the most relevant potential feature genes of OA, and an ANN diagnostic model was constructed, with receiver operating characteristic (ROC) curves used to evaluate its diagnostic performance. In addition, the expression levels of the feature genes were verified using real-time quantitative polymerase chain reaction (qRT-PCR). Finally, immune cell infiltration analysis was performed using the CIBERSORT algorithm to explore the correlation between feature genes and immune cells. Results: The Limma package and WGCNA identified a total of 72 DEGs related to OA, of which 12 were up-regulated and 60 were down-regulated. The PPI network analysis then identified 21 hub genes, and three machine learning algorithms finally screened four feature genes (BTG2, CALML4, DUSP5, and GADD45B). The ANN diagnostic model was constructed based on these four feature genes. The AUC of the training set was 0.942, and the AUC of the validation set was 0.850. In addition, the qRT-PCR validation results demonstrated a significant downregulation of BTG2, DUSP5, and GADD45B mRNA expression levels in OA samples compared to normal samples, while the CALML4 mRNA expression level was upregulated. Immune cell infiltration analysis revealed that abnormal infiltration of memory B cells, gamma delta T cells, naive B cells, plasma cells, resting memory CD4 T cells, and activated NK cells may be related to the progression of OA. Conclusions: BTG2, CALML4, DUSP5, and GADD45B were identified as potential feature genes for OA, and an ANN diagnostic model with good diagnostic performance was developed, providing a new perspective for the early diagnosis and personalized treatment of OA.
Project description:Objective: This study aimed to identify key clock genes closely associated with major depressive disorder (MDD) using bioinformatics and machine learning approaches. Methods: Gene expression data from blood samples of 128 MDD patients and 64 healthy controls were obtained. Differentially expressed genes were identified, and weighted gene co-expression network analysis (WGCNA) was performed to screen MDD-related key genes. These genes were then intersected with 1475 known circadian rhythm genes to identify circadian rhythm genes associated with MDD. Finally, multiple machine learning algorithms were applied for further selection to determine the four most critical circadian rhythm biomarkers. Results: Four key circadian rhythm genes (ABCC2, APP, HK2, and RORA) were identified that could effectively distinguish MDD samples from controls. These genes were significantly enriched in circadian pathways and showed strong correlations with immune cell infiltration. Drug target prediction suggested that small molecules such as melatonin and escitalopram may target these circadian rhythm proteins. Conclusion: This study identified four key circadian rhythm genes closely associated with MDD, which may serve as diagnostic biomarkers and therapeutic targets. The findings highlight the important role of circadian disruption in the pathogenesis of MDD, providing new insights for precision diagnosis and targeted treatment of MDD.
Project description:Genomic prediction (GP) aims to construct a statistical model for predicting phenotypes using genome-wide markers and is a promising strategy for accelerating molecular plant breeding. However, current progress of phenotype prediction using genomic data alone has reached a bottleneck, and previous studies on transcriptomic and metabolomic predictions ignored genomic information. Here, we designed a novel strategy of GP called multilayered least absolute shrinkage and selection operator (MLLASSO) by integrating multiple omic data into a single model that iteratively learns three layers of genetic features (GFs) supervised by observed transcriptome and metabolome. Significantly, MLLASSO learns higher order information of gene interactions, which enables us to achieve a significant improvement of predictability of yield in rice from 0.1588 (GP alone) to 0.2451 (MLLASSO). In the prediction of the first two layers, some genes were found to be genetically predictable genes (GPGs) as their expressions were accurately predicted with genetic markers. Interestingly, we made three dramatic discoveries for the GPGs: (i) GPGs are good predictors for highly complex traits like yield; (ii) GPGs are mostly eQTL genes (cis or trans); and (iii) trait-related transcriptional factor families are enriched in GPGs. These findings support the notion that learned GFs not only are good predictors for traits but also have specific biological implications regarding regulation of gene expressions. To differentiate the new method from conventional GP models, we called MLLASSO a directed learning strategy supervised by intermediate omic data. This new prediction model appears to be more reliable and more robust than conventional GP models.
Project description:Breast cancer is the most common malignancy in women, and because it has a high mortality rate, computational methods that increase the accuracy of breast cancer survival prediction models are urgently needed. Although multi-omics data such as gene expression have been extensively used in recent studies, accurate prognosis of breast cancer remains a challenge. Somatic mutations are another important and promising data source for studying cancer development, and their effect on the prognosis of breast cancer remains to be further explored. Meanwhile, these omics datasets are high-dimensional and redundant. Therefore, we adopted multiple kernel learning (MKL) to efficiently integrate somatic mutation data with current molecular data, including gene expression, copy number variation (CNV), methylation, and protein expression data, for the prediction of breast cancer survival. Before integration, the maximum relevance minimum redundancy (mRMR) feature selection method was used to select, for each type of data, features that show high relevance to survival and low redundancy among themselves. The experimental results demonstrated that the proposed method achieved the best performance and that prediction performance improved markedly when somatic mutations were included, indicating that somatic mutations are critical for improving breast cancer survival predictions. Moreover, mRMR was superior to other feature selection methods used in previous studies. Furthermore, MKL outperformed the other traditional classifiers in multi-omics data integration. Our analysis indicated that by employing promising omics data such as somatic mutations and harnessing proper feature selection methods and effective integration frameworks, the accuracy of breast cancer survival prediction can be further increased, thereby supporting better clinical diagnosis and more effective treatment for breast cancer patients.
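The mRMR idea described above (prefer features that are relevant to the outcome but not redundant with features already chosen) can be sketched as a greedy loop. This is only an illustration under stated assumptions: absolute Pearson correlation stands in for the relevance and redundancy measures (mRMR is usually formulated with mutual information), the difference form "relevance minus mean redundancy" is assumed, and the feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def mrmr_select(features, target, k):
    """Greedily pick k feature names maximizing relevance minus mean redundancy.

    features: dict mapping feature name -> list of values.
    target:   list of outcome values aligned with the feature columns.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]  # first pick: relevance only
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: feat_a_near_copy is almost a duplicate of feat_a.
target = [1, 2, 3, 4, 5, 6]
features = {
    "feat_a": [1, 2, 3, 4, 5, 7],
    "feat_a_near_copy": [1, 2, 3, 4, 5, 8],
    "feat_b": [2, 1, 2, 1, 3, 2],
}
chosen = mrmr_select(features, target, k=2)
```

In this toy case the near-copy is more relevant to the target than `feat_b`, yet it is passed over on the second pick because it adds almost no information beyond `feat_a`.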
Project description:Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved. We developed a new method to integrate the classification power of machine learning with evolutionary signals embedded in protein families in order to improve protein domain boundary prediction. The method first extracts putative domain boundary signals from a multiple sequence alignment between a query sequence and its homologs. The putative sites are then classified and scored by support vector machines in conjunction with input features such as sequence profiles, secondary structures, solvent accessibilities around the sites and their positions. The method was evaluated on a domain benchmark by 10-fold cross-validation, and 60% of true domain boundaries can be recalled at a precision of 60%. The trade-off between the precision and recall can be adjusted according to specific needs by using different decision thresholds on the domain boundary scores assigned by the support vector machines. The good prediction accuracy and the flexibility of selecting domain boundary sites at different precision and recall values make our method a useful tool for protein structure determination and modelling. The method is available at http://sysbio.rnet.missouri.edu/dobo/.
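The precision/recall trade-off mentioned above can be made concrete with a small sketch: sites whose score meets a decision threshold are called boundaries, and raising or lowering that threshold shifts the balance between the two metrics. The scores and labels below are hypothetical, not outputs of the published method, and precision is defined as 0.0 when no site is predicted (an assumed convention).

```python
def precision_recall_at(scores, is_true_boundary, threshold):
    """Precision and recall when sites with score >= threshold are called boundaries."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, t in zip(predicted, is_true_boundary) if p and t)
    fp = sum(1 for p, t in zip(predicted, is_true_boundary) if p and not t)
    fn = sum(1 for p, t in zip(predicted, is_true_boundary) if not p and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier scores for six putative boundary sites.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
is_true_boundary = [True, True, False, True, False, False]

strict = precision_recall_at(scores, is_true_boundary, threshold=0.7)
lenient = precision_recall_at(scores, is_true_boundary, threshold=0.35)
```

Here the strict threshold yields fewer but cleaner predictions, while the lenient one recovers every true boundary at the cost of a false positive.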
Project description:Background: Sepsis is a life-threatening disease causing millions of deaths every year. Programmed cell death (PCD) has been reported to play a critical role in the development and progression of sepsis and has the potential to serve as a diagnostic and prognostic indicator for patients with sepsis. Methods: Fourteen PCD patterns were analyzed for model construction. Seven transcriptome datasets and a single-cell sequencing dataset were collected from the Gene Expression Omnibus database. Results: A total of 289 PCD-related differentially expressed genes were identified between sepsis patients and healthy individuals. The machine learning algorithm screened three PCD-related genes, NLRC4, TXN and S100A9, as potential biomarkers for sepsis. The area under the curve of the diagnostic model reached 100.0% in the training set and 100.0%, 99.9%, 98.9%, 99.5% and 98.6% in five validation sets. Furthermore, we verified the diagnostic genes in sepsis patients from our center by qPCR. Single-cell sequencing analysis revealed that NLRC4, TXN and S100A9 were mainly expressed in myeloid/monocytes and dendritic cells. Immune infiltration analysis revealed that multiple immune cell types were involved in the development of sepsis. Correlation analysis and gene set enrichment analysis (GSEA) revealed that the three biomarkers were significantly associated with immune cell infiltration. Conclusions: We developed and validated a diagnostic model for sepsis based on three PCD-related genes. Our study might provide potential peripheral blood diagnostic candidate biomarkers for patients with sepsis.
Project description:Background: Osteoarthritis (OA) is one of the main causes of pain and disability in the world and may be caused by many factors. Aging plays a significant role in the onset and progression of OA. However, the underlying mechanisms remain unknown. Our research aimed to uncover the role of aging-related genes in the progression of OA. Methods: Human OA datasets and aging-related genes were obtained from the GEO database and the HAGR website, respectively. Bioinformatics methods including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, and protein-protein interaction (PPI) network analysis were used to analyze differentially expressed aging-related genes (DEARGs) between the normal control group and the OA group. Weighted gene coexpression network analysis (WGCNA), least absolute shrinkage and selection operator (LASSO) regression, and the Random Forest (RF) machine learning algorithm were then used to find the hub genes. Results: Four overlapping hub genes (HMGB2, CDKN1A, JUN, and DDIT3) were identified. According to the nomogram model and receiver operating characteristic (ROC) curve analysis, the four hub DEARGs had good diagnostic value in distinguishing normal from OA samples. Furthermore, qRT-PCR demonstrated that HMGB2, CDKN1A, JUN, and DDIT3 mRNA expression levels were lower in the OA group than in the normal group. Conclusion: These four hub aging-related genes may help us understand the underlying mechanism of aging in osteoarthritis and could be used as possible diagnostic and therapeutic targets.
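The "overlapping hub genes" step above amounts to intersecting the gene lists returned by the individual selection methods. A minimal sketch follows. The gene identifiers are placeholders (not genes from the study), and the abstract does not state exactly how the three results were combined, so a plain set intersection is an assumption.

```python
def overlapping_genes(*gene_lists):
    """Return the sorted genes that appear in every supplied list."""
    sets = [set(genes) for genes in gene_lists]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Placeholder outputs of three hypothetical selection steps.
wgcna_module_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
lasso_selected = ["GENE_B", "GENE_C", "GENE_E"]
rf_top_ranked = ["GENE_C", "GENE_B", "GENE_F"]

hub_genes = overlapping_genes(wgcna_module_genes, lasso_selected, rf_top_ranked)
```

Requiring agreement across methods trades sensitivity for robustness: a gene survives only if every method independently selects it.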