Dataset Information

Developing prognostic gene panel of survival time in lung adenocarcinoma patients using machine learning

ABSTRACT:

Background

Transcriptome data generates massive amounts of information that can be used for characterization and prognosis of patient outcomes for many diseases. The goal of our research is to predict the survival time of lung adenocarcinoma patients and improve the accuracy of classifying the long-survival cohort and short-survival cohort.

Methods

We used the Relief method to filter prognostic features related to the survival time of lung adenocarcinoma patients, and predicted whether a patient's survival time exceeds 3 years using eight machine learning algorithms (Support Vector Machines, Random Forests, Logistic Regression, Naïve Bayes, Linear Regression, Support Vector Regression (kernel Poly), Support Vector Regression (kernel Linear), and Ridge Regression). The best-performing algorithm was then chosen to build a predictive model of the survival time of lung adenocarcinoma patients. Another dataset was used to verify the stability and suitability of this model. We also explored the underlying mechanisms of RNA expression changes with the corresponding DNA mutations and DNA methylation patterns in the 22 selected genetic features.

Results

The best machine learning algorithm was Naïve Bayes (accuracy = 75%, AUC = 0.81) using the top 22 genetic features, and this algorithm also performed stably and well on another dataset. Within the 22 genes, the coupled mutation number of the long-survival group (>6 years) was lower than that of the short-survival group (<1 year) (P = 0.031).

Conclusions

The expression of the gene panel can predict the survival time of lung adenocarcinoma patients using Naïve Bayes. These 22 genes do affect the survival time of lung adenocarcinoma patients.
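
The feature-filtering step named in the Methods (Relief) can be illustrated with a minimal sketch. This is not the authors' code: the function name `relief_scores`, the toy data, and the choice to visit every sample once (instead of random sampling) are assumptions made for illustration. It scores each feature by how well it separates a sample from its nearest neighbour of the other class (the "miss") compared with its nearest neighbour of the same class (the "hit").

```python
def relief_scores(X, y):
    """Score features with a basic Relief variant for binary labels.

    X: list of rows, each a list of numeric feature values.
    y: list of 0/1 labels (for example, 1 = survival time > 3 years).
    Every sample is visited once, which is an assumption of this sketch.
    """
    n, d = len(X), len(X[0])

    # Per-feature range, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # cannot update without both a hit and a miss
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward separation from the miss, penalise distance to the hit.
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return weights


# Toy data (hypothetical): feature 0 tracks the label, feature 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9],
     [0.90, 0.4], [0.80, 0.8], [0.95, 0.2]]
y = [0, 0, 0, 1, 1, 1]
scores = relief_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In a workflow like the one described, the highest-scoring features (22 in the abstract) would then be passed to the classifiers.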

SUBMITTER: Liu Y

PROVIDER: S-EPMC8799101 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description:Disulfidptosis represents a novel cell death mechanism triggered by disulfide stress, with potential implications for advancements in cancer treatments. Although emerging evidence highlights the critical regulatory roles of long non-coding RNAs (lncRNAs) in the pathobiology of lung adenocarcinoma (LUAD), research into lncRNAs specifically associated with disulfidptosis in LUAD, termed disulfidptosis-related lncRNAs (DRLs), remains insufficiently explored. Using The Cancer Genome Atlas (TCGA)-LUAD dataset, we implemented ten machine learning techniques, resulting in 101 distinct model configurations. To assess the predictive accuracy of our model, we employed both the concordance index (C-index) and receiver operating characteristic (ROC) curve analyses. For a deeper understanding of the underlying biological pathways, we referred to the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) for functional enrichment analysis. Moreover, we explored differences in the tumor microenvironment between high-risk and low-risk patient cohorts. Additionally, we thoroughly assessed the prognostic value of the DRLs signatures in predicting treatment outcomes. The Kaplan-Meier (KM) survival analysis demonstrated a significant difference in overall survival (OS) between the high-risk and low-risk cohorts (p < 0.001). The prognostic model showed robust performance, with an area under the ROC curve exceeding 0.75 at one year and maintaining a value above 0.72 in the two and three-year follow-ups. Further research identified variations in tumor mutational burden (TMB) and differential responses to immunotherapies and chemotherapies. Our validation, using three GEO datasets (GSE31210, GSE30219, and GSE50081), revealed that the C-index exceeded 0.67 for GSE31210 and GSE30219. Significant differences in disease-free survival (DFS) and OS were observed across all validation cohorts among different risk groups. 
The prognostic model offers potential as a molecular biomarker for LUAD prognosis.

Project description:Accurate prognostic prediction is crucial for treatment decision-making in lung papillary adenocarcinoma (LPADC). The aim of this study was to predict cancer-specific survival in LPADC using ensemble machine learning and classical Cox regression models. Moreover, models were evaluated to provide recommendations based on quantitative data for personalized treatment of LPADC. Data of patients diagnosed with LPADC (2004-2018) were extracted from the Surveillance, Epidemiology, and End Results database. The set of samples was randomly divided into the training and validation sets at a ratio of 7:3. Three ensemble models were selected, namely gradient boosting survival (GBS), random survival forest (RSF), and extra survival trees (EST). In addition, Cox proportional hazards (CoxPH) regression was used to construct the prognostic models. The Harrell's concordance index (C-index), integrated Brier score (IBS), and area under the time-dependent receiver operating characteristic curve (time-dependent AUC) were used to evaluate the performance of the predictive models. A user-friendly web access panel was provided to easily evaluate the model for the prediction of survival and treatment recommendations. A total of 3615 patients were randomly divided into the training and validation cohorts (n = 2530 and 1085, respectively). The extra survival trees, RSF, GBS, and CoxPH models showed good discriminative ability and calibration in both the training and validation cohorts (mean of time-dependent AUC: > 0.84 and > 0.82; C-index: > 0.79 and > 0.77; IBS: < 0.16 and < 0.17, respectively). The RSF and GBS models were more consistent than the CoxPH model in predicting long-term survival. We implemented the developed models as web applications for deployment into clinical practice (accessible through https://shinyshine-820-lpaprediction-model-z3ubbu.streamlit.app/ ). All four prognostic models showed good discriminative ability and calibration. 
The RSF and GBS models exhibited the highest effectiveness among all models in predicting the long-term cancer-specific survival of patients with LPADC. This approach may facilitate the development of personalized treatment plans and prediction of prognosis for LPADC.
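
Several of the descriptions here report Harrell's concordance index (C-index). As a rough illustration of what that number measures, the following is a minimal sketch, not the implementation used in any of these studies; the function name `concordance_index` and the toy values are hypothetical, and pairs with tied follow-up times are simply skipped.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index for right-censored survival data (simplified).

    times:  observed follow-up time per patient.
    events: 1 if the event was observed, 0 if the patient was censored.
    risk:   predicted risk score (higher means shorter expected survival).
    A pair is comparable when the patient with the shorter time had an event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): four patients, one censored.
times = [5, 10, 12, 20]
events = [1, 0, 1, 1]
c_perfect = concordance_index(times, events, [0.9, 0.3, 0.6, 0.1])
c_mixed = concordance_index(times, events, [0.9, 0.3, 0.1, 0.6])
print(c_perfect, c_mixed)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.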

Project description:Background: Alternative splicing (AS) plays critical roles in generating protein diversity and complexity. Dysregulation of AS underlies the initiation and progression of tumors. Machine learning approaches have emerged as efficient tools to identify promising biomarkers. It is meaningful to explore pivotal AS events (ASEs) to deepen understanding and improve prognostic assessments of lung adenocarcinoma (LUAD) via machine learning algorithms. Method: RNA sequencing data and AS data were extracted from The Cancer Genome Atlas (TCGA) database and TCGA SpliceSeq database. Using several machine learning methods, we identified 24 pairs of LUAD-related ASEs implicated in splicing switches and a random forest-based classifier for identifying lymph node metastasis (LNM) consisting of 12 ASEs. Furthermore, we identified key prognosis-related ASEs and established a 16-ASE-based prognostic model to predict overall survival for LUAD patients using a Cox regression model, random survival forest analysis, and a forward selection model. Bioinformatics analyses were also applied to identify underlying mechanisms and associated upstream splicing factors (SFs). Results: Each pair of ASEs was spliced from the same parent gene and exhibited perfect inverse intrapair correlation (correlation coefficient = -1). The 12-ASE-based classifier showed robust ability to evaluate the LNM status of LUAD patients, with an area under the receiver operating characteristic (ROC) curve (AUC) of more than 0.7 in fivefold cross-validation. The prognostic model performed well at 1, 3, 5, and 10 years in both the training cohort and the internal test cohort. Univariate and multivariate Cox regression indicated the prognostic model could be used as an independent prognostic factor for patients with LUAD. Further analysis revealed correlations between the prognostic model and American Joint Committee on Cancer stage, T stage, N stage, and living status. The splicing network constructed from survival-related SFs and ASEs depicts regulatory relationships between them. Conclusion: In summary, our study provides insight into LUAD research and management based on these AS biomarkers.

Project description:Objectives: Lung adenocarcinoma (LUAD) accounts for a majority of cancer-related deaths worldwide annually. The identification of prognostic biomarkers and prediction of prognosis for LUAD patients is necessary. Materials and Methods: In this study, LUAD RNA-Seq data and clinical data from the Cancer Genome Atlas (TCGA) were divided into TCGA cohort I (n = 338) and II (n = 168). The cohort I was used for model construction, and the cohort II and data from Gene Expression Omnibus (GSE72094 cohort, n = 393; GSE11969 cohort, n = 149) were utilized for validation. First, the survival-related seed genes were selected from the cohort I using the machine learning model (random survival forest, RSF), and then in order to improve prediction accuracy, the forward selection model was utilized to identify the prognosis-related key genes among the seed genes using the clinically-integrated RNA-Seq data. Second, the survival risk score system was constructed by using these key genes in the cohort II, the GSE72094 cohort and the GSE11969 cohort, and the evaluation metrics such as HR, p value and C-index were calculated to validate the proposed method. Third, the developed approach was compared with the previous five prediction models. Finally, bioinformatics analyses (pathway, heatmap, protein-gene interaction network) have been applied to the identified seed genes and key genes. Results and Conclusion: Based on the RSF model and clinically-integrated RNA-Seq data, we identified sixteen key genes that formed the prognostic gene expression signature. These sixteen key genes could achieve a strong power for prognostic prediction of LUAD patients in cohort II (HR = 3.80, p = 1.63e-06, C-index = 0.656), and were further validated in the GSE72094 cohort (HR = 4.12, p = 1.34e-10, C-index = 0.672) and GSE11969 cohort (HR = 3.87, p = 6.81e-07, C-index = 0.670). 
The experimental results of three independent validation cohorts showed that compared with the traditional Cox model and the use of standalone RNA-Seq data, the machine-learning-based method effectively improved the prediction accuracy of LUAD prognosis, and the derived model was also superior to the other five existing prediction models. KEGG pathway analysis found eleven of the sixteen genes were associated with Nicotine addiction. Thirteen of the sixteen genes were reported for the first time as the LUAD prognosis-related key genes. In conclusion, we developed a sixteen-gene prognostic marker for LUAD, which may provide a powerful prognostic tool for precision oncology.

Project description:Background: Disulfidptosis is a newly identified variant of cell death characterized by disulfide accumulation, which is independent of ATP depletion. Accordingly, the latent influence of disulfidptosis on the prognosis of lung adenocarcinoma (LUAD) patients and the progression of tumors remains poorly understood. Methods: We conducted a multifaceted analysis of the transcriptional and genetic modifications in disulfidptosis regulators (DRs) specific to LUAD, followed by an evaluation of their expression configurations to define DR clusters. Harnessing the differentially expressed genes (DEGs) identified from these clusters, we formulated an optimal predictive model by amalgamating 10 distinct machine learning algorithms across 101 unique combinations to compute the disulfidptosis score (DS). Patients were subsequently stratified into high and low DS cohorts based on median DS values. We then performed an exhaustive comparison between these cohorts, focusing on somatic mutations, clinical attributes, tumor microenvironment, and treatment responsiveness. Finally, we empirically validated the biological implications of a critical gene, KYNU, through assays in LUAD cell lines. Results: We identified two DR clusters with marked differences in overall survival (OS) and tumor microenvironment. We selected the "Least Absolute Shrinkage and Selection Operator (LASSO) + Random Survival Forest (RSF)" algorithm to develop a DS based on the average C-index across different cohorts. Our model effectively stratified LUAD patients into high- and low-DS subgroups, with the latter demonstrating superior OS, a reduced mutational landscape, enhanced immune status, and increased sensitivity to immunotherapy. Notably, the predictive accuracy of DS outperformed the published LUAD signature and clinical features. Finally, we validated the DS expression using clinical samples and found that inhibiting KYNU suppressed LUAD cell proliferation, invasiveness, and migration in vitro. Conclusions: The DR-based scoring system that we developed enabled accurate prognostic stratification of LUAD patients and provides important insights into the molecular mechanisms and treatment strategies for LUAD.

Project description:Lung cancer is the second most common cancer in the United States and the leading cause of mortality in cancer patients. Biomarkers predicting survival of patients with lung cancer have a profound effect on patient prognosis and treatment. However, predictive biomarkers for survival and their relevance for lung cancer are not yet well known. The objective of this study was to perform machine learning with data from The Cancer Genome Atlas of patients with lung adenocarcinoma (LUAD) to find survival-specific gene mutations that could be used as survival-predicting biomarkers. To identify survival-specific mutations according to various clinical factors, four feature selection methods (information gain, chi-squared test, minimum redundancy maximum relevance, and correlation) were used. Extracted survival-specific mutations of LUAD were applied individually or as a group for Kaplan-Meier survival analysis. Mutations in MMRN2 and GMPPA were significantly associated with patient mortality while those in ZNF560 and SETX were associated with patient survival. Mutations in DNAJC2 and MMRN2 showed significant negative association with overall survival while mutations in ZNF560 showed significant positive association with overall survival. Mutations in MMRN2 showed significant negative association with disease-free survival while mutations in DRD3 and ZNF560 showed positive association with disease-free survival. Mutations in DRD3, SETX, and ZNF560 showed significant positive association with survival in patients with LUAD while the opposite was true for mutations in DNAJC2, GMPPA, and MMRN2. These gene mutations were also found in other cohorts of LUAD, lung squamous cell carcinoma, and small cell lung cancer. In the LUAD subset of the Pan-Lung Cancer cohort, mutations in GMPPA, DNAJC2, and MMRN2 showed significant negative associations with patient survival while mutations in DRD3 and SETX showed significant positive association with survival. 
In this study, machine learning was conducted to obtain information necessary to discover specific gene mutations associated with the survival of patients with LUAD. Mutations in the above six genes could predict survival rate and disease-free survival rate in patients with LUAD. Thus, they are important biomarker candidates for prognosis.
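
The Kaplan-Meier survival analysis mentioned in the description above can be sketched in a few lines. This is an illustrative product-limit estimator under stated assumptions, not the study's code; the function name `kaplan_meier` and the toy follow-up data are hypothetical. In practice one curve would be computed per group (for example, mutated versus wild-type patients) and the curves compared.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) estimate of the survival function.

    times:  follow-up time per patient.
    events: 1 if the event (death) was observed, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Both deaths and censored patients leave the risk set after time t.
        n_at_risk -= removed
    return curve


# Toy follow-up data (hypothetical): five patients, two censored.
km_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in km_curve:
    print(t, round(s, 3))
```

For this toy input the estimate steps down at times 2, 3, and 5 to roughly 0.8, 0.6, and 0.3.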

Project description:Background: The aim of this retrospective research was to develop a prognostic signature for lung adenocarcinoma (LUAD) based on immune-related genes significantly associated with m5C methylation (m5C-IRGs). Methods: We used transcriptome data to screen out m5C-IRGs in The Cancer Genome Atlas (TCGA)-LUAD dataset. Subsequently, the m5C-IRGs associated with survival were confirmed by Kaplan-Meier (K-M) analysis. Univariate Cox regression, least absolute shrinkage and selection operator (LASSO) regression, and the xgboost.surv tool were adopted to build a LUAD prognostic signature. We further conducted gene functional enrichment, immune microenvironment, and immunotherapy analyses between the 2 risk subgroups. Finally, we verified the expression of the m5C-IRGs-related prognostic genes at the transcription level. Results: A total of 76 m5C-IRGs were identified in the TCGA-LUAD dataset. Furthermore, 27 m5C-IRGs associated with survival were retained. Then, an m5C-IRGs prognostic signature was built based on 3 prognostic genes (HLA-DMB, PPIA, and GPI). Independent prognostic analysis suggested that stage and RiskScore could be used as independent prognostic factors. We found that 4104 differentially expressed genes (DEGs) between the 2 risk subgroups were mainly involved in immune receptor pathways. We found certain differences in the LUAD immune microenvironment between the 2 risk subgroups. Then, immunotherapy analysis and chemotherapeutic drug sensitivity results indicated that the m5C-IRGs-related gene signature might be applied as a therapy predictor. Finally, we found significantly higher expression of PPIA and GPI in the LUAD group compared to the normal group. Conclusions: The prognostic signature comprising HLA-DMB, PPIA, and GPI based on m5C-IRGs was established, which might provide a theoretical basis and reference value for the research of LUAD. Public datasets analyzed in the study: The TCGA-LUAD dataset was collected from the TCGA (https://portal.gdc.cancer.gov/) database; GSE31210 (validation set) was retrieved from the GEO (https://www.ncbi.nlm.nih.gov/geo/) database.