Dataset Information

Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer.

ABSTRACT: Breast cancer is the most common invasive cancer and the second leading cause of cancer death in women, and regrettably, this rate is increasing every year. One aspect of all cancers, including breast cancer, is recurrence of the disease, which causes painful consequences for patients. The practical application of data mining in the field of breast cancer can help provide the information and knowledge physicians need for accurate prediction of breast cancer recurrence and better decision-making. The main objective of this study is to compare different data mining algorithms and select the most accurate model for predicting breast cancer recurrence. This is a cross-sectional study. Data were gathered from June 2018 to June 2019 from the official statistics of the Ministry of Health and Medical Education and the Iran Cancer Research Center, covering patients with breast cancer who had been followed for a minimum of 5 years from February 2014 to April 2019, and comprising 5471 independent records. After initial pre-processing of the dataset and variables, seven new and conventional data mining algorithms were applied, each representing one kind of data mining approach. The results show that the C5.0 algorithm could be a helpful tool for predicting breast cancer recurrence at the stage of distant recurrence and nonrecurrence, especially in the first to third years. In addition, LN involvement rate, Her2 value, tumor size, and free or closed tumor margin were found to be the most important features in the dataset for predicting breast cancer recurrence.

SUBMITTER: Mosayebi A

PROVIDER: S-EPMC7561198 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature


Publications

Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer.

Mosayebi Alireza, Mojaradi Barat, Bonyadi Naeini Ali, Khodadad Hosseini Seyed Hamid

PLoS One, 2020-10-15, issue 10


PMID: 33057328

Similar Datasets

Project description: BACKGROUND: About 90% of patients who have diabetes suffer from Type 2 DM (T2DM). Many studies suggest using the significant role of lncRNAs to improve the diagnosis of T2DM. Machine learning and data mining techniques are tools that can improve the analysis, interpretation, and extraction of knowledge from data. These techniques may improve the prognosis and diagnosis of diseases such as T2DM. We applied four classification models, K-nearest neighbor (KNN), support vector machine (SVM), logistic regression, and artificial neural networks (ANN), to diagnose T2DM, and we compared the diagnostic power of these algorithms with each other. We ran the algorithms on six lncRNA variables (LINC00523, LINC00995, HCG27_201, TPT1-AS1, LY86-AS1, DKFZP) and demographic data. RESULTS: To select the best-performing model, we considered the AUC, sensitivity, and specificity, plotted the ROC curve, and showed the average curve and range. The mean AUC for the KNN algorithm was 91% with a standard deviation (SD) of 0.09; the mean sensitivity and specificity were 96% and 85%, respectively. For the SVM algorithm, the mean AUC after stratified 10-fold cross-validation was 95%, with an SD of 0.05; the mean sensitivity and specificity were 95% and 86%, respectively. For ANN, the mean AUC was 93% with an SD of 0.03, and the mean sensitivity and specificity were 78% and 85%, respectively. Finally, for the logistic regression algorithm, the mean AUC was 95% with an SD of 0.05, and the mean sensitivity and specificity were 92% and 85%, respectively. According to the ROC curves, logistic regression and SVM had a better area under the curve than the others. CONCLUSION: We aimed to find the best data mining approach for the prediction of T2DM using the expression of six lncRNAs. According to the findings, SVM and logistic regression achieved the highest AUC; KNN and ANN also had high mean AUC with small standard deviations of the AUC scores. Among the approaches, KNN had the highest mean sensitivity and SVM had the highest specificity. The results of this study could improve our knowledge about the early detection and diagnosis of T2DM using lncRNAs as biomarkers.
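The evaluation scheme described above (per-fold AUC, then mean and SD across stratified folds, plus sensitivity and specificity) can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the labels and scores are hypothetical toy values, the scores are assumed to come from some already-fitted classifier, the helper names are made up for this example, and two folds are used instead of ten to keep the data small.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    Counts the fraction of (positive, negative) pairs in which the positive
    case gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def stratified_folds(labels, k):
    """Split indices into k folds, dealing each class out round-robin so that
    every fold keeps roughly the overall class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        class_indices = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


# Hypothetical toy data: 1 = case, 0 = control; scores are assumed model outputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.6, 0.2, 0.1]

# Score each held-out fold separately, then summarise across folds.
fold_aucs = []
for fold in stratified_folds(labels, k=2):
    fold_labels = [labels[i] for i in fold]
    fold_scores = [scores[i] for i in fold]
    fold_aucs.append(auc(fold_labels, fold_scores))

mean_auc = mean(fold_aucs)
sd_auc = stdev(fold_aucs)
sens, spec = sensitivity_specificity(labels, scores)
print(f"mean AUC={mean_auc:.3f}, SD={sd_auc:.3f}, "
      f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In a real study each fold's scores would come from a model trained on the remaining folds; the sketch skips model fitting and shows only how the reported summary statistics are formed.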

Project description: Introduction: Mutations in BRCA1 and BRCA2 confer high risks of breast cancer and ovarian cancer. The risk prediction algorithm BOADICEA (Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm) may be used to compute the probabilities of carrying mutations in BRCA1 and BRCA2 and help to target mutation screening. Tumours from BRCA1 and BRCA2 mutation carriers display distinctive pathological features that could be used to better discriminate between BRCA1 mutation carriers, BRCA2 mutation carriers and noncarriers. In particular, oestrogen receptor (ER)-negative status, triple-negative (TN) status, and expression of basal markers are predictive of BRCA1 mutation carrier status. Methods: We extended BOADICEA by treating breast cancer subtypes as distinct disease end points. Age-specific expression of phenotypic markers in a series of tumours from 182 BRCA1 mutation carriers, 62 BRCA2 mutation carriers and 109 controls from the Breast Cancer Linkage Consortium, and over 300,000 tumours from the general population obtained from the Surveillance, Epidemiology, and End Results database, were used to calculate age-specific and genotype-specific incidences of each disease end point. The probability that an individual carries a BRCA1 or BRCA2 mutation given their family history and tumour marker status of family members was computed in sample pedigrees. Results: The cumulative risk of ER-negative breast cancer by age 70 for BRCA1 mutation carriers was estimated to be 55% and the risk of ER-positive disease was 18%. The corresponding risks for BRCA2 mutation carriers were 21% and 44% for ER-negative and ER-positive disease, respectively. The predicted BRCA1 carrier probabilities among ER-positive breast cancer cases were less than 1% at all ages. For women diagnosed with breast cancer below age 50 years, these probabilities rose to more than 5% in ER-negative breast cancer, 7% in TN disease and 24% in TN breast cancer expressing both CK5/6 and CK14 cytokeratins. Large differences in mutation probabilities were observed by combining ER status and other informative markers with family history. Conclusions: This approach combines both full pedigree and tumour subtype data to predict BRCA1/2 carrier probabilities. Prediction of BRCA1/2 carrier status, and hence selection of women for mutation screening, may be substantially improved by combining tumour pathology with family history of cancer.

Project description: Many chemicals that disrupt endocrine function have been linked to a variety of adverse biological outcomes. However, screening for endocrine disruption using in vitro or in vivo approaches is costly and time-consuming. Computational methods, e.g., quantitative structure-activity relationship models, have become more reliable due to bigger training sets, increased computing power, and advanced machine learning algorithms, such as multilayered artificial neural networks. Machine learning models can be used to predict compounds for endocrine disrupting capabilities, such as binding to the estrogen receptor (ER), and allow for prioritization and further testing. In this work, an exhaustive comparison of multiple machine learning algorithms, chemical spaces, and evaluation metrics for ER binding was performed on public data sets curated using in-house cheminformatics software (Assay Central). Chemical features utilized in modeling consisted of binary fingerprints (ECFP6, FCFP6, ToxPrint, or MACCS keys) and continuous molecular descriptors from RDKit. Each feature set was subjected to classic machine learning algorithms (Bernoulli Naive Bayes, AdaBoost Decision Tree, Random Forest, Support Vector Machine) and Deep Neural Networks (DNN). Models were evaluated using a variety of metrics: recall, precision, F1-score, accuracy, area under the receiver operating characteristic curve, Cohen's Kappa, and Matthews correlation coefficient. For predicting compounds within the training set, DNN has an accuracy higher than that of other methods; however, in 5-fold cross validation and external test set predictions, DNN and most classic machine learning models perform similarly regardless of the data set or molecular descriptors used. We have also used the rank normalized scores as a performance criterion for each machine learning method, and Random Forest performed best on the validation set when ranked by metric or by data sets. These results suggest classic machine learning algorithms may be sufficient to develop high quality predictive models of ER activity.
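Several of the metrics listed above can be derived from a single binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the counts are hypothetical and are not taken from the study, and the code assumes every denominator is nonzero.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    Assumes all denominators are nonzero (no empty class, no empty prediction).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal rates of each class.
    chance_yes = ((tp + fp) / total) * ((tp + fn) / total)
    chance_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_yes + chance_no
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts for a binder / non-binder classifier.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look good on imbalanced data, which is one reason to report kappa and the Matthews correlation coefficient alongside it.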
