Dataset Information

Breast cancer prediction using genome wide single nucleotide polymorphism data.

ABSTRACT: BACKGROUND: This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer or not, based on her SNP profile. RESULTS: We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada using Affymetrix Human SNP 6.0 arrays. Then, we applied EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Then, we filtered any SNP that had any missing calls, whose genotype frequency was deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of MeanDiff feature selection method and KNN learning method to this filtered dataset to produce a breast cancer prediction model. LOOCV accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning method, but found that none of these combinations produces a better predictive model than our model. We also considered various biological feature selection methods like selecting SNPs reported in recent genome wide association studies to be associated with breast cancer, selecting SNPs in genes associated with KEGG cancer pathways, or selecting SNPs associated with breast cancer in the F-SNP database to produce predictive models, but again found that none of these models achieved accuracy better than baseline. CONCLUSIONS: We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.

SUBMITTER: Hajiloo M

PROVIDER: S-EPMC3891310 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Breast cancer prediction using genome wide single nucleotide polymorphism data.

Hajiloo Mohsen M Damavandi Babak B Hooshsadat Metanat M Sangi Farzad F Mackey John R JR Cass Carol E CE Greiner Russell R Damaraju Sambasivarao S

BMC bioinformatics 20131001

<h4>Background</h4>This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer or not, based on her SNP profile.<h4>Results</h4>We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada using Affymetrix Human SNP 6.0 arrays. Then, we applied EIGENSTRAT population stratification correction method to remove 73 subjec ...[more]

PMID: 24266904

Similar Datasets

Project description:BackgroundWe aimed at predicting fatigue after breast cancer treatment using machine learning on clinical covariates and germline genome-wide data.MethodsWe accessed germline genome-wide data of 2799 early-stage breast cancer patients from the Cancer Toxicity study (NCT01993498). The primary endpoint was defined as scoring zero at diagnosis and higher than quartile 3 at 1 year after primary treatment completion on European Organization for Research and Treatment of Cancer quality-of-life questionnaires for Overall Fatigue and on the multidimensional questionnaire for Physical, Emotional, and Cognitive fatigue. First, we tested univariate associations of each endpoint with clinical variables and genome-wide variants. Then, using preselected clinical (false discovery rate < 0.05) and genomic (P < .001) variables, a multivariable preconditioned random-forest regression model was built and validated on a hold-out subset to predict fatigue. Gene set enrichment analysis identified key biological correlates (MetaCore). All statistical tests were 2-sided.ResultsStatistically significant clinical associations were found only with Emotional and Cognitive Fatigue, including receipt of chemotherapy, anxiety, and pain. Some single nucleotide polymorphisms had some degree of association (P < .001) with the different fatigue endpoints, although there were no genome-wide statistically significant (P < 5.00 × 10-8) associations. Only for Cognitive Fatigue, the predictive ability of the genomic multivariable model was statistically significantly better than random (area under the curve = 0.59, P = .01) and marginally improved with clinical variables (area under the curve = 0.60, P = .005). Single nucleotide polymorphisms found to be associated (P < .001) with Cognitive Fatigue belonged to genes linked to inflammation (false discovery rate adjusted P = .03), cognitive disorders (P = 1.51 × 10-12), and synaptic transmission (P = 6.28 × 10-8).ConclusionsGenomic analyses in this large cohort of breast cancer survivors suggest a possible genetic role for severe Cognitive Fatigue that warrants further exploration.

Dataset Information

Breast cancer prediction using genome wide single nucleotide polymorphism data.

Publications

Breast cancer prediction using genome wide single nucleotide polymorphism data.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets