Dataset Information

Factors affecting the accuracy of a class prediction model in gene expression data.

ABSTRACT:

BACKGROUND: Class prediction models have been shown to have varying performances in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While a substantial amount of information is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data have an impact on the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer.

RESULTS: Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the categories: discriminant analyses or Bayes classifiers, tree-based methods, regularization and shrinkage methods, and nearest neighbors methods. Consequently, nine class prediction models were built for each dataset using the same procedure, and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size), together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model and individually explained up to 72% and 57% of the variation in prediction accuracy, respectively. Multivariable random effects logistic regression with forward selection yielded the two aforementioned study factors and the within-class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between-study variation.

CONCLUSIONS: We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
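
As a rough illustration of two of the data characteristics named in the abstract (the number of differentially expressed genes and the average fold change), the sketch below summarizes a toy two-class expression table. It is not the authors' procedure: the abstract does not say how differential expression was called, so the fixed cutoff `threshold` on the absolute log2 fold change, the function names, and the toy values are all assumptions made for this example.

```python
from statistics import mean


def log2_fold_changes(expr, labels):
    """Per-gene difference in class means (class 1 minus class 0).

    expr: dict mapping gene name -> list of log2 expression values,
          one value per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    changes = {}
    for gene, values in expr.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        changes[gene] = mean(group1) - mean(group0)
    return changes


def summarize(expr, labels, threshold=1.0):
    """Count genes whose |log2 fold change| reaches the (assumed) cutoff
    and report the mean absolute fold change among those genes."""
    fc = log2_fold_changes(expr, labels)
    selected = [g for g, d in fc.items() if abs(d) >= threshold]
    mean_abs_fc = mean(abs(fc[g]) for g in selected) if selected else 0.0
    return len(selected), mean_abs_fc


# Toy data: three genes, two samples per class (values are made up).
expr = {
    "geneA": [5.0, 5.2, 7.1, 7.3],
    "geneB": [3.0, 3.1, 3.0, 3.2],
    "geneC": [8.0, 8.4, 6.0, 6.2],
}
labels = [0, 0, 1, 1]
n_de, avg_fc = summarize(expr, labels)
print(n_de, round(avg_fc, 2))
```

A real analysis would typically replace the fixed cutoff with a statistical test and multiple-testing correction; the cutoff is used here only to keep the example short.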

SUBMITTER: Novianti PW

PROVIDER: S-EPMC4475623 | biostudies-literature | 2015 Jun

REPOSITORIES: biostudies-literature


Publications

Factors affecting the accuracy of a class prediction model in gene expression data.

Novianti, Putri W; Jong, Victor L; Roes, Kit C B; Eijkemans, Marinus J C

BMC Bioinformatics, 2015 Jun 21


PMID: 26093633

Similar Datasets

Project description: Medical record abstraction (MRA) is often cited as a significant source of error in research data, yet MRA methodology has rarely been the subject of investigation. Lack of a common framework has hindered application of the extant literature in practice, and, until now, there were no evidence-based guidelines for ensuring data quality in MRA. We aimed to identify the factors affecting the accuracy of data abstracted from medical records and to generate a framework for data quality assurance and control in MRA. Candidate factors were identified from published reports of MRA. Content validity of the top candidate factors was assessed via a four-round two-group Delphi process with expert abstractors with experience in clinical research, registries, and quality improvement. The resulting coded factors were categorized into a control theory-based framework of MRA. Coverage of the framework was evaluated using the recent published literature. Analysis of the identified articles yielded 292 unique factors that affect the accuracy of abstracted data. Delphi processes overall refuted three of the top factors identified from the literature based on importance and five based on reliability (six total factors refuted). Four new factors were identified by the Delphi. The generated framework demonstrated comprehensive coverage. Significant underreporting of MRA methodology in recent studies was discovered. The framework generated from this research provides a guide for planning data quality assurance and control for studies using MRA. The large number and variability of factors indicate that while prospective quality assurance likely increases the accuracy of abstracted data, monitoring the accuracy during the abstraction process is also required. Recent studies reporting research results based on MRA rarely reported data quality assurance or control measures, and even less frequently reported data quality metrics with research results. Given the demonstrated variability, these methods and measures should be reported with research results.

Project description: Using genetic data to predict gene expression has garnered significant attention in recent years. PrediXcan has become one of the most widely used gene-based methods for testing associations between predicted gene expression values and a phenotype, which has facilitated novel insights into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The gene expression prediction models for PrediXcan were developed using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we evaluate the accuracy of PrediXcan for predicting gene expression in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European ancestry populations for thousands of genes. We evaluate a range of models from the PrediXcan weight databases and use Pearson's correlation coefficient to assess gene expression prediction accuracy with PrediXcan. From our evaluation, we find that the predictive performance of PrediXcan varies substantially among populations from different continents (F-test p-value < 2.2 × 10^-16), where prediction accuracy is lower in the Yoruban population from West Africa compared to the European-ancestry populations. Moreover, not only do we find differences in predictive performance between populations from different continents, we also find highly significant differences in prediction accuracy among the four European ancestry populations considered (F-test p-value < 2.2 × 10^-16). Finally, while there is variability in prediction accuracy across different PrediXcan weight databases, we also find consistency in the qualitative performance of PrediXcan for the five populations considered, with the African ancestry population having the lowest accuracy across databases.
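
The description above uses Pearson's correlation coefficient between predicted and observed expression as the accuracy measure. A minimal sketch of that calculation for a single gene follows; the function name `pearson_r` and the toy values are assumptions for illustration, and nothing here calls PrediXcan itself.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either variable is constant.
        return float("nan")
    return sxy / sqrt(sxx * syy)


# Made-up expression values for one gene across four individuals.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(predicted, observed)
print(round(r, 3))
```

Repeating this per gene and per population gives the kind of per-population accuracy distribution that the description compares.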

Project description: Background: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. Conclusions: Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
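
To make the point about class-specific accuracy concrete, the sketch below shows how a classifier that always predicts the majority class can look accurate overall while failing entirely on the minority class, and how down-sizing balances a training set. The function names, the random seed, and the toy labels are assumptions for this example; the study's own down-sizing procedure is not described in enough detail here to reproduce.

```python
import random
from collections import defaultdict


def class_specific_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {c: correct[c] / total[c] for c in total}


def downsize(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Eight majority-class samples, two minority-class samples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the majority class
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = class_specific_accuracy(y_true, y_pred)
print(overall, per_class)

balanced_samples, balanced_labels = downsize(list(range(10)), y_true)
print(sorted(balanced_labels))
```

Overall accuracy is 0.8 in this toy case even though no minority-class sample is predicted correctly, which is why the description recommends reporting accuracy per class.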

Project description: Background: In practice, some drugs produce a number of negative biological effects that can mitigate their effectiveness as a remedy. To address this issue, several studies have been performed for the prediction of drug-induced toxicity from gene-expression data, and a significant amount of work has been done on predicting limited drug-induced symptoms or single-organ toxicity. Since drugs often lead to some injuries in several organs like liver or kidney, however, it would be very useful to forecast the drug-induced injuries for multiple organs. Therefore, in this work, our aim was to develop a multi-organ toxicity prediction model using an integrative model of gene-expression data. Results: To train our integrative model, we used 3708 in-vivo samples of gene-expression profiles exposed to one of 41 drugs related to 21 distinct physiological changes divided between liver and kidney (liver 11, kidney 10). Specifically, we used the gene-expression profiles to learn an ensemble classifier for each of 21 pathology prediction models. Subsequently, these classifiers were combined with weights to generate an integrative model for each pathological finding. The integrative model outputs the likeliness of presenting the trained pathology in a given test sample of gene-expression profile, called an integrative prediction score (IPS). For the evaluation of an integrative model, we estimated the prediction performance with the k-fold cross-validation. Our results demonstrate that the proposed integrative model is superior to individual pathology prediction models in predicting multi-organ drug-induced toxicities over all the targeted pathological findings. On average, the AUC of the integrative models was 88% while the AUC of individual pathology prediction models was 68%. Conclusions: Not only does this integrative model produce comparable prediction performance to existing approaches, but also it produces very stable performance overall. In addition, our approach is easily expandable to a variety of other multi-organ toxicology applications.
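
The description says that per-pathology classifiers were "combined with weights" into an integrative prediction score and evaluated by AUC, but it does not say how the weights were chosen or normalized. The sketch below therefore assumes a simple normalized weighted average and computes AUC by pairwise comparison; the function names, weights, and scores are hypothetical and are not taken from the study.

```python
def integrative_score(scores, weights):
    """Weighted average of classifier scores (assumed combination rule)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


def auc(labels, scores):
    """Area under the ROC curve as the probability that a random positive
    sample scores higher than a random negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores from three classifiers for four test samples.
weights = [0.5, 0.3, 0.2]
per_sample_scores = [
    [0.2, 0.1, 0.3],
    [0.6, 0.4, 0.5],
    [0.4, 0.7, 0.6],
    [0.9, 0.8, 0.7],
]
labels = [0, 0, 1, 1]
ips = [integrative_score(s, weights) for s in per_sample_scores]
print([round(v, 2) for v in ips], auc(labels, ips))
```

In a k-fold cross-validation setting, the scores passed to `auc` would come from held-out folds rather than from the samples used to fit the classifiers.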

Project description: Background: Transmissible spongiform encephalopathy diseases are untreatable, uniformly fatal degenerative syndromes of the central nervous system that can be transmitted both within as well as between species. The bovine spongiform encephalopathy (BSE) epidemic and the emergence of a new human variant of Creutzfeldt-Jakob disease (vCJD), have profoundly influenced beef production processes as well as blood donation and surgical procedures. Simple, robust and cost effective diagnostic screening and surveillance tools are needed for both the preclinical and clinical stages of TSE disease in order to minimize both the economic costs and zoonotic risk of BSE and to further reduce the risk of secondary vCJD. Objective: Urine is well suited as the matrix for an ante-mortem test for TSE diseases because it would permit non-invasive and repeated sampling. In this study urine samples collected from BSE infected and age matched control cattle were screened for the presence of individual proteins that exhibited disease specific changes in abundance in response to BSE infection that might form the basis of such an ante-mortem test. Results: Two-dimensional differential gel electrophoresis (2D-DIGE) was used to identify proteins exhibiting differential abundance in two sets of cattle. The known set consisted of BSE infected steers and age matched controls throughout the course of the disease. The blinded unknown set was composed of BSE infected and control samples of both genders, a wide range of ages and two different breeds. Multivariate analyses of individual protein abundance data generated classifiers comprised of the proteins best able to discriminate between the samples based on disease state, breed, age and gender. Conclusion: Despite the presence of confounding factors, the disease specific changes in abundance exhibited by a panel of urine proteins permitted the creation of classifiers able to discriminate between control and infected cattle with a high degree of accuracy.
