Dataset Information

Selecting a single model or combining multiple models for microarray-based classifier development?--a comparative analysis based on large and diverse datasets generated from the MAQC-II project.

ABSTRACT:

Background

Genomic biomarkers play an increasing role in both preclinical and clinical applications. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomic biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models') from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models.

Methods

We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation.
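
The idea of combining top-ranked models can be illustrated with a minimal sketch. The abstract does not state how many models were combined or how their predictions were merged, so the majority vote, the `top_k` parameter, the tie-breaking rule, and the toy threshold models below are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter

def build_ensemble(models_with_cv_scores, top_k):
    """Keep the top_k models ranked by cross-validation score (higher is better).

    models_with_cv_scores: list of (model, cv_score) pairs, where each model
    is a callable mapping one sample to a class label (hypothetical interface).
    """
    ranked = sorted(models_with_cv_scores, key=lambda pair: pair[1], reverse=True)
    return [model for model, _ in ranked[:top_k]]

def ensemble_predict(ensemble, sample):
    """Majority vote over the ensemble; ties go to the smallest label (assumption)."""
    votes = Counter(model(sample) for model in ensemble)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Toy binary classifiers with made-up cross-validation scores.
candidates = [
    (lambda x: int(x[0] > 0.5), 0.90),
    (lambda x: int(x[1] > 0.5), 0.85),
    (lambda x: int(x[0] + x[1] > 1.0), 0.80),
    (lambda x: 0, 0.40),  # weakest model, dropped when top_k=3
]

ensemble = build_ensemble(candidates, top_k=3)
prediction = ensemble_predict(ensemble, [0.9, 0.2])
```

In this toy case two of the three retained models vote for class 1, so the ensemble predicts class 1 even though one member disagrees.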

Results

For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints.

Conclusions

Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding "optimized" model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.

SUBMITTER: Chen M

PROVIDER: S-EPMC3236846 | biostudies-literature | 2011 Oct

REPOSITORIES: biostudies-literature


Publications

Selecting a single model or combining multiple models for microarray-based classifier development?--a comparative analysis based on large and diverse datasets generated from the MAQC-II project.

Minjun Chen, Leming Shi, Reagan Kelly, Roger Perkins, Hong Fang, Weida Tong

BMC Bioinformatics, 2011 Oct 18


PMID: 22166133

Similar Datasets

Project description: The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models

The second phase of the MicroArray Quality Control (MAQC-II) project evaluated common practices for developing and validating microarray-based models aimed at predicting toxicological and clinical endpoints. The purposes of the MAQC-II project were to survey approaches in genomic model development in an attempt to understand sources of variability in prediction performance, and to assess the influences of endpoint signal strength in data. Thirty-six teams developed classifiers for 13 diverse endpoints (some easy, some difficult to predict) from six relatively large training data sets: three preclinical (toxicogenomics) and three clinical. By providing the same data sets to many organizations for analysis, but not restricting their data analysis protocols (DAPs), the project made it possible to evaluate to what extent, if any, results depend on the team that performs the analysis. These analyses collectively produced >18,000 models that were challenged by independent and blinded validation sets generated for MAQC-II. The cross-validated performance estimates for models developed under good practices are predictive of the blinded validation performance. The achievable prediction performance is largely determined by the intrinsic predictability of the endpoint, and simple data analysis methods often perform as well as more complicated approaches. Multiple models of comparable performance can be developed for a given endpoint, and the stability of gene lists correlates with endpoint predictability. Importantly, similar conclusions were reached when >12,000 new models were generated by swapping the original training and validation sets.

Description of six data sets including 13 prediction endpoints (summarized in GSE16716_MAQC-II_Datasets_Overview.pdf, attached as a supplementary file; for more details, see the MAQC-II main paper and its references for individual datasets):

The MAQC-II predictive modeling was limited to binary classification problems; therefore, continuous endpoint values such as overall survival (OS) and event-free survival (EFS) times were dichotomized using a "milestone" cutoff of censor data. Prediction endpoints were chosen to span a wide range of prediction difficulty. Two endpoints, H (CPS1) and L (NEP_S), representing the gender of the patients, were used as positive control endpoints, since they are easily predictable by microarrays. Two other endpoints, I (CPS1) and M (NEP_R), representing randomly assigned class labels, were designed to serve as negative control endpoints, since they are not supposed to be predictable. Data analysis teams were not aware of the characteristics of endpoints H, I, L, and M until their swap prediction results had been submitted. If a data analysis protocol did not yield models that accurately predicted endpoints H and L, or if a data analysis protocol claimed to yield models that accurately predicted endpoints I and M, something must have gone wrong.

The Hamner data set (endpoint A) was provided by The Hamner Institutes for Health Sciences (Research Triangle Park, NC, USA). The study objective was to apply microarray gene expression data from the lung of female B6C3F1 mice exposed to a 13-week treatment of chemicals to predict increased lung tumor incidence in the 2-year rodent cancer bioassays of the National Toxicology Program. If successful, the results may form the basis of a more efficient and economical approach for evaluating the carcinogenic activity of chemicals. Microarray analysis was performed using Affymetrix Mouse Genome 430 2.0 arrays on three to four mice per treatment group, and a total of 70 mice were analyzed and used as the MAQC-II training set. Additional data from another set of 88 mice were collected later and provided as the MAQC-II external validation set.

The Iconix data set (endpoint B) was provided by Iconix Biosciences, Inc. (Mountain View, CA, USA). The study objective was to assess, upon short-term exposure, hepatic tumor induction by non-genotoxic chemicals, since there are currently no accurate and well-validated short-term tests to identify non-genotoxic hepatic tumorigens, thus necessitating an expensive 2-year rodent bioassay before a risk assessment can begin. The training set consists of hepatic gene expression data from 216 male Sprague-Dawley rats treated for 5 days with one of 76 structurally and mechanistically diverse non-genotoxic hepatocarcinogens and non-hepatocarcinogens. The validation set consists of 201 male Sprague-Dawley rats treated for 5 days with one of 68 structurally and mechanistically diverse non-genotoxic hepatocarcinogens and non-hepatocarcinogens. Gene expression data were generated using the Amersham Codelink Uniset Rat 1 Bioarray (GE HealthCare, Piscataway, NJ). The separation of the training set and validation set was based on the time when the microarray data were collected; i.e., microarrays processed earlier in the study were used as training and those processed later were used as validation.

The NIEHS data set (endpoint C) was provided by the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (Research Triangle Park, NC, USA). The study objective was to use microarray gene expression data acquired from the liver of rats exposed to hepatotoxicants to build classifiers for prediction of liver necrosis. The gene expression "compendium" data set was collected from 418 rats exposed to one of eight compounds (1,2-dichlorobenzene, 1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine, thioacetamide, galactosamine, and diquat dibromide). All eight compounds were studied using standardized procedures, i.e., a common array platform (Affymetrix Rat 230 2.0 microarray), experimental procedures, and data retrieval and analysis processes. Briefly, for each compound, four to six male, 12-week-old F344 rats were exposed to a low dose, mid dose(s), and a high dose of the toxicant and sacrificed 6, 24, and 48 hours later. At necropsy, liver was harvested for RNA extraction, histopathology, and clinical chemistry assessments.

The human breast cancer (BR) data set (endpoints D and E) was contributed by the University of Texas M. D. Anderson Cancer Center (MDACC, Houston, TX, USA). Gene expression data from 230 stage I-III breast cancers were generated from fine needle aspiration specimens of newly diagnosed breast cancers before any therapy. The biopsy specimens were collected sequentially during a prospective pharmacogenomic marker discovery study between 2000 and 2008. These specimens represent 70-90% pure neoplastic cells with minimal stromal contamination. Patients received 6 months of preoperative (neoadjuvant) chemotherapy including paclitaxel, 5-fluorouracil, cyclophosphamide, and doxorubicin, followed by surgical resection of the cancer. Response to preoperative chemotherapy was categorized as a pathological complete response (pCR = no residual invasive cancer in the breast or lymph nodes) or residual invasive cancer (RD), and used as endpoint D for prediction. Endpoint E is the clinical estrogen-receptor status as established by immunohistochemistry. RNA extraction and gene expression profiling were performed in multiple batches over time using Affymetrix U133A microarrays. Genomic analysis of a subset of this sequentially accrued patient population was reported previously. For each endpoint, the first 130 cases were used as a training set and the next 100 cases were used as an independent validation set.

The multiple myeloma (MM) data set (endpoints F, G, H, and I) was contributed by the Myeloma Institute for Research and Therapy at the University of Arkansas for Medical Sciences (UAMS, Little Rock, AR, USA). Gene expression profiling of highly purified bone marrow plasma cells was performed in newly diagnosed patients with MM. The training set consisted of 340 cases enrolled in total therapy 2 (TT2) and the validation set comprised 214 patients enrolled in total therapy 3 (TT3). Plasma cells were enriched by anti-CD138 immunomagnetic bead selection of mononuclear cell fractions of bone marrow aspirates in a central laboratory. All samples applied to the microarray contained more than 85% plasma cells as determined by 2-color flow cytometry (CD38+ and CD45-/dim) performed after selection. Dichotomized overall survival (OS) and event-free survival (EFS) were determined based on a two-year milestone cutoff. A gene expression model of high-risk multiple myeloma was developed and validated by the data provider and later validated in three additional independent data sets.

The neuroblastoma (NB) data set (endpoints J, K, L, and M) was contributed by the Children's Hospital of the University of Cologne, Germany. Tumor samples were checked by a pathologist prior to RNA isolation; only samples with ≥60% tumor content were utilized, and total RNA was isolated from ~50 mg of snap-frozen neuroblastoma tissue obtained before chemotherapeutic treatment. First, 502 pre-existing 11K Agilent dye-flipped, dual-color replicate profiles for 251 patients were provided. Of these, profiles of 246 neuroblastoma samples passed an independent MAQC-II quality assessment by majority decision and formed the MAQC-II training data set. Subsequently, 514 dye-flipped dual-color 11K replicate profiles for 256 independent neuroblastoma tumor samples were generated, and profiles for 253 samples were selected to form the MAQC-II validation set. Of note, for one patient of the validation set, two different tumor samples were analyzed utilizing both versions of the 2x11K microarray (see below).

All dual-color gene-expression profiles of the MAQC-II training set were generated using a customized 2x11K neuroblastoma-related microarray. Furthermore, 20 patients of the MAQC-II validation set were also profiled utilizing this microarray. Dual-color profiles of the remaining patients of the MAQC-II validation set were generated using a slightly revised version of the 2x11K microarray. This version V2.0 of the array comprised 200 novel oligonucleotide probes, whereas 100 oligonucleotide probes of the original design were removed due to consistently low expression values (near background) observed in the training set profiles. These minor modifications of the microarray design resulted in a total of 9,986 probes present on both versions of the 2x11K microarray. The experimental protocol did not differ between the two sets, and gene-expression profiling was performed as described.

Furthermore, single-color gene-expression profiles were generated for 478/499 neuroblastoma samples of the MAQC-II dual-color training and validation sets (training set 244/246; validation set 234/253). For the remaining 21 samples no single-color data were available, due to either shortage of tumor material of these patients (n=15), poor experimental quality of the generated single-color profiles (n=5), or correlation of one single-color profile to two different dual-color profiles for the one patient profiled with both versions of the 2x11K microarrays (n=1). Single-color gene-expression profiles were generated using customized 4x44K oligonucleotide microarrays produced by Agilent Technologies (Palo Alto, CA, USA). These 4x44K microarrays included all probes represented by Agilent's Whole Human Genome Oligo Microarray and all probes of the version V2.0 of the 2x11K customized microarray that were not present in the former probe set. Labeling and hybridization were performed following the manufacturer's protocol as described.

This SuperSeries is composed of the SubSeries listed below.

Project description:

Background: With the popularity of DNA microarray technology, multiple groups of researchers have studied the gene expression of similar biological conditions. Different methods have been developed to integrate the results from various microarray studies, though most of them rely on distributional assumptions, such as the t-statistic based, mixed-effects model, or Bayesian model methods. However, often the sample size for each individual microarray experiment is small. Therefore, in this paper we present a non-parametric meta-analysis approach for combining data from independent microarray studies, and illustrate its application on two independent Affymetrix GeneChip studies that compared the gene expression of biopsies from kidney transplant recipients with chronic allograft nephropathy (CAN) to those with normal functioning allograft.

Results: The simulation study comparing the non-parametric meta-analysis approach to a commonly used t-statistic based approach shows that the non-parametric approach has better sensitivity and specificity. For the application on the two CAN studies, we identified 309 distinct genes that expressed differently in CAN. By applying Fisher's exact test to identify enriched KEGG pathways among those genes called differentially expressed, we found 6 KEGG pathways to be over-represented among the identified genes. We used the expression measurements of the identified genes as predictors to predict the class labels for 6 additional biopsy samples, and the predicted results all conformed to their pathologist diagnosed class labels.

Conclusion: We present a new approach for combining data from multiple independent microarray studies. This approach is non-parametric and does not rely on any distributional assumptions. The rationale behind the approach is logically intuitive and can be easily understood by researchers not having advanced training in statistics. Some of the identified genes and pathways have been reported to be relevant to renal diseases. Further study on the identified genes and pathways may lead to better understanding of CAN at the molecular level.
