ABSTRACT: "The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models. The second phase of the MicroArray Quality Control (MAQC-II) project evaluated common practices for developing and validating microarray-based models aimed at predicting toxicological and clinical endpoints. The purposes of the MAQC-II project were to survey approaches in genomic model development in an attempt to understand sources of variability in prediction performance, and to assess the influence of endpoint signal strength in the data. Thirty-six teams developed classifiers for 13 diverse endpoints (some easy, some difficult to predict) from six relatively large training data sets: three preclinical (toxicogenomics) and three clinical. By providing the same data sets to many organizations for analysis, but not restricting their data analysis protocols (DAPs), the project made it possible to evaluate to what extent, if any, results depend on the team that performs the analysis. These analyses collectively produced >18,000 models that were challenged by independent and blinded validation sets generated for MAQC-II. The cross-validated performance estimates for models developed under good practices are predictive of the blinded validation performance. The achievable prediction performance is largely determined by the intrinsic predictability of the endpoint, and simple data analysis methods often perform as well as more complicated approaches. Multiple models of comparable performance can be developed for a given endpoint, and the stability of gene lists correlates with endpoint predictability. Importantly, similar conclusions were reached when >12,000 new models were generated by swapping the original training and validation sets.
Description of six data sets including 13 prediction endpoints: (Summarized in GSE16716_MAQC-II_Datasets_Overview.pdf attached as supplementary file. 
For more details, see the MAQC-II main paper and its references for the individual data sets.)
The MAQC-II predictive modeling was limited to binary classification problems; therefore, continuous endpoint values such as overall survival (OS) and event-free survival (EFS) times were dichotomized using a "milestone" cutoff applied to the censored data. Prediction endpoints were chosen to span a wide range of prediction difficulty. Two endpoints, H (CPS1) and L (NEP_S), representing the gender of the patients, were used as positive control endpoints, since they are easily predictable by microarrays. Two other endpoints, I (CPS1) and M (NEP_R), representing randomly assigned class labels, were designed to serve as negative control endpoints, since they are not supposed to be predictable. Data analysis teams were not aware of the characteristics of endpoints H, I, L, and M until their swap prediction results had been submitted. If a data analysis protocol fails to yield models that accurately predict endpoints H and L, or claims to yield models that accurately predict endpoints I and M, something must have gone wrong.
The Hamner data set (endpoint A) was provided by The Hamner Institutes for Health Sciences (Research Triangle Park, NC, USA). The study objective was to apply microarray gene expression data from the lungs of female B6C3F1 mice exposed to a 13-week treatment of chemicals to predict increased lung tumor incidence in the 2-year rodent cancer bioassays of the National Toxicology Program. If successful, the results may form the basis of a more efficient and economical approach for evaluating the carcinogenic activity of chemicals. Microarray analysis was performed using Affymetrix Mouse Genome 430 2.0 arrays on three to four mice per treatment group, and a total of 70 mice were analyzed and used as the MAQC-II training set. Additional data from another set of 88 mice were collected later and provided as the MAQC-II external validation set.
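The control-endpoint rule stated above (endpoints H and L should be easy to predict; I and M should not be predictable) lends itself to a simple automated check. The following is a minimal sketch only: the use of the Matthews correlation coefficient (MCC) as the score, the function names `mcc` and `check_controls`, and the thresholds `high` and `low` are assumptions made for illustration and are not taken from this description.

```python
import math

POSITIVE_CONTROLS = ("H", "L")  # patient gender: expected to be easy to predict
NEGATIVE_CONTROLS = ("I", "M")  # random labels: expected to be unpredictable


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def check_controls(scores, high=0.8, low=0.2):
    """Return a list of warnings for a dict mapping endpoint code -> MCC.

    `high` and `low` are illustrative thresholds (assumptions), not
    values defined by MAQC-II.
    """
    problems = []
    for ep in POSITIVE_CONTROLS:
        if ep in scores and scores[ep] < high:
            problems.append(
                f"positive control {ep} poorly predicted (MCC={scores[ep]:.2f})"
            )
    for ep in NEGATIVE_CONTROLS:
        if ep in scores and scores[ep] > low:
            problems.append(
                f"negative control {ep} appears predictable (MCC={scores[ep]:.2f})"
            )
    return problems
```

A protocol whose validation scores produce a non-empty list from `check_controls` would deserve a closer look, for example for information leaking from the validation set into feature selection.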
The Iconix data set (endpoint B) was provided by Iconix Biosciences, Inc. (Mountain View, CA, USA). The study objective was to assess, upon short-term exposure, hepatic tumor induction by non-genotoxic chemicals, since there are currently no accurate and well-validated short-term tests to identify non-genotoxic hepatic tumorigens, thus necessitating an expensive 2-year rodent bioassay before a risk assessment can begin. The training set consists of hepatic gene expression data from 216 male Sprague-Dawley rats treated for 5 days with one of 76 structurally and mechanistically diverse non-genotoxic hepatocarcinogens and non-hepatocarcinogens. The validation set consists of 201 male Sprague-Dawley rats treated for 5 days with one of 68 structurally and mechanistically diverse non-genotoxic hepatocarcinogens and non-hepatocarcinogens. Gene expression data were generated using the Amersham Codelink Uniset Rat 1 Bioarray (GE Healthcare, Piscataway, NJ). The separation of the training set and validation set was based on the time when the microarray data were collected; i.e., microarrays processed earlier in the study were used as training and those processed later were used as validation.
The NIEHS data set (endpoint C) was provided by the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (Research Triangle Park, NC, USA). The study objective was to use microarray gene expression data acquired from the liver of rats exposed to hepatotoxicants to build classifiers for prediction of liver necrosis. The gene expression "compendium" data set was collected from 418 rats exposed to one of eight compounds (1,2-dichlorobenzene, 1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine, thioacetamide, galactosamine, and diquat dibromide). All eight compounds were studied using standardized procedures, i.e.,
a common array platform (Affymetrix Rat 230 2.0 microarray), experimental procedures, and data retrieval and analysis processes. Briefly, for each compound, four to six male, 12-week-old F344 rats were exposed to a low dose, mid dose(s), and a high dose of the toxicant and sacrificed 6, 24, or 48 hours later. At necropsy, the liver was harvested for RNA extraction, histopathology, and clinical chemistry assessments.
The human breast cancer (BR) data set (endpoints D and E) was contributed by the University of Texas M. D. Anderson Cancer Center (MDACC, Houston, TX, USA). Gene expression data from 230 stage I-III breast cancers were generated from fine needle aspiration specimens of newly diagnosed breast cancers before any therapy. The biopsy specimens were collected sequentially during a prospective pharmacogenomic marker discovery study between 2000 and 2008. These specimens represent 70-90% pure neoplastic cells with minimal stromal contamination. Patients received 6 months of preoperative (neoadjuvant) chemotherapy including paclitaxel, 5-fluorouracil, cyclophosphamide, and doxorubicin, followed by surgical resection of the cancer. Response to preoperative chemotherapy was categorized as a pathological complete response (pCR = no residual invasive cancer in the breast or lymph nodes) or residual invasive cancer (RD), and used as endpoint D for prediction. Endpoint E is the clinical estrogen-receptor status as established by immunohistochemistry. RNA extraction and gene expression profiling were performed in multiple batches over time using Affymetrix U133A microarrays. Genomic analysis of a subset of this sequentially accrued patient population was reported previously. For each endpoint, the first 130 cases were used as a training set and the next 100 cases were used as an independent validation set.
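The sequential split described for the breast cancer data set (first 130 cases for training, next 100 for validation) can be sketched as follows. This is an illustration only: the function name `sequential_split`, the record layout, and the `accrual_order` field are hypothetical and not part of the deposited data.

```python
def sequential_split(cases, n_train=130, n_valid=100, order_key="accrual_order"):
    """Split cases by accrual order: earliest cases train, later cases validate.

    `cases` is a list of dicts; `order_key` names a hypothetical field that
    records the order in which specimens were collected.
    """
    ordered = sorted(cases, key=lambda case: case[order_key])
    if len(ordered) < n_train + n_valid:
        raise ValueError("not enough cases for the requested split")
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    return train, valid


# Usage with 230 placeholder cases, deliberately supplied in reverse order.
cases = [{"case_id": f"BR{i:03d}", "accrual_order": i} for i in range(230, 0, -1)]
train, valid = sequential_split(cases)
```

Splitting by collection order rather than at random means every validation case was accrued after all training cases, so batch effects that build up over the years of profiling affect the validation estimate as they would in later use.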
The multiple myeloma (MM) data set (endpoints F, G, H, and I) was contributed by the Myeloma Institute for Research and Therapy at the University of Arkansas for Medical Sciences (UAMS, Little Rock, AR, USA). Gene expression profiling of highly purified bone marrow plasma cells was performed in newly diagnosed patients with MM. The training set consisted of 340 cases enrolled in Total Therapy 2 (TT2) and the validation set comprised 214 patients enrolled in Total Therapy 3 (TT3). Plasma cells were enriched by anti-CD138 immunomagnetic bead selection of mononuclear cell fractions of bone marrow aspirates in a central laboratory. All samples applied to the microarray contained more than 85% plasma cells as determined by 2-color flow cytometry (CD38+ and CD45-/dim) performed after selection. Dichotomized overall survival (OS) and event-free survival (EFS) were determined based on a two-year milestone cutoff. A gene expression model of high-risk multiple myeloma was developed and validated by the data provider and later validated in three additional independent data sets.
The neuroblastoma (NB) data set (endpoints J, K, L, and M) was contributed by the Children's Hospital of the University of Cologne, Germany. Tumor samples were checked by a pathologist prior to RNA isolation; only samples with ≥60% tumor content were utilized, and total RNA was isolated from ~50 mg of snap-frozen neuroblastoma tissue obtained before chemotherapeutic treatment. First, 502 pre-existing 11K Agilent dye-flipped, dual-color replicate profiles for 251 patients were provided. Of these, profiles of 246 neuroblastoma samples passed an independent MAQC-II quality assessment by majority decision and formed the MAQC-II training data set. Subsequently, 514 dye-flipped, dual-color 11K replicate profiles for 256 independent neuroblastoma tumor samples were generated, and profiles for 253 samples were selected to form the MAQC-II validation set.
Of note, for one patient of the validation set, two different tumor samples were analyzed utilizing both versions of the 2x11K microarray (see below). All dual-color gene-expression profiles of the MAQC-II training set were generated using a customized 2x11K neuroblastoma-related microarray. Furthermore, 20 patients of the MAQC-II validation set were also profiled utilizing this microarray. Dual-color profiles of the remaining patients of the MAQC-II validation set were generated using a slightly revised version of the 2x11K microarray. This version V2.0 of the array comprised 200 novel oligonucleotide probes, whereas 100 oligonucleotide probes of the original design were removed due to consistently low expression values (near background) observed in the training set profiles. These minor modifications of the microarray design resulted in a total of 9,986 probes present on both versions of the 2x11K microarray. The experimental protocol did not differ between the two sets, and gene-expression profiles were generated as described. Furthermore, single-color gene-expression profiles were generated for 478/499 neuroblastoma samples of the MAQC-II dual-color training and validation sets (training set 244/246; validation set 234/253). For the remaining 21 samples no single-color data were available, due to either shortage of tumor material of these patients (n=15), poor experimental quality of the generated single-color profiles (n=5), or correlation of one single-color profile to two different dual-color profiles for the one patient profiled with both versions of the 2x11K microarrays (n=1). Single-color gene-expression profiles were generated using customized 4x44K oligonucleotide microarrays produced by Agilent Technologies (Palo Alto, CA, USA). These 4x44K microarrays included all probes represented by Agilent's Whole Human Genome Oligo Microarray and all probes of the version V2.0 of the 2x11K customized microarray that were not present in the former probe set.
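The sample and probe counts quoted above for the neuroblastoma data set can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the per-version probe totals are an inference that assumes the original design equals the shared probes plus the 100 removed ones, and V2.0 equals the shared probes plus the 200 novel ones. The variable names are hypothetical.

```python
# Sample counts quoted in the description.
dual_color = {"training": 246, "validation": 253}
single_color = {"training": 244, "validation": 234}
missing_reasons = {
    "shortage of tumor material": 15,
    "poor single-color profile quality": 5,
    "one profile matching two dual-color profiles": 1,
}

total_dual = sum(dual_color.values())        # 246 + 253
total_single = sum(single_color.values())    # 244 + 234
missing_total = total_dual - total_single    # samples without single-color data
missing_by_set = {k: dual_color[k] - single_color[k] for k in dual_color}

# Probe counts: 9,986 shared probes, 100 removed from and 200 added to the
# original design (inferred totals, under the assumption stated above).
shared_probes = 9986
removed_probes = 100
novel_probes = 200
probes_original = shared_probes + removed_probes
probes_v2 = shared_probes + novel_probes

print(total_dual, total_single, missing_total, missing_by_set)
print(probes_original, probes_v2)
```

Under these assumptions the figures are mutually consistent: the 21 samples without single-color data match the three listed reasons, and both inferred probe totals are on the order of the nominal 11K array size.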
Labeling and hybridization were performed following the manufacturer's protocol as described. This SuperSeries is composed of the following subset Series: GSE20194: MAQC-II Project: human breast cancer (BR) data set. Additional subset Series will be submitted later. Refer to individual Series."