Dataset Information

Construction of a modular analysis framework for blood Genomics Studies

ABSTRACT: We designed a strategy for microarray analysis that is based on the identification of transcriptional modules formed by genes coordinately expressed in multiple disease data sets. Mapping changes in gene expression at the module level generated disease-specific transcriptional fingerprints that provide a stable framework for the visualization and functional interpretation of microarray data. The first step of the module-construction process analyzed expression patterns of transcripts across samples for individual diseases: sets of coordinately expressed transcripts were identified with an unsupervised clustering algorithm, in this case the GeneSpring Version 7.1 (Agilent) implementation of the K-Means algorithm (k = 30). All transcripts detected in at least one sample were used as input; no screening for differential expression was performed. The second step of the module-construction process analyzed the “clustering behavior” of transcripts across diseases, taking into account the possibility that genes may cocluster in some diseases but not in others. In our example, the transcripts that clustered together across all eight diseases were grouped to form a set of modules (round 1 of selection), and the stringency of the analysis was then decreased gradually to identify transcripts that belong to a similar K-means cluster in only a subset of diseases (round 2: seven out of eight diseases; round 3: six out of eight diseases). The module-selection process is “data-driven” and does not involve manual selection of genes by the investigator.
We implemented the module-construction strategy described above, using as input a total of 239 peripheral-blood mononuclear cell (PBMC) samples obtained from individuals with one of the following conditions: systemic juvenile idiopathic arthritis (n = 47), systemic lupus erythematosus (n = 40), type I diabetes (n = 20), metastatic melanoma (n = 39), acute infections (Escherichia coli [n = 22], Staphylococcus aureus [n = 18], Influenza A [n = 16]) or liver-transplant recipients undergoing immunosuppressive therapy (n = 37). Transcriptional profiles were generated with Affymetrix U133A and U133B GeneChips (> 44,000 probe sets). A total of 4742 transcripts, distributed among 28 sets, were selected after running the module-construction algorithm described above. Each module is assigned a unique identifier indicating the round and order of selection (e.g., M3.1 is the first module identified in the third round of selection). The stringency of this algorithm was tested statistically by applying the same module-construction procedure after randomization of the original data set. This process was repeated 200 times, without a single module being identified. Therefore, the analysis of gene-cluster membership across multiple diseases provided a stringent means to identify PBMC transcriptional modules.
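
The first round of selection described above can be illustrated with a minimal sketch. It assumes each transcript already carries one K-means cluster label per disease and groups transcripts whose labels match in every disease. The function name `round_one_modules`, the `min_size` parameter, and the largest-first ordering are hypothetical choices made for illustration; the record does not state them, and later rounds (seven or six diseases out of eight) are not covered.

```python
from collections import defaultdict


def round_one_modules(labels, min_size=2):
    """Group transcripts that share a cluster label in every disease.

    labels: dict mapping transcript id -> tuple of cluster labels,
            one label per disease (toy data, not the real study).
    min_size: hypothetical minimum module size (not given in the source).
    """
    groups = defaultdict(list)
    for transcript, signature in labels.items():
        # Transcripts with an identical label tuple coclustered everywhere.
        groups[tuple(signature)].append(transcript)

    modules = [sorted(m) for m in groups.values() if len(m) >= min_size]
    # Assumed ordering: largest module first, ties broken alphabetically.
    modules.sort(key=lambda m: (-len(m), m))
    return {f"M1.{i}": m for i, m in enumerate(modules, start=1)}


# Toy input with three diseases instead of eight.
toy = {
    "geneA": (3, 7, 1),
    "geneB": (3, 7, 1),
    "geneC": (3, 7, 2),
    "geneD": (5, 0, 4),
    "geneE": (5, 0, 4),
    "geneF": (3, 7, 1),
}
modules = round_one_modules(toy)
print(modules)
```

In this toy input, geneC differs from geneA in one disease only, so it would be a candidate for a later, less stringent round rather than for round 1.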

ORGANISM(S): Homo sapiens

SUBMITTER: Damien Chaussabel

PROVIDER: E-GEOD-11908 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress

Publications

A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus.

Damien Chaussabel, Charles Quinn, Jing Shen, Pinakeen Patel, Casey Glaser, Nicole Baldwin, Dorothee Stichweh, Derek Blankenship, Lei Li, Indira Munagala, Lynda Bennett, Florence Allantaz, Asuncion Mejias, Monica Ardura, Ellen Kaizer, Laurence Monnet, Windy Allman, Henry Randall, Diane Johnson, Aimee Lanier, Marilynn Punaro, Knut M Wittkowski, Perrin White, Joseph Fay, Goran Klintmalm, Octavio Ramilo, A Karolina Palucka, Jacques Banchereau, Virginia Pascual

Immunity, 2008-07-01, issue 1

The analysis of patient blood transcriptional profiles offers a means to investigate the immunological mechanisms relevant to human diseases on a genome-wide scale. In addition, such studies provide a basis for the discovery of clinically relevant biomarker signatures. We designed a strategy for microarray analysis that is based on the identification of transcriptional modules formed by genes coordinately expressed in multiple disease data sets. Mapping changes in gene expression at the module l...

PMID: 18631455

Similar Datasets

Project description: The analysis of patient blood transcriptional profiles offers a means to investigate the immunological mechanisms relevant to human diseases on a genome-wide scale. In addition, such studies provide a basis for the discovery of clinically relevant biomarker signatures. We designed a strategy for microarray analysis that is based on the identification of transcriptional modules formed by genes coordinately expressed in multiple disease data sets. Mapping changes in gene expression at the module level generated disease-specific transcriptional fingerprints that provide a stable framework for the visualization and functional interpretation of microarray data. These transcriptional modules were used as a basis for the selection of biomarkers and the development of a multivariate transcriptional indicator of disease progression in patients with systemic lupus erythematosus. Thus, this work describes the implementation and application of a methodology designed to support systems-scale analysis of the human immune system in translational research settings. Experiment Overall Design: Experiment subseries GSE11908 groups the profiles that were used to construct the modular transcriptional framework. The first step of the module-construction process analyzed expression patterns of transcripts across samples for individual diseases: sets of coordinately expressed transcripts were identified with an unsupervised clustering algorithm, in this case the GeneSpring Version 7.1 (Agilent) implementation of the K-Means algorithm (k = 30). All transcripts detected in at least one sample were used as input; no screening for differential expression was performed. The second step of the module-construction process analyzed the “clustering behavior” of transcripts across diseases, taking into account the possibility that genes may cocluster in some diseases but not in others.
In our example, the transcripts that clustered together across all eight diseases were grouped to form a set of modules (round 1 of selection), and the stringency of the analysis was then decreased gradually to identify transcripts that belong to a similar K-means cluster in only a subset of diseases (round 2: seven out of eight diseases; round 3: six out of eight diseases). The module-selection process is “data-driven” and does not involve manual selection of genes by the investigator. We implemented the module-construction strategy described above, using as input a total of 239 peripheral-blood mononuclear cell (PBMC) samples obtained from individuals with one of the following conditions: systemic juvenile idiopathic arthritis (n = 47), systemic lupus erythematosus (n = 40), type I diabetes (n = 20), metastatic melanoma (n = 39), acute infections (Escherichia coli [n = 22], Staphylococcus aureus [n = 18], Influenza A [n = 16]) or liver-transplant recipients undergoing immunosuppressive therapy (n = 37). Transcriptional profiles were generated with Affymetrix U133A and U133B GeneChips (> 44,000 probe sets). A total of 4742 transcripts, distributed among 28 sets, were selected after running the module-construction algorithm described above. Each module is assigned a unique identifier indicating the round and order of selection (e.g., M3.1 is the first module identified in the third round of selection). The stringency of this algorithm was tested statistically by applying the same module-construction procedure after randomization of the original data set. This process was repeated 200 times, without a single module being identified. Therefore, the analysis of gene-cluster membership across multiple diseases provided a stringent means to identify PBMC transcriptional modules.
Experiment subseries GSE11909 groups the profiles that were used to identify and validate biomarkers of SLE disease activity. The proposed biomarker-selection strategy relies on modules for reducing highly dimensional microarray data sets in a stepwise manner. Starting from the full set of 28 modules, only those in which a set minimum proportion of transcripts is significantly changed between the study groups are selected (e.g., at least 15% of transcripts overexpressed or underexpressed at p < 0.05; in the example given, 11 SLE modules meet this criterion). This eliminates from the selection pool the modules registering fewer consistent changes that could be attributed to noise. Transcriptional vectors were derived for the entire cohort of 22 untreated pediatric SLE patients with the use of this set of 11 SLE modules. Patient profiles were also generated for an independent set of 31 children with SLE treated with steroids and/or cytotoxic drugs and/or hydroxychloroquine. A nonparametric method for analyzing multivariate ordinal data was used to score the patients. Lupus disease flares can lead to irreversible worsening of the patient's status. We tested the relevance of this multivariate transcriptional score for longitudinal monitoring of disease activity in a cohort of 20 pediatric SLE patients (two to four time points per patient, with intervals between time points ranging from 1 to 18 months). Half of the patients had been included in our cross-sectional analysis before they were enrolled in this longitudinal study. Parallel trends were observed between multivariate transcriptional scores and a clinical severity score. The positive association was verified statistically with the use of a linear-regression model.
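
The module-filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes per-transcript p-values and signed log fold changes are already available, and it reads the 15% criterion as applying separately to overexpressed and underexpressed transcripts. The function name `select_modules`, the input layout, and the toy values are hypothetical.

```python
def select_modules(module_stats, alpha=0.05, min_fraction=0.15):
    """Keep modules with enough significantly changed transcripts.

    module_stats: dict mapping module id -> list of (p_value, log_fold_change)
                  pairs, one pair per transcript (hypothetical layout).
    A module is kept when at least `min_fraction` of its transcripts are
    significantly overexpressed, or at least `min_fraction` are significantly
    underexpressed (assumed reading of the criterion).
    """
    selected = []
    for module_id, stats in module_stats.items():
        if not stats:
            continue
        n = len(stats)
        over = sum(1 for p, lfc in stats if p < alpha and lfc > 0) / n
        under = sum(1 for p, lfc in stats if p < alpha and lfc < 0) / n
        if over >= min_fraction or under >= min_fraction:
            selected.append(module_id)
    return selected


# Toy statistics, invented for illustration only.
toy_stats = {
    "M1.1": [(0.01, 1.2), (0.20, 0.1), (0.03, 0.8), (0.50, -0.2)],
    "M2.3": [(0.04, -1.0)] + [(0.60, 0.0)] * 9,
    "M3.1": [(0.001, -0.5), (0.02, -0.7), (0.9, 0.1), (0.8, 0.3), (0.7, 0.0)],
}
kept = select_modules(toy_stats)
print(kept)
```

Here M2.3 is dropped because only 1 of its 10 transcripts is significantly changed, which falls below the 15% cutoff.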

Project description: This dataset was used to establish whole blood transcriptional modules (n=260) that represent groups of coordinately expressed transcripts that exhibit altered abundance within individual datasets or across multiple datasets. This modular framework was generated to reduce the dimensionality of whole blood microarray data processed on the Illumina Beadchip platform yielding data-driven transcriptional modules with biologic meaning. This series combines nine independent datasets representing a spectrum of human pathologies expected to result in changes in gene abundance related to changes in expression or cellular composition of whole blood. These nine datasets are composed of 410 individual whole blood profiles generated from patients with HIV, tuberculosis, sepsis, systemic lupus erythematosus, systemic arthritis, B-cell deficiency and liver transplant. For each dataset healthy controls are also included. Each dataset’s expression data was preprocessed independently. First, probes were discarded if they were not present in at least ten percent of the dataset’s samples. Then, the sample data for each dataset was normalized using the BeadStudio average normalization algorithm. Once normalized, the signal was scaled such that all signals less than ten were set to ten. The signal median of all of the dataset’s samples was calculated for each probe. Probes were discarded if no sample had a difference in signal from the median that was greater than or equal to thirty, or if no sample had a fold change relative to the median that was either greater than or equal to 1.5, or less than or equal to 0.67. Finally, data was transformed to the log2 of the signal divided by the mean. Each of the preprocessed datasets was clustered in parallel using Euclidean distance and Hartigan’s K-Means clustering algorithm, a hybrid of hierarchical and K-Means clustering algorithms.
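
The per-probe filtering and transformation described above can be sketched for a single probe. This is a minimal illustration, assuming normalized signals are already in hand; the presence filter and BeadStudio normalization are not reproduced. It also assumes that the "difference from the median" is an absolute difference and that "the mean" is the probe's mean across samples, neither of which the description states explicitly. The function name `preprocess_probe` is hypothetical.

```python
import math
from statistics import mean, median


def preprocess_probe(signals, floor=10.0, min_diff=30.0, fc_up=1.5, fc_down=0.67):
    """Filter and log-transform one probe's normalized signals.

    Returns None when the probe is discarded, otherwise a list of
    log2(signal / mean) values, one per sample.
    """
    # Signals below the floor are set to the floor.
    floored = [max(s, floor) for s in signals]
    med = median(floored)

    # Keep the probe only if some sample deviates enough from the median,
    # both in absolute terms (assumed) and as a fold change.
    has_diff = any(abs(s - med) >= min_diff for s in floored)
    has_fold = any(s / med >= fc_up or s / med <= fc_down for s in floored)
    if not (has_diff and has_fold):
        return None

    # Assumed: the mean is taken over this probe's samples.
    avg = mean(floored)
    return [math.log2(s / avg) for s in floored]


flat = preprocess_probe([100, 105, 98, 102])    # barely varies
variable = preprocess_probe([5, 20, 200, 40])   # one strongly elevated sample
print(flat, variable)
```

The flat probe is discarded because no sample differs from its median by thirty or more, while the variable probe passes both filters.
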
The number of clusters (k) was set to thirty, chosen to provide significant power during later module extraction steps. A higher value could have been chosen for k, but was not in order to minimize possibly arbitrary cluster splitting. Taking the nine sets of thirty clusters as input, we constructed a weighted co-cluster graph, a probe by probe matrix where the value of each cell (the weight) is set to the number of times probe_i and probe_j are found in the same cluster. In this instance, the values range from zero to nine, inclusive. At this point, the goal is to extract sets of probes that are most frequently clustered together, proceeding from the most stringent requirements to the least. To accomplish this, we employ an iterative algorithm. To begin, the maximum clique threshold is initialized to the number of input cluster sets, the paraclique threshold is calculated, and a minimum seed size is chosen (we used ten). The outer loop begins by creating an unweighted graph through application of the maximum clique threshold to the weighted co-cluster graph such that a probe pair, or edge, is represented in the unweighted graph if and only if the corresponding weight in the co-cluster graph equals or exceeds this threshold. We then begin the inner loop. The first step is to isolate the largest set of probes such that all pairs of probes in the set are completely connected in the unweighted graph - that is, there is no pair of probes in the set where the weight from the initial graph is smaller than the maximum clique threshold. In graph theoretic terms, the probes form a maximum clique. If the size of the probe set is smaller than the minimum seed size, we escape from the inner loop, reduce the threshold by one, and return to the beginning of the outer loop. Otherwise, the probe set is at least as large as the minimum seed size and it becomes the seed for a module.
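
The weighted co-cluster graph and its thresholding can be sketched as below. This is a small illustration under assumptions: it stores weights sparsely in a `Counter` keyed by probe pairs rather than as a full probe by probe matrix, and it stops before the maximum clique search. The function names `cocluster_weights` and `threshold_graph` and the toy cluster assignments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocluster_weights(cluster_sets):
    """Count, for each probe pair, how many datasets cluster them together.

    cluster_sets: list with one dict per dataset, mapping probe -> cluster label.
    Returns a Counter keyed by alphabetically ordered (probe_i, probe_j) tuples.
    """
    weights = Counter()
    for assignment in cluster_sets:
        by_cluster = {}
        for probe, label in assignment.items():
            by_cluster.setdefault(label, []).append(probe)
        for members in by_cluster.values():
            for a, b in combinations(sorted(members), 2):
                weights[(a, b)] += 1
    return weights


def threshold_graph(weights, threshold):
    """Unweighted edge set: a pair is kept iff its weight >= threshold."""
    return {pair for pair, w in weights.items() if w >= threshold}


# Three toy datasets instead of nine.
toy_sets = [
    {"p1": 0, "p2": 0, "p3": 0, "p4": 1},
    {"p1": 2, "p2": 2, "p3": 1, "p4": 1},
    {"p1": 5, "p2": 5, "p3": 5, "p4": 5},
]
weights = cocluster_weights(toy_sets)
strict = threshold_graph(weights, 3)   # together in every dataset
relaxed = threshold_graph(weights, 2)  # together in at least two datasets
print(strict, relaxed)
```

Lowering the threshold from 3 to 2 adds edges, which mirrors how each pass of the outer loop admits less consistently coclustered probe pairs.
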
To allow for the inevitable clustering inaccuracies, we then employ the paraclique algorithm, revisiting the co-cluster graph and adding to the seed any probe that is found to cluster with at least eighty-five percent of the seed’s members a number of times equal to or exceeding the paraclique threshold. This final probe set is a module. It is removed from both graphs and named in accordance with the iterations in which it was found (e.g., a module extracted in the first iteration of the outer loop and the second iteration of the inner loop is designated M1.2). The inner loop then begins again with the reduced graphs. Those modules with conserved expression across diseases (formed by transcripts that cluster together for all nine datasets) were selected in early rounds whereas modules with greater disease specificity (formed by transcripts that cluster together only in a subset of the nine datasets) were selected in later rounds.
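
The seed-extension step can be sketched as a single pass over candidate probes. This is an illustration under assumptions: each candidate is compared with the original seed only, the paraclique threshold is passed in as a parameter because the description does not say how it is calculated, and the function name `extend_seed` and the toy weights are hypothetical.

```python
def extend_seed(seed, candidates, weights, paraclique_threshold, min_fraction=0.85):
    """Add to the seed every candidate that coclusters with enough seed members.

    weights: dict keyed by alphabetically ordered (probe_i, probe_j) tuples,
             giving the number of datasets in which the pair shares a cluster.
    A candidate is added when its weight reaches `paraclique_threshold`
    with at least `min_fraction` of the seed's members.
    """
    seed = set(seed)
    added = set()
    for probe in candidates:
        if probe in seed:
            continue
        hits = sum(
            1
            for member in seed
            if weights.get(tuple(sorted((probe, member))), 0) >= paraclique_threshold
        )
        if hits >= min_fraction * len(seed):
            added.add(probe)
    return seed | added


# Toy seed of ten probes; "x" coclusters with nine of them, "y" with eight.
seed = [f"s{i}" for i in range(10)]
toy_weights = {}
for i in range(9):
    toy_weights[tuple(sorted(("x", f"s{i}")))] = 8
for i in range(8):
    toy_weights[tuple(sorted(("y", f"s{i}")))] = 8

module = extend_seed(seed, ["x", "y"], toy_weights, paraclique_threshold=8)
print(sorted(module))
```

With a seed of ten, the 85% rule requires at least nine qualifying seed members, so "x" joins the module and "y" does not.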
