Dataset Information

Validation and characterization of DNA microarray gene expression data distribution and associated moments.

ABSTRACT:

Background

Data from DNA microarrays are increasingly used to understand how different conditions, exposures, or diseases modulate the expression of genes in a biological system. This knowledge is then used to generate molecular mechanistic hypotheses about how an organism responds to those conditions. Several methods have been proposed to analyze these data under different distributional assumptions on gene expression. However, empirical validation of these assumptions is lacking.

Results

Best-fit hypothesis tests, moment-ratio diagrams, and relationships between the different moments of the gene expression distribution were used to characterize the observed distributions. The data were obtained from the publicly available Gene Expression Omnibus (GEO) database and cover varying experimental situations, each providing a relatively large number of samples for hypothesis testing. All data came from one of two microarray platforms: the commercial Affymetrix mouse 430.2 platform and a non-commercial Rosetta/Merck platform. The data from each platform were preprocessed in the same manner.
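As a rough illustration of the moment relationships mentioned above, the following sketch computes per-probe-set moments and fits polynomial models to them. It is not the authors' code: the function names `probe_moments` and `f_statistic`, the synthetic log-expression matrix, and the choice of plain least squares via NumPy are all assumptions made for the example, and real GEO data would have to be loaded and preprocessed separately.

```python
import numpy as np

def probe_moments(log_expr):
    """Per-probe-set mean, coefficient of variation, skewness and kurtosis.

    log_expr: 2-D array, rows = probe sets, columns = samples (log scale).
    Skewness and kurtosis use the plain (biased) central-moment estimators.
    """
    mean = log_expr.mean(axis=1)
    centered = log_expr - mean[:, None]
    m2 = (centered ** 2).mean(axis=1)
    m3 = (centered ** 3).mean(axis=1)
    m4 = (centered ** 4).mean(axis=1)
    cv = np.sqrt(m2) / mean
    return mean, cv, m3 / m2 ** 1.5, m4 / m2 ** 2

def f_statistic(x, y, degree):
    """Least-squares polynomial fit of y on x and its overall F statistic."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    df_model = degree
    df_resid = len(x) - degree - 1
    f_value = ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)
    return coeffs, f_value

# Synthetic stand-in for a log-expression matrix (hypothetical values).
rng = np.random.default_rng(0)
n_probes, n_samples = 500, 60
true_mean = rng.uniform(4.0, 12.0, size=n_probes)
expr = rng.normal(true_mean[:, None], 0.5, size=(n_probes, n_samples))

mean, cv, skew, kurt = probe_moments(expr)

# Linear model: log(CV) as a function of the mean.
lin_coeffs, lin_f = f_statistic(mean, np.log(cv), 1)

# Quadratic model: kurtosis as a function of skewness.
quad_coeffs, quad_f = f_statistic(skew, kurt, 2)
```

Each point (`skew[i]`, `kurt[i]`) is one probe set on a moment-ratio diagram. Converting the F statistics to p-values needs an F distribution with `df_model` and `df_resid` degrees of freedom, which is left out here to keep the sketch dependent on NumPy only.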

Conclusions

The null hypotheses for goodness of fit for all considered univariate theoretical probability distributions (including the Normal distribution) are rejected for more than 50% of probe sets on the Affymetrix microarray platform at a 95% confidence level, suggesting that, under the tested conditions, an a priori assumption of any of these distributions across all probe sets is not valid. The pattern of null hypothesis rejection was different for the data from the Rosetta/Merck platform, with only around 20% of the probe sets failing the logistic distribution goodness-of-fit test. We find statistically significant relationships (at a 95% confidence level, based on the F-test for the fitted linear model) between the mean and the logarithm of the coefficient of variation of the distributions of the logarithm of gene expressions. An additional, novel, statistically significant quadratic relationship between the skewness and kurtosis is identified. In an analysis of the L-moment ratio diagram, data from both microarray platforms fail to match any one of the chosen theoretical probability distributions.
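For readers unfamiliar with L-moment ratio diagrams, the sketch below shows one common way to compute the two coordinates of such a diagram, L-skewness (tau3) and L-kurtosis (tau4), from unbiased sample probability-weighted moments. The function name `l_moment_ratios` and the test samples are assumptions for illustration; the record does not say which estimator the authors used.

```python
import numpy as np

def l_moment_ratios(values):
    """Return (l1, l2, tau3, tau4) for a one-dimensional sample.

    Uses the unbiased probability-weighted moments b0..b3 of the
    sorted sample and the standard linear combinations for l1..l4.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n < 4:
        raise ValueError("need at least 4 observations")
    i = np.arange(n)  # zero-based rank of each order statistic
    b0 = x.mean()
    b1 = np.sum(i * x) / (n * (n - 1))
    b2 = np.sum(i * (i - 1) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum(i * (i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3 / l2, l4 / l2

# A Normal sample should land near the theoretical point
# (tau3, tau4) = (0, 0.1226) on the diagram.
rng = np.random.default_rng(1)
_, _, t3_normal, t4_normal = l_moment_ratios(rng.normal(size=20000))

# Evenly spaced values behave like a uniform sample: (tau3, tau4) = (0, 0).
_, _, t3_even, t4_even = l_moment_ratios(np.arange(10.0))
```

Applied to each probe set in turn, the resulting (tau3, tau4) points can be compared against the known points or curves of candidate distributions, which is the kind of comparison the conclusion above refers to.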

SUBMITTER: Thomas R

PROVIDER: S-EPMC3002903 | biostudies-literature | 2010 Nov

REPOSITORIES: biostudies-literature


Publications

Validation and characterization of DNA microarray gene expression data distribution and associated moments.

Reuben Thomas, Luis de la Torre, Xiaoqing Chang, Sanjay Mehrotra

BMC Bioinformatics, 2010-11-24


PMID: 21092329

Similar Datasets

Project description: Gene microarray technology is an effective tool for investigating the simultaneous activity of multiple cellular pathways across hundreds to thousands of genes. However, because the colossal amounts of data generated by DNA microarray technology are usually complex, noisy, high-dimensional, and often hindered by low statistical power, they are difficult to exploit. To overcome these problems, two kinds of unsupervised analysis methods for microarray data have been developed: principal component analysis (PCA) and independent component analysis (ICA). PCA projects the data into a new space spanned by mutually orthonormal principal components. The constraint of mutual orthogonality and the second-order statistics used within PCA algorithms, however, may not be appropriate for the biological systems studied. Extracting and characterizing the most informative features of the biological signals requires higher-order statistics. ICA is one of the unsupervised algorithms that can extract higher-order statistical structures from data and has been applied to DNA microarray gene expression data analysis. We applied the FastICA method to DNA microarray gene expression data from Alzheimer's disease (AD) hippocampal tissue samples, followed by gene clustering. Experimental results showed that the ICA method can improve the clustering results of AD samples and identify significant genes. More than 50 significant genes with high expression levels in severe AD were extracted, representing immunity-related protein, metal-related protein, membrane protein, lipoprotein, neuropeptide, cytoskeleton protein, cellular binding protein, and ribosomal protein. Within the aforementioned categories, our method also found 37 significant genes with low expression levels. Moreover, it is worth noting that some oncogenes and phosphorylation-related proteins are expressed at low levels. In comparison to the PCA and support vector machine recursive feature elimination (SVM-RFE) methods, which are widely used in microarray data analysis, ICA can identify more AD-related genes. Furthermore, we have validated and identified many genes that are associated with AD pathogenesis. We demonstrated that ICA exploits higher-order statistics to identify gene expression profiles as linear combinations of elementary expression patterns that lead to the construction of potential AD-related pathogenic pathways. Our computing results also validated that the ICA model outperformed PCA and the SVM-RFE method. This report shows that ICA as a microarray data analysis tool can help us to elucidate the molecular taxonomy of AD and other multifactorial and polygenic complex diseases.

Project description: Background: Rice is one of the major crop species in the world, helping to sustain the diet of approximately half of the global population, especially in Asia. However, due to the impact of extreme climate change and global warming, rice crop production and yields may be adversely affected, resulting in a world food crisis. Researchers have been keen to understand the effects of drought, temperature, and other environmental stress factors on rice plant growth and development. Gene expression microarray technology represents a key strategy for the identification of genes and their associated expression patterns in response to stress. Here, we report on the development of the rice OneArray® microarray platform, which is suitable for two major rice subspecies, japonica and indica. Results: The rice OneArray® 60-mer oligonucleotide microarray consists of a total of 21,179 probes covering 20,806 genes of japonica and 13,683 genes of indica. In a validation study, total RNA isolated from rice shoots and roots was used to compare gene expression profiles via microarray examination. The results were submitted to NCBI's Gene Expression Omnibus (GEO). Data can be found under the GEO accession number GSE50844 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50844). A list of significantly differentially expressed genes was generated; 438 shoot-specific genes were identified among 3,138 up-regulated genes, and 463 root-specific genes were found among 3,845 down-regulated genes. GO enrichment analysis demonstrates that these results are in agreement with the known physiological processes of the different organs/tissues. Furthermore, qRT-PCR validation was performed on 66 genes, and the results correlated significantly with the microarray results (R = 0.95, p < 0.001***). Conclusion: The rice OneArray® 22 K microarray, the first rice microarray covering both japonica and indica subspecies, was designed and validated in a comprehensive study of gene expression in rice tissues. The rice OneArray® microarray platform revealed high specificity and sensitivity. Additional information for the rice OneArray® microarray can be found at http://www.phalanx.com.tw/index.php.

Project description: BACKGROUND: An important use of data obtained from microarray measurements is the classification of tumor types with respect to genes that are either up or down regulated in specific cancer types. A number of algorithms have been proposed to obtain such classifications. These algorithms usually require parameter optimization to obtain accurate results, depending on the type of data. Additionally, it is critical to find an optimal set of markers among those up or down regulated genes that can be clinically utilized to build assays for the diagnosis or to follow the progression of specific cancer types. In this paper, we employ a mixed integer programming based classification algorithm named the hyper-box enclosure method (HBE) for the classification of some cancer types with a minimal set of predictor genes. This optimization-based method, which is a user-friendly and efficient classifier, may allow clinicians to diagnose and follow the progression of certain cancer types. METHODOLOGY/PRINCIPAL FINDINGS: We apply the HBE algorithm to some well-known data sets, such as leukemia, prostate cancer, diffuse large B-cell lymphoma (DLBCL), and small round blue cell tumors (SRBCT), to find predictor genes that can be utilized for diagnosis and prognosis in a robust manner with high accuracy. Our approach does not require any modification or parameter optimization for each data set. Additionally, information gain attribute evaluator, relief attribute evaluator, and correlation-based feature selection methods are employed for the gene selection. The results are compared with those from other studies, and the biological roles of the selected genes in the corresponding cancer type are described. CONCLUSIONS/SIGNIFICANCE: The overall performance of our algorithm was better than that of the other algorithms reported in the literature and the classifiers found in the WEKA data-mining package. Since it does not require parameter optimization and consistently achieves a very high prediction rate on different types of data sets, the HBE method is an effective and consistent tool for cancer type prediction with a small number of gene markers.
