Dataset Information

On the Use of the Pearson Correlation Coefficient for Model Evaluation in Genome-Wide Prediction.

ABSTRACT: The large number of markers in genome-wide prediction demands the use of methods with regularization and model comparison based on some hold-out test prediction error measure. In quantitative genetics, it is common practice to calculate the Pearson correlation coefficient (r2) as a standardized measure of the predictive accuracy of a model. Based on arguments from the bias-variance trade-off theory in statistical learning, we show that shrinkage of the regression coefficients (i.e., QTL effects) reduces the prediction mean squared error (MSE) by introducing model bias compared with the ordinary least squares method. We also show that the LASSO and the adaptive LASSO (ALASSO) can reduce the model bias and prediction MSE by adding model variance. In an application of ridge regression, the LASSO, and the ALASSO to a simulated example based on results for 9,723 SNPs and 3,226 individuals, the LASSO was selected as the best model when r2 was used as the measure. However, when model selection was based on the test MSE and the coefficient of determination R2, the ALASSO proved to be the best method. Hence, use of r2 may lead to selection of the wrong model and therefore also to nonoptimal ranking of phenotype predictions and genomic breeding values. Instead, we propose use of the test MSE for model selection and R2 as a standardized measure of the accuracy.

SUBMITTER: Waldmann P

PROVIDER: S-EPMC6781837 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

On the Use of the Pearson Correlation Coefficient for Model Evaluation in Genome-Wide Prediction.

Waldmann Patrik P

Frontiers in Genetics, 2019-09-26

PMID: 31632436

Similar Datasets

Project description: The revolution in fluorescence microscopy enables sub-diffraction-limit ("superresolution") localization of hundreds or thousands of copies of two differently labeled proteins in the same live cell. In typical experiments, fluorescence from the entire three-dimensional (3D) cell body is projected along the z-axis of the microscope to form a 2D image at the camera plane. For imaging of two different species, here denoted "red" and "green", a significant biological question is the extent to which the red and green spatial distributions are positively correlated, anti-correlated, or uncorrelated. A commonly used statistic for assessing the degree of linear correlation between two image matrices R and G is the Pearson Correlation Coefficient (PCC). PCC should vary from -1 (perfect anti-correlation) to 0 (no linear correlation) to +1 (perfect positive correlation). However, in the special case of spherocylindrical bacterial cells such as E. coli or B. subtilis, we show that the PCC fails both qualitatively and quantitatively. PCC returns the same +1 value for 2D projections of distributions that are either perfectly correlated in 3D or completely uncorrelated in 3D. The PCC also systematically underestimates the degree of anti-correlation between the projections of two perfectly anti-correlated 3D distributions. The problem is that the projection of a random spatial distribution within the 3D spherocylinder is non-random in 2D, whereas PCC compares every matrix element of R or G with the constant mean value [Formula: see text] or [Formula: see text]. We propose a modified Pearson Correlation Coefficient (MPCC) that corrects this problem for spherocylindrical cell geometry by using the proper reference matrix for comparison with R and G. Correct behavior of MPCC is confirmed for a variety of numerical simulations and on experimental distributions of HU and RNA polymerase in live E. coli cells. The MPCC concept should be generalizable to other cell shapes.
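As a rough illustration of the standard PCC described above, the following sketch computes the coefficient for two image matrices by comparing every element with the constant mean of its own matrix. The function name pcc and the random test images are assumptions made for this example. The modified coefficient (MPCC) is not reproduced here, because the description does not specify its reference matrix.

```python
import numpy as np

def pcc(R, G):
    """Standard Pearson correlation between two equally shaped image matrices."""
    d_r = R - R.mean()  # deviation of each pixel from the constant mean of R
    d_g = G - G.mean()  # deviation of each pixel from the constant mean of G
    return float(np.sum(d_r * d_g) / np.sqrt(np.sum(d_r ** 2) * np.sum(d_g ** 2)))

# Hypothetical 32 x 32 "red" and "green" images with uniform random intensities.
rng = np.random.default_rng(1)
R = rng.random((32, 32))
G_independent = rng.random((32, 32))

print(pcc(R, R))              # identical images: perfect positive correlation
print(pcc(R, 1.0 - R))        # inverted image: perfect anti-correlation
print(pcc(R, G_independent))  # independent images: close to zero
```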

Project description: Background: Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. Results: In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering, were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. Conclusion: This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.
