Dataset Information

Imputation of Truncated p-Values For Meta-Analysis Methods and Its Genomic Application.

ABSTRACT: Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous number of expression profiles have been generated and stored in the public domain, and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular as a way to obtain increased statistical power and validated findings. Methods that aggregate transformed p-value evidence have been widely used in genomic settings, among which Fisher's and Stouffer's methods are the most popular. In practice, raw data and p-values of DE evidence are often not available in the genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain p-value threshold (e.g., DE genes with p-value < 0.001) are reported in journal publications. The truncated p-value information makes the aforementioned meta-analysis methods inapplicable, and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored p-values. We developed and compared three imputation methods (mean imputation, single random imputation and multiple imputation) for a general class of evidence aggregation methods of which Fisher's and Stouffer's methods are special examples. The null distribution of each method was analytically derived, and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.
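
The idea of combining exact and truncated p-values with Fisher's method can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "mean imputation" means replacing a truncated study's Fisher score, -2 log p, by its conditional expectation under the null (p uniform on the interval the threshold leaves open), and it approximates the null distribution by Monte Carlo simulation rather than by the analytical derivation the abstract mentions. The function names (mean_imputed_score, fisher_statistic, null_pvalue) are hypothetical.

```python
import math
import random


def mean_imputed_score(alpha, below):
    """Conditional mean of the Fisher score -2*log(p) for p ~ Uniform(0, 1).

    below=True  : the study only reports p < alpha  (gene on the DE list)
    below=False : the study only reports p >= alpha (gene not on the list)
    Assumption: this conditional mean is what gets imputed.
    """
    if below:
        # E[-log p | p < alpha] = 1 - log(alpha)
        return 2.0 * (1.0 - math.log(alpha))
    # E[-log p | p >= alpha] = 1 + alpha*log(alpha) / (1 - alpha)
    return 2.0 * (1.0 + alpha * math.log(alpha) / (1.0 - alpha))


def fisher_statistic(observed, truncated):
    """Fisher's statistic with imputed scores for truncated studies.

    observed  : list of exact p-values
    truncated : list of (alpha, below) pairs, one per truncated study
    """
    stat = sum(-2.0 * math.log(p) for p in observed)
    stat += sum(mean_imputed_score(a, b) for a, b in truncated)
    return stat


def null_pvalue(stat, n_observed, alphas, n_sim=20000, seed=0):
    """Monte Carlo p-value of `stat` under the global null.

    Each simulated study draws p ~ Uniform(0, 1); truncated studies keep
    only the indicator p < alpha and are then imputed as above.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], so log() is always defined
        obs = [1.0 - rng.random() for _ in range(n_observed)]
        trunc = [(a, rng.random() < a) for a in alphas]
        if fisher_statistic(obs, trunc) >= stat:
            hits += 1
    # add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Two studies with exact p-values, one reporting only "p < 0.001".
    stat = fisher_statistic([0.01, 0.03], [(0.001, True)])
    print(stat, null_pvalue(stat, n_observed=2, alphas=[0.001]))
```

Because an imputed score is a two-point random variable rather than a chi-square variate, the usual chi-square reference distribution with 2K degrees of freedom no longer applies, which is why the sketch simulates the null instead.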

SUBMITTER: Tang S

PROVIDER: S-EPMC4274812 | biostudies-literature | 2014 Dec

REPOSITORIES: biostudies-literature


Publications

Imputation of Truncated p-Values For Meta-Analysis Methods and Its Genomic Application.

Tang Shaowu, Ding Ying, Sibille Etienne, Mogil Jeffrey, Lariviere William R, Tseng George C

The Annals of Applied Statistics, 2014-12-01, issue 4


PMID: 25541588

Similar Datasets

Project description: Real-world data analysis and processing with data mining techniques often face observations that contain missing values, and these missing values are the main challenge in mining such datasets. They should be imputed to improve the accuracy and performance of the data mining methods. Existing techniques use the k-nearest neighbors algorithm to impute missing values, but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as in the case of missing data, hard clustering often provides a poor description. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method, called KI, is proposed first; it incorporates k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each incomplete record is found through record similarity using the k-nearest neighbors algorithm (kNN). To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method, called FCKI, is then proposed as an extension of KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors. It applies two levels of similarity to achieve a higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets of different sizes with varying missing ratios for three types of missing data: MCAR, MAR, and MNAR. These missing data types are generated in this work. The proposed imputation techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.

Project description: Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R² = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R² = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.

Project description: Background: Missing preadmission serum creatinine (SCr) values are a common obstacle to assess acute kidney injury (AKI) diagnosis and outcomes. The Kidney Disease Improving Global Outcomes (KDIGO) guidelines suggest using a SCr computed from the Modification of Diet in Renal Disease (MDRD) with an estimated glomerular filtration rate of 75 ml/min/1.73 m². We aimed to identify the best surrogate method for baseline SCr to assess AKI diagnosis and outcomes. Methods: We compared the use of 1) first SCr at hospital admission, 2) minimal SCr over 2 weeks after intensive care unit admission, 3) MDRD computed SCr and 4) Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) computed SCr to assess AKI diagnosis and outcomes. We then performed multilinear regression models to predict preadmission SCr and imputation strategies to assess AKI diagnosis. Results: Our one-year retrospective cohort study included 1001 critically ill adults; 498 of them had preadmission SCr values. In these patients, AKI incidence was 25.1% using preadmission SCr. First SCr had the best agreement for AKI diagnosis (22.5%; kappa = 0.90) and staging (kappa = 0.81). MDRD, CKD-EPI and minimal SCr overestimated AKI diagnosis (26.7%, 27.1% and 43.2%; kappa = 0.86, 0.86 and 0.60, respectively). However, MDRD and CKD-EPI computed SCr had a better sensitivity than first SCr for AKI (93% and 94% vs. 87%). Eighty-eight percent of patients experienced renal recovery at least 3 months after hospital discharge. All methods except the first SCr significantly underestimated the percentage of renal recovery. In a multivariate model, age, male gender, hypertension, heart failure, undergoing surgery and log first SCr best predicted preadmission SCr (adjusted R² = 0.56). Imputation methods with first SCr increased AKI incidence to 23.9% (kappa = 0.92) but not with MDRD computed SCr (26.7%; kappa = 0.89). Conclusion: In our cohort, first SCr performed better for AKI diagnosis and staging, as well as for renal recovery after hospital discharge than MDRD, CKD-EPI or minimal SCr. However, MDRD SCr and CKD-EPI SCr improved AKI diagnosis sensitivity. Imputation methods minimally increased agreement for AKI diagnosis.
