Dataset Information

Fast and robust ancestry prediction using principal component analysis.


ABSTRACT:

Motivation

Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Two popular methods for predicting PC scores are simple projection (SP), which is based on principal component loadings, and the more recently developed data augmentation, decomposition and Procrustes (ADP) transformation, as used in LASER and TRACE. However, the PC scores predicted by SP can be biased toward zero (the null). ADP, on the other hand, has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset.
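The bias of SP described above can be illustrated with a small simulation. The following is a minimal sketch, not the FRAPOSA implementation: the function names, the two-population toy model, and all sizes and effect values are assumptions chosen for illustration. It fits PCA on a reference panel, projects independent study samples onto the reference loadings, and compares the typical magnitude of the PC1 scores in the two groups.

```python
import numpy as np


def fit_reference_pca(X_ref, k):
    """PCA of the reference panel via SVD of the column-centered matrix."""
    mean = X_ref.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    loadings = Vt[:k].T            # shape (n_features, k)
    ref_scores = U[:, :k] * s[:k]  # shape (n_ref, k)
    return mean, loadings, ref_scores


def simple_projection(X_study, mean, loadings):
    """SP: center with the reference mean, then multiply by the loadings."""
    return (X_study - mean) @ loadings


def simulate(n, n_features, effect, rng):
    """Hypothetical toy model: two balanced populations whose feature
    means differ by 2 * effect, plus standard normal noise."""
    labels = np.where(np.arange(n) % 2 == 0, -1.0, 1.0)
    return effect * labels[:, None] + rng.standard_normal((n, n_features))


rng = np.random.default_rng(0)
# Many more features than reference samples, as with genotype data.
X_ref = simulate(100, 2000, 0.1, rng)
X_study = simulate(200, 2000, 0.1, rng)

mean, loadings, ref_scores = fit_reference_pca(X_ref, k=1)
study_scores = simple_projection(X_study, mean, loadings)

# Ratio of typical |PC1| in the study samples to that in the reference.
# A value below 1 means the projected scores sit closer to zero.
shrinkage = np.abs(study_scores).mean() / np.abs(ref_scores).mean()
print(f"study / reference mean |PC1|: {shrinkage:.2f}")
```

Projecting the reference samples themselves reproduces their PC scores exactly, so the gap appears only for samples that were not used to estimate the loadings. In this toy setting, with far more features than reference samples, the ratio should come out well below 1.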

Results

We propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. Extensive simulation studies show that these alternative approaches are unbiased and can be 16 to 16,000 times faster than ADP. We applied our approaches to the UK Biobank data of 488,366 study samples, with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1,628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, whereas the proposed approaches did not.
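The UK Biobank timings above imply the following speedups relative to ADP. This is plain arithmetic on the three reported CPU-hour figures; the variable names are illustrative, and the ADP figure is the projected time reported in the abstract, not a measured run.

```python
# CPU hours reported for the UK Biobank analysis.
cpu_hours = {"AP": 0.82, "OADP": 21.0}
adp_projected_hours = 1628.0  # projected, not measured

# Speedup of each method relative to the projected ADP time.
speedups = {method: adp_projected_hours / hours
            for method, hours in cpu_hours.items()}

for method, factor in speedups.items():
    print(f"{method}: about {factor:.0f}x faster than ADP")
```

AP comes out at roughly 1,985 times faster and OADP at roughly 78 times faster. Both values fall inside the 16 to 16,000 range quoted for the simulation studies, although that range and these timings come from different experiments.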

Availability and implementation

The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa.

Contact

leeshawn@umich.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Zhang D 

PROVIDER: S-EPMC7267814 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC5515900 | biostudies-other
| S-EPMC8297813 | biostudies-literature
| S-EPMC3395474 | biostudies-literature
| S-EPMC10465116 | biostudies-literature
| S-EPMC5644186 | biostudies-literature
2019-02-26 | GSE120584 | GEO
| S-EPMC3981753 | biostudies-literature
| S-EPMC4720596 | biostudies-literature
| S-EPMC3182154 | biostudies-other
| S-EPMC10025984 | biostudies-literature