Dataset Information

Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

ABSTRACT:

Motivation

Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses, making some software obsolete and limiting researchers' access to diverse analysis tools.

Results

Here we present two R packages, bigstatsr and bigsnpr, which allow large-scale genomic data to be analyzed within R. To address large data size, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the commonly used tools, either through transparent system calls to existing software or through updated or improved implementations of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium, and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer.
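The memory-mapping idea described above can be illustrated with a minimal sketch. It uses Python's standard `mmap` module, not bigstatsr or bigsnpr, and is not the packages' implementation. The file layout (one byte per genotype, stored variant by variant) and the function names `write_genotypes` and `allele_frequencies` are assumptions made for this example only.

```python
import mmap
import os
import tempfile


def write_genotypes(path, n_individuals, n_variants):
    """Write a toy genotype matrix (values 0, 1, 2) to disk.

    Hypothetical layout: one byte per entry, variant-major, so the
    genotypes of one variant are contiguous in the file.
    """
    with open(path, "wb") as f:
        for j in range(n_variants):
            f.write(bytes((i + j) % 3 for i in range(n_individuals)))


def allele_frequencies(path, n_individuals, n_variants):
    """Compute per-variant allele frequencies from a memory-mapped file.

    Slices are taken from the mapping one variant at a time, so the
    whole matrix is never held in Python memory at once.
    """
    freqs = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for j in range(n_variants):
                start = j * n_individuals
                column = mm[start:start + n_individuals]
                # Each individual carries two alleles, hence the factor 2.
                freqs.append(sum(column) / (2 * n_individuals))
    return freqs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "genotypes.bin")
        write_genotypes(path, n_individuals=1000, n_variants=50)
        print(allele_frequencies(path, 1000, 50)[:5])
```

Because the operating system pages data in on demand, the same access pattern works when the file is much larger than available RAM, which is the property the packages rely on.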

Availability and implementation

https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Prive F

PROVIDER: S-EPMC6084588 | biostudies-literature | 2018 Aug

REPOSITORIES: biostudies-literature


Publications

Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

Privé F, Aschard H, Ziyatdinov A, Blum MGB

Bioinformatics (Oxford, England), 2018-08-01, issue 16


PMID: 29617937

Similar Datasets

Project description:

Motivation

We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by either microarray or RNA-seq platforms. The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets.

Results

In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. Then, the parameter space is linear with the number of datasets. In our previous study, we have applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer.

Availability and implementation

Additional results are included in a supplemental file. Computer program R-functions are freely available at http://home.gwu.edu/∼ylai/research/Concordance.

Contact

ylai@gwu.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Project description: Pathway analysis of genome-wide association studies (GWAS) offers a unique opportunity to collectively evaluate genetic variants with effects that are too small to be detected individually. We applied a pathway analysis to a bladder cancer GWAS containing data from 3,532 cases and 5,120 controls of European background (n = 5 studies). Thirteen hundred and ninety-nine pathways were drawn from five publicly available resources (Biocarta, Kegg, NCI-PID, HumanCyc, and Reactome), and we constructed 22 additional candidate pathways previously hypothesized to be related to bladder cancer. In total, 1421 pathways, 5647 genes and ~90,000 SNPs were included in our study. A logistic regression model adjusting for age, sex, study, DNA source, and smoking status was used to assess the marginal trend effect of SNPs on bladder cancer risk. Two complementary pathway-based methods (gene-set enrichment analysis [GSEA] and adapted rank-truncated product [ARTP]) were used to assess the enrichment of association signals within each pathway. Eighteen pathways were detected by either GSEA or ARTP at P ≤ 0.01. To minimize false positives, we used the I² statistic to identify SNPs displaying heterogeneous effects across the five studies. After removing these SNPs, seven pathways ('Aromatic amine metabolism' [P(GSEA) = 0.0100, P(ARTP) = 0.0020], 'NAD biosynthesis' [P(GSEA) = 0.0018, P(ARTP) = 0.0086], 'NAD salvage' [P(ARTP) = 0.0068], 'Clathrin derived vesicle budding' [P(ARTP) = 0.0018], 'Lysosome vesicle biogenesis' [P(GSEA) = 0.0023, P(ARTP) < 0.00012], 'Retrograde neurotrophin signaling' [P(GSEA) = 0.00840], and 'Mitotic metaphase/anaphase transition' [P(GSEA) = 0.0040]) remained. These pathways seem to belong to three fundamental cellular processes (metabolic detoxification, mitosis, and clathrin-mediated vesicles). Identification of the aromatic amine metabolism pathway provides support for the ability of this approach to identify pathways with established relevance to bladder carcinogenesis.
