
Dataset Information


beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.


ABSTRACT: Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set.
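The idea of representation-agnostic access described in the abstract can be illustrated with a minimal sketch. This is not the beachmat API: the names `NumericMatrix`, `DenseMatrix`, `SparseMatrix`, `get_col` and `column_sums` are hypothetical and chosen only for illustration. The sketch assumes column-major dense storage (as in an ordinary R matrix) and compressed sparse column storage, and shows one routine that works unchanged on both.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface (NOT the beachmat API): every matrix
// representation exposes its dimensions and a way to read one column.
class NumericMatrix {
public:
    virtual ~NumericMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column `c` into `out`, which must have room for nrow() values.
    virtual void get_col(std::size_t c, double* out) const = 0;
};

// Dense, column-major storage.
class DenseMatrix : public NumericMatrix {
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> values)
        : nr_(nr), nc_(nc), values_(std::move(values)) {
        assert(values_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, double* out) const override {
        std::copy(values_.begin() + c * nr_, values_.begin() + (c + 1) * nr_, out);
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> values_;
};

// Compressed sparse column storage: only non-zero entries are kept.
class SparseMatrix : public NumericMatrix {
public:
    SparseMatrix(std::size_t nr, std::size_t nc,
                 std::vector<std::size_t> col_ptr,
                 std::vector<std::size_t> row_idx,
                 std::vector<double> values)
        : nr_(nr), nc_(nc), col_ptr_(std::move(col_ptr)),
          row_idx_(std::move(row_idx)), values_(std::move(values)) {
        assert(col_ptr_.size() == nc_ + 1);
        assert(row_idx_.size() == values_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, double* out) const override {
        std::fill(out, out + nr_, 0.0);  // implicit zeros
        for (std::size_t k = col_ptr_[c]; k < col_ptr_[c + 1]; ++k) {
            out[row_idx_[k]] = values_[k];
        }
    }
private:
    std::size_t nr_, nc_;
    std::vector<std::size_t> col_ptr_, row_idx_;
    std::vector<double> values_;
};

// Analysis code written once against the interface; it does not need to
// know which representation backs the matrix.
std::vector<double> column_sums(const NumericMatrix& m) {
    std::vector<double> buffer(m.nrow());
    std::vector<double> sums(m.ncol(), 0.0);
    for (std::size_t c = 0; c < m.ncol(); ++c) {
        m.get_col(c, buffer.data());
        for (double v : buffer) sums[c] += v;
    }
    return sums;
}
```

Calling `column_sums` on a `DenseMatrix` and on a `SparseMatrix` holding the same values should give the same result. A file-backed representation could be added as another subclass without touching `column_sums`. The virtual call per column is one simple way to get this kind of dispatch and is not necessarily how beachmat itself does it.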

SUBMITTER: Lun ATL 

PROVIDER: S-EPMC5953501 | biostudies-literature | 2018 May

REPOSITORIES: biostudies-literature


Publications

beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.

Lun ATL, Pagès H, Smith ML

PLoS Computational Biology, 2018-05-03, issue 5



Similar Datasets

| S-EPMC4287624 | biostudies-literature
| S-EPMC4261523 | biostudies-literature
| S-EPMC5570157 | biostudies-literature
| S-EPMC4493686 | biostudies-literature
| S-EPMC7426509 | biostudies-literature
| S-EPMC5621122 | biostudies-literature
| S-EPMC1913543 | biostudies-literature
| S-EPMC2900279 | biostudies-other
| S-EPMC10507293 | biostudies-literature
| S-EPMC7821039 | biostudies-literature