Dataset Information

Learning sparse log-ratios for high-throughput sequencing data.

ABSTRACT:

Motivation

The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
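
To make the notion of a log-ratio biomarker concrete, the following is a minimal sketch rather than anything taken from the CoDaCoRe package. It assumes one common form of log-ratio, the log of the ratio between the geometric means of two groups of variables; the abstract does not specify the form. The function names and read counts are hypothetical. The sketch also counts the candidate (numerator, denominator) pairs, which shows why exhaustive search becomes impractical as the number of variables grows.

```python
import math

def balance(x, numerator, denominator):
    """Log-ratio between the geometric means of two groups of variables.

    Assumed form of log-ratio, used here for illustration only.
    """
    mean_log_num = sum(math.log(x[i]) for i in numerator) / len(numerator)
    mean_log_den = sum(math.log(x[i]) for i in denominator) / len(denominator)
    return mean_log_num - mean_log_den

def count_candidate_log_ratios(p):
    """Ordered pairs of disjoint, non-empty (numerator, denominator) subsets.

    Each of the p variables goes to the numerator, the denominator, or
    neither (3**p assignments); assignments with an empty side are removed
    by inclusion-exclusion.
    """
    return 3 ** p - 2 ** (p + 1) + 1

# Hypothetical read counts for one sample with four variables.
counts = [120.0, 30.0, 50.0, 800.0]
proportions = [c / sum(counts) for c in counts]

# The log-ratio is unchanged when the sample is rescaled, which is why it
# suits compositional data such as sequencing counts.
b_counts = balance(counts, numerator=[0, 3], denominator=[1])
b_props = balance(proportions, numerator=[0, 3], denominator=[1])
print(b_counts, b_props)

# The search space grows exponentially with the number of variables.
for p in (3, 10, 100):
    print(p, count_candidate_log_ratios(p))
```

The two printed log-ratios agree up to floating-point rounding, and the count for 100 variables is already far beyond what could be enumerated.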

Results

Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
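
The general idea of a continuous relaxation optimized by gradient descent can be sketched as below. This is a toy illustration under stated assumptions, not the CoDaCoRe algorithm: the tanh-based soft assignment, the logistic loss, the fixed threshold, and every function name are hypothetical choices made for the example, and the four-sample dataset is made up. Each variable gets a real-valued weight whose tanh acts as a soft membership in the numerator (near +1) or denominator (near -1); after training, the weights are thresholded back into a discrete log-ratio.

```python
import math

def clr(row):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(v) for v in row]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, L, y):
    """Mean logistic loss of the relaxed log-contrast sum_j tanh(w_j) * L_ij."""
    s = [math.tanh(wj) for wj in w]
    total = 0.0
    for row, label in zip(L, y):
        p = sigmoid(sum(sj * lj for sj, lj in zip(s, row)))
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(y)

def fit_relaxed_weights(L, y, steps=200, lr=0.5):
    """Plain gradient descent on the relaxed (continuous) weights."""
    w = [0.0] * len(L[0])
    for _ in range(steps):
        s = [math.tanh(wj) for wj in w]
        grad = [0.0] * len(w)
        for row, label in zip(L, y):
            residual = sigmoid(sum(sj * lj for sj, lj in zip(s, row))) - label
            for j, lj in enumerate(row):
                # Chain rule through tanh: d tanh(w)/dw = 1 - tanh(w)^2.
                grad[j] += residual * lj * (1.0 - s[j] ** 2) / len(y)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def discretize(w, threshold=0.5):
    """Map soft memberships back to discrete numerator/denominator sets."""
    s = [math.tanh(wj) for wj in w]
    numerator = [j for j, sj in enumerate(s) if sj > threshold]
    denominator = [j for j, sj in enumerate(s) if sj < -threshold]
    return numerator, denominator

# Made-up data: the label depends only on the ratio of variables 0 and 1.
X = [[8.0, 1.0, 3.0, 3.0],
     [6.0, 2.0, 3.0, 3.0],
     [1.0, 8.0, 3.0, 3.0],
     [2.0, 6.0, 3.0, 3.0]]
y = [1, 1, 0, 0]

L = [clr(row) for row in X]
initial_loss = logistic_loss([0.0] * 4, L, y)
w = fit_relaxed_weights(L, y)
final_loss = logistic_loss(w, L, y)
numerator, denominator = discretize(w)
print(numerator, denominator, initial_loss, final_loss)
```

On this toy data the thresholded weights select variable 0 for the numerator and variable 1 for the denominator, leaving the two uninformative variables out. The point of the relaxation is that the search over discrete subsets is replaced by a smooth objective that gradient descent can handle.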

Availability and implementation

The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Gordon-Rodriguez E

PROVIDER: S-EPMC8696089 | biostudies-literature | 2021 Dec

REPOSITORIES: biostudies-literature

Publications

Learning sparse log-ratios for high-throughput sequencing data.

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham

Bioinformatics (Oxford, England), 2021-12-01, issue 1

PMID: 34498030

Similar Datasets

Super-sparse principal component analyses for high-throughput genomic data. | S-EPMC2902448 | biostudies-literature

Log-ratio lasso: Scalable, sparse estimation for log-ratio models. | S-EPMC9470385 | biostudies-literature

Compression of structured high-throughput sequencing data. | S-EPMC3832420 | biostudies-literature

scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis. | S-EPMC11328435 | biostudies-literature

Population genomic inferences from sparse high-throughput sequencing of two populations of Drosophila melanogaster. | S-EPMC2839279 | biostudies-literature

Detecting Alu insertions from high-throughput sequencing data. | S-EPMC3783187 | biostudies-literature

Genotype-Frequency Estimation from High-Throughput Sequencing Data. | S-EPMC4596663 | biostudies-literature

Savant: genome browser for high-throughput sequencing data. | S-EPMC3271355 | biostudies-literature

ReSeq simulates realistic Illumina high-throughput sequencing data. | S-EPMC7896392 | biostudies-literature

Linkage Disequilibrium Estimation in Low Coverage High-Throughput Sequencing Data. | S-EPMC5972415 | biostudies-other