Dataset Information

SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications.


ABSTRACT:

Motivation

The extraction of k-mers is a fundamental component of many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the number of k-mers to be considered, which grows exponentially with k. However, several applications require only the frequent k-mers, that is, the k-mers that appear in a relatively high proportion of the data.
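
To make the notion of a frequent k-mer concrete, the minimal Python sketch below counts every k-mer in a small set of reads and keeps those whose frequency reaches a threshold. The function names, the threshold parameter theta, and the definition of frequency as a k-mer's share of all k-mer occurrences are assumptions made for illustration; they are not taken from the SPRISS code.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (exact, no sampling)."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, theta):
    """Return {k-mer: frequency} for k-mers with frequency >= theta.

    Frequency is assumed here to be the k-mer's count divided by the
    total number of k-mer occurrences in the dataset.
    """
    counts = count_kmers(reads, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items() if c / total >= theta}


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGA"]  # hypothetical toy input
    print(frequent_kmers(toy_reads, k=2, theta=0.3))
```

Even in this toy form the cost is visible: every position of every read is visited, and the counter may hold up to 4^k distinct keys for DNA.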

Results

In this work we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful read-sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping.
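
The general idea of sampling reads and then counting k-mers only in the sample can be sketched as follows. This is a simplified illustration, not the SPRISS implementation: each read is kept independently with a fixed probability, and the parameters fraction, theta, and seed, as well as all function names, are hypothetical. The abstract does not describe how SPRISS chooses the sample size or thresholds, so those choices are left to the caller here.

```python
import random
from collections import Counter


def sample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


def approximate_frequent_kmers(reads, k, theta, fraction, seed=0):
    """Estimate frequent k-mers from a random sample of reads.

    Frequencies are computed within the sample and used as estimates
    of the frequencies in the full dataset.
    """
    sample = sample_reads(reads, fraction, seed)
    counts = Counter(
        read[i:i + k] for read in sample for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items() if c / total >= theta}


if __name__ == "__main__":
    # Hypothetical toy dataset: many copies of two short reads.
    toy_reads = ["ACGTACGT", "TTTTACGT"] * 500
    estimate = approximate_frequent_kmers(toy_reads, k=4, theta=0.1, fraction=0.1)
    for kmer, freq in sorted(estimate.items()):
        print(kmer, round(freq, 3))
```

Any exact k-mer counter could replace the `Counter` step, since the sampling stage only decides which reads are passed on to it.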

Availability

SPRISS is available at https://github.com/VandinLab/SPRISS.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Santoro D 

PROVIDER: S-EPMC9237683 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC3964109 | biostudies-literature
| S-EPMC6044908 | biostudies-literature
| S-EPMC4504488 | biostudies-literature
| S-EPMC7842384 | biostudies-literature
| S-EPMC6842140 | biostudies-literature
| S-EPMC6122726 | biostudies-literature
| S-EPMC3118166 | biostudies-literature
| PRJEB18734 | ENA
| S-EPMC5425679 | biostudies-literature
| S-EPMC6166224 | biostudies-literature