Dataset Information

SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers.


ABSTRACT:

Motivation

With vast improvements in sequencing technologies and a growing number of protocols, sequencing is being used to answer complex biological problems. Consequently, analysis pipelines have become more time-consuming and complicated, usually requiring extensive prevalidation steps. Here, we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type using Random Forest classifiers trained on biases inherent in k-mer frequencies and repeat sequence identities.
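As a rough illustration of the feature-extraction idea mentioned above, the following minimal sketch turns a set of reads into a fixed-length vector of normalized k-mer frequencies. It is not SeqWho's implementation: the function name `kmer_frequencies`, the choice of k, the handling of ambiguous bases, and the toy reads are all assumptions made for this example. Vectors of this kind could then be passed to any Random Forest implementation for training and classification.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=3):
    """Return normalized k-mer frequencies over A/C/G/T for a list of reads.

    Hypothetical helper for illustration; not part of SeqWho.
    """
    alphabet = "ACGT"
    # Enumerate all 4**k k-mers in a fixed order so that every input
    # maps to a vector of the same length and layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if all(base in alphabet for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "GGNACG"]
vector = kmer_frequencies(reads, k=2)
```

Here `vector` has 4**2 = 16 entries that sum to 1, one per dinucleotide, with windows containing `N` ignored.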

Results

Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries, assigning the correct species, the correct library, and both together 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.
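To make the three reported figures concrete, the sketch below computes per-species, per-library, and joint accuracy from paired labels, assuming that "both together" means the species and the library are both predicted correctly for the same file. The function name `accuracies`, the placeholder library names, and the toy labels are hypothetical and are not taken from the paper.

```python
def accuracies(true_labels, predicted_labels):
    """Return (species, library, joint) accuracy for (species, library) tuples.

    Hypothetical helper for illustration; not part of SeqWho.
    """
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)
    species = sum(t[0] == p[0] for t, p in pairs) / n
    library = sum(t[1] == p[1] for t, p in pairs) / n
    # Joint accuracy counts a file only when both labels are correct.
    joint = sum(t == p for t, p in pairs) / n
    return species, library, joint


# Toy labels, invented for illustration; "lib_a" etc. are placeholders.
truth = [("human", "lib_a"), ("mouse", "lib_b"),
         ("human", "lib_c"), ("mouse", "lib_a")]
pred = [("human", "lib_a"), ("mouse", "lib_a"),
        ("mouse", "lib_c"), ("mouse", "lib_a")]
species_acc, library_acc, joint_acc = accuracies(truth, pred)
```

Under this definition the joint accuracy can never exceed either single-label accuracy, which is consistent with the ordering of the three percentages reported above.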

Availability and implementation

https://github.com/DaehwanKimLab/seqwho.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Bennett C 

PROVIDER: S-EPMC8963323 | biostudies-literature | 2022 Mar

REPOSITORIES: biostudies-literature

Publications

SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers.

Bennett Christopher, Thornton Micah, Park Chanhee, Henry Gervaise, Zhang Yun, Malladi Venkat, Kim Daehwan

Bioinformatics (Oxford, England), 2022-03-01, issue 7


Similar Datasets

| S-EPMC5536784 | biostudies-literature
| S-EPMC2739274 | biostudies-literature
| S-EPMC6579812 | biostudies-literature
| S-EPMC5600879 | biostudies-literature
| S-EPMC7585298 | biostudies-literature
| S-EPMC6958757 | biostudies-literature
| S-EPMC8558547 | biostudies-literature
| S-EPMC5946877 | biostudies-literature
| S-EPMC10705532 | biostudies-literature
| S-EPMC4053362 | biostudies-literature