Dataset Information

A better sequence-read simulator program for metagenomics.

ABSTRACT:

Background

There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically derived information from metagenomics sequencing data.
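To make the contrast between parametric and non-parametric read-length models concrete, here is a minimal Python sketch that resamples read lengths directly from observed counts instead of fitting a uniform or normal distribution. It is an illustration only and is not taken from BEAR; the function name `empirical_length_sampler` and the example lengths are hypothetical.

```python
import random
from collections import Counter


def empirical_length_sampler(observed_lengths, seed=None):
    """Build a sampler that draws read lengths from the observed
    (non-parametric) length distribution.

    Hypothetical helper for illustration; not part of any simulator.
    """
    if not observed_lengths:
        raise ValueError("need at least one observed read length")

    # Tally how often each length occurs in the real data.
    counts = Counter(observed_lengths)
    lengths = list(counts.keys())
    weights = list(counts.values())
    rng = random.Random(seed)

    def sample(n):
        # Draw n lengths with probability proportional to observed frequency.
        return rng.choices(lengths, weights=weights, k=n)

    return sample


# Made-up bimodal read lengths, e.g. from a mixed-length sequencing run.
observed = [98, 100, 100, 101, 250, 251, 251, 252]
sample = empirical_length_sampler(observed, seed=0)
simulated = sample(1000)
```

Because the sampler only ever returns lengths that were observed, a bimodal input stays bimodal, whereas a single fitted normal distribution would place most simulated reads between the two modes.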

Results

We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task.

Conclusions

BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.

SUBMITTER: Johnson S

PROVIDER: S-EPMC4168713 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

A better sequence-read simulator program for metagenomics.

Stephen Johnson, Brett Trost, Jeffrey R Long, Vanessa Pittet, Anthony Kusalik

BMC Bioinformatics, 2014-09-10

PMID: 25253095

Similar Datasets

NanoSim: nanopore sequence read simulator based on statistical characterization.
| S-EPMC5530317 | biostudies-literature

Metagenomics: read length matters.
| S-EPMC2258652 | biostudies-literature

ART: a next-generation sequencing read simulator.
| S-EPMC3278762 | biostudies-literature

MetaSim: a sequencing simulator for genomics and metagenomics.
| S-EPMC2556396 | biostudies-literature

NeSSM: a Next-generation Sequencing Simulator for Metagenomics.
| S-EPMC3790878 | biostudies-literature

Unlocking short read sequencing for metagenomics.
| S-EPMC2911387 | biostudies-literature

SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data.
| S-EPMC3926339 | biostudies-literature

Benchmarking short-read metagenomics tools for removing host contamination.
| S-EPMC11878760 | biostudies-literature

scReadSim: a single-cell RNA-seq and ATAC-seq read simulator.
| S-EPMC10657386 | biostudies-literature

COMPASS: the COMPletely Arbitrary Sequence Simulator.
| S-EPMC5870535 | biostudies-literature