Dataset Information

Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model.


ABSTRACT:

Background

454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.

Results

We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
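As a rough illustration of the idea described above, the following Python sketch evaluates a plain hurdle Poisson distribution over homopolymer lengths, calls the most probable length, and converts the residual probability into a Phred-style score. It is not HPCall's implementation: the abstract describes a weighted model driven by 454 noise predictors, whereas here the parameters `p_zero` and `lam` are simply assumed to be given, and all function names, the `max_len` truncation, and the score cap are hypothetical choices made for illustration.

```python
import math


def hurdle_poisson_pmf(k, p_zero, lam):
    """P(K = k) for a hurdle Poisson: point mass at 0 plus a zero-truncated Poisson.

    p_zero and lam are assumed inputs; in a fitted model they would depend on covariates.
    """
    if k < 0:
        return 0.0
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Renormalise the Poisson over k >= 1 and scale by the probability of crossing the hurdle.
    return (1.0 - p_zero) * poisson_k / (1.0 - math.exp(-lam))


def call_homopolymer(p_zero, lam, max_len=20):
    """Return the most probable homopolymer length and the probabilities for 0..max_len."""
    probs = [hurdle_poisson_pmf(k, p_zero, lam) for k in range(max_len + 1)]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def directional_errors(best, probs):
    """Split the miscall probability by direction (illustrative convention).

    Mass below the call means the call may be too long (insertion);
    mass above the call means the call may be too short (deletion).
    """
    p_insertion = sum(probs[:best])
    p_deletion = sum(probs[best + 1:])
    return p_insertion, p_deletion


def phred(p_error, cap=60.0):
    """Phred-style score -10 * log10(p_error), capped to avoid infinities."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


if __name__ == "__main__":
    best, probs = call_homopolymer(p_zero=0.05, lam=3.2)
    p_ins, p_del = directional_errors(best, probs)
    print("called length:", best)
    print("insertion quality:", round(phred(p_ins), 2))
    print("deletion quality:", round(phred(p_del), 2))
```

Reporting separate scores for the two directions mirrors the abstract's point that a single quality value does not say whether an insertion or a deletion is the more likely error.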

Conclusions

Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.

SUBMITTER: Beuf KD 

PROVIDER: S-EPMC3534400 | biostudies-literature | 2012 Nov

REPOSITORIES: biostudies-literature

Publications

Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model.

De Beuf K, De Schrijver J, Thas O, Van Criekinge W, Irizarry RA, Clement L

BMC Bioinformatics, 2012 Nov 15


Similar Datasets

| S-EPMC9237988 | biostudies-literature
| S-EPMC9825753 | biostudies-literature
| S-EPMC2577856 | biostudies-literature
| S-EPMC9167575 | biostudies-literature
| S-EPMC2575221 | biostudies-literature
| S-EPMC5862240 | biostudies-literature
| S-EPMC2745764 | biostudies-literature
| PRJEB36644 | ENA
| S-EPMC3776450 | biostudies-literature
| S-EPMC3557274 | biostudies-literature