Dataset Information

Probabilistic base calling of Solexa sequencing data.


ABSTRACT: BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
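
The idea of coding ambiguous bases with IUPAC symbols can be illustrated with a minimal sketch. It assumes that per-base probabilities for one sequencing cycle are already available (for instance from a clustering of the intensities), and it uses a simple cumulative-probability cutoff to decide which bases to keep. The function names, the `cutoff` parameter, and the cutoff rule itself are illustrative assumptions, not the method or API of the R package described above. The information content shown here is the usual 2 bits minus the Shannon entropy, which may differ from the score used by the authors.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs, cutoff=0.9):
    """Hypothetical rule: return the IUPAC symbol for the smallest set of
    most probable bases whose total probability reaches `cutoff`."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= cutoff:
            break
    return IUPAC[frozenset(chosen)]


def information_content(probs):
    """Information in bits at one position: 2 minus the Shannon entropy."""
    return 2.0 + sum(p * math.log2(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    ambiguous = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
    for probs in (confident, ambiguous):
        print(call_base(probs), round(information_content(probs), 3))
```

Under this sketch, a position where A and G are nearly equally likely is reported as R rather than forced to a single base, and its lower information content flags it as a candidate for trimming when it lies near the end of a read.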

SUBMITTER: Rougemont J 

PROVIDER: S-EPMC2575221 | biostudies-literature | 2008

REPOSITORIES: biostudies-literature

Publications

Probabilistic base calling of Solexa sequencing data.

Rougemont Jacques, Amzallag Arnaud, Iseli Christian, Farinelli Laurent, Xenarios Ioannis, Naef Felix

BMC Bioinformatics, 2008-10-13


Similar Datasets

| S-EPMC4053729 | biostudies-literature
| PRJEB36644 | ENA
| S-EPMC2734321 | biostudies-literature
| S-EPMC3776450 | biostudies-literature
| S-EPMC5427492 | biostudies-literature
| S-EPMC3557274 | biostudies-literature
| S-EPMC3404070 | biostudies-literature
| S-EPMC8277855 | biostudies-literature
| S-EPMC6722845 | biostudies-literature
| S-EPMC5788064 | biostudies-literature