
Dataset Information


SEME: a fast mapper of Illumina sequencing reads with statistical evaluation.


ABSTRACT: Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows a perfect k-mer seed match for almost all reads, subject to reasonable specificity, when a close reference genome is available. Our other observation is that the majority of reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast-mapping approach, referred to as "SEME," which has two core steps: first, it scans a read sequentially in a specific order for a k-mer exact-match seed; next, it extends the alignment on both sides, allowing at most one short INDEL on each side, using a novel method called the "auto-match function." We decompose the evaluation of the sensitivity and specificity into two parts corresponding to the seed and extension steps, and the composite result provides an approximate overall reliability estimate of each mapping. We compare SEME with some existing mapping methods on several datasets, and SEME shows better performance in terms of both running time and mapping rates.
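
The two core steps named in the abstract (exact k-mer seeding, then extension on both sides with at most one short INDEL per side) can be illustrated with a minimal Python sketch. This is not the SEME implementation: the abstract does not describe the "auto-match function" or the scan order, so the sketch swaps in a brute-force search over a single indel and a left-to-right scan. All function names, the `max_indel` and `gap_cost` parameters, and the toy reference are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_seed(read, index, k):
    """Return (read_offset, reference_hits) for the first k-mer with an exact hit.

    Assumption: offsets are tried left to right; the abstract does not
    spell out the scan order SEME uses.
    """
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k])
        if hits:
            return offset, hits
    return None


def hamming(a, b):
    """Count mismatches of a against b; bases of a beyond the end of b count too."""
    return sum(1 for i, x in enumerate(a) if i >= len(b) or x != b[i])


def extend_one_side(read_part, ref_part, max_indel=3, gap_cost=2):
    """Brute-force stand-in for the extension step (NOT the auto-match function).

    Returns (cost, indel_position, signed_length): a positive length means
    extra bases in the read, a negative length means bases missing from it.
    """
    best = (hamming(read_part, ref_part), 0, 0)  # ungapped alignment
    for p in range(1, len(read_part)):
        head = hamming(read_part[:p], ref_part[:p])
        for d in range(1, max_indel + 1):
            # Insertion in the read: skip d read bases after position p.
            ins = head + hamming(read_part[p + d:], ref_part[p:]) + gap_cost
            # Deletion in the read: skip d reference bases after position p.
            dele = head + hamming(read_part[p:], ref_part[p + d:]) + gap_cost
            best = min(best, (ins, p, d), (dele, p, -d))
    return best


def map_read(read, reference, index, k, max_indel=3):
    """Seed, then extend both sides; return (cost, seed_ref_pos, seed_read_offset)."""
    seed = find_seed(read, index, k)
    if seed is None:
        return None
    offset, hits = seed
    right_read = read[offset + k:]
    left_read = read[:offset][::-1]  # reversed so extension runs outward from the seed
    best = None
    for pos in hits:
        right_ref = reference[pos + k:pos + k + len(right_read) + max_indel]
        left_ref = reference[max(0, pos - offset - max_indel):pos][::-1]
        cost = (extend_one_side(left_read, left_ref, max_indel)[0]
                + extend_one_side(right_read, right_ref, max_indel)[0])
        if best is None or cost < best[0]:
            best = (cost, pos, offset)
    return best


# Toy example: a read copied from the reference with two bases deleted.
toy_reference = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATCCATGCAAT"
toy_index = build_kmer_index(toy_reference, 8)
toy_read = toy_reference[10:22] + toy_reference[24:36]
print(map_read(toy_read, toy_reference, toy_index, 8))
```

The sketch scores a mapping by mismatches plus a flat gap cost. The statistical evaluation of sensitivity and specificity that the abstract describes is not modeled here.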

SUBMITTER: Chen S 

PROVIDER: S-EPMC3822393 | biostudies-literature | 2013 Nov

REPOSITORIES: biostudies-literature


Publications

SEME: a fast mapper of Illumina sequencing reads with statistical evaluation.

Shijian Chen, Anqi Wang, Lei M. Li

Journal of Computational Biology: a journal of computational molecular cell biology, 2013-11-01, issue 11



Similar Datasets

| S-EPMC3491410 | biostudies-literature
| S-EPMC11222498 | biostudies-literature
| S-EPMC4191382 | biostudies-literature
| S-EPMC6580563 | biostudies-literature
| S-EPMC7320720 | biostudies-literature
| S-EPMC4835549 | biostudies-literature
| S-EPMC3462201 | biostudies-literature
| S-EPMC4471408 | biostudies-literature
| S-EPMC5834899 | biostudies-literature
| S-EPMC6035725 | biostudies-literature