Dataset Information

MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data.


ABSTRACT: MOTIVATION: Reliable estimation of the mean fragment length for next-generation short-read sequencing data is an important step in next-generation sequencing analysis pipelines, most notably because of its impact on the accuracy of the enriched regions identified by peak-calling algorithms. Although many peak-calling algorithms include a fragment-length estimation subroutine, the problem has not been adequately solved, as demonstrated by the variability of the estimates returned by different algorithms. RESULTS: In this article, we investigate the use of strand cross-correlation to estimate mean fragment length of single-end data and show that traditional estimation approaches have mixed reliability. We observe that the mappability of different parts of the genome can introduce an artificial bias into cross-correlation computations, resulting in incorrect fragment-length estimates. We propose a new approach, called mappability-sensitive cross-correlation (MaSC), which removes this bias and allows for accurate and reliable fragment-length estimation. We analyze the computational complexity of this approach, and evaluate its performance on a test suite of NGS datasets, demonstrating its superiority to traditional cross-correlation analysis. AVAILABILITY: An open-source Perl implementation of our approach is available at http://www.perkinslab.ca/Software.html.
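The idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation: the function names, the per-position count lists `fwd` and `rev`, and the boolean mappability masks `fwd_ok` and `rev_ok` are all assumptions made for illustration. Without masks the code computes a plain strand cross-correlation at each shift; with masks it restricts the correlation to position pairs that are mappable on both strands, which is one simple reading of "mappability-sensitive".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation(fwd, rev, shift, fwd_ok=None, rev_ok=None):
    """Correlate forward-strand counts fwd[i] with reverse-strand counts rev[i + shift].

    If both masks are given (hypothetical per-strand mappability flags), only
    pairs with fwd_ok[i] and rev_ok[i + shift] both true contribute.
    """
    use_masks = fwd_ok is not None and rev_ok is not None
    xs, ys = [], []
    for i in range(len(fwd) - shift):
        j = i + shift
        if use_masks and not (fwd_ok[i] and rev_ok[j]):
            continue  # skip pairs that are not mappable on both strands
        xs.append(fwd[i])
        ys.append(rev[j])
    return pearson(xs, ys)


def estimate_shift(fwd, rev, max_shift, fwd_ok=None, rev_ok=None):
    """Return the shift in 0..max_shift with the highest correlation."""
    scores = [
        cross_correlation(fwd, rev, d, fwd_ok, rev_ok)
        for d in range(max_shift + 1)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: reverse-strand pile-ups sit 5 positions downstream of forward ones.
fwd = [0] * 40
rev = [0] * 40
for pos, height in ((5, 3), (17, 5), (28, 2)):
    fwd[pos] = height
    rev[pos + 5] = height

# Toy masks: the last four positions are treated as unmappable.
fwd_ok = [True] * 36 + [False] * 4
rev_ok = [True] * 36 + [False] * 4

naive_estimate = estimate_shift(fwd, rev, max_shift=15)
masked_estimate = estimate_shift(fwd, rev, max_shift=15, fwd_ok=fwd_ok, rev_ok=rev_ok)
```

On this toy input both estimates recover the planted offset because the masked region contains no reads. The sketch only shows the mechanics of restricting the correlation to mappable pairs; it does not reproduce the bias described in the abstract, and the exact mapping from the best shift to a mean fragment length depends on conventions not stated in this record.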

SUBMITTER: Ramachandran P 

PROVIDER: S-EPMC3570216 | biostudies-literature | 2013 Feb

REPOSITORIES: biostudies-literature

Publications

MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data.

Ramachandran P, Palidwor GA, Porter CJ, Perkins TJ

Bioinformatics (Oxford, England), 2013-01-07, issue 4



Similar Datasets

S-EPMC5657049 | biostudies-literature
S-EPMC9482146 | biostudies-literature
S-EPMC3307109 | biostudies-literature
S-EPMC6617613 | biostudies-literature
S-EPMC3413383 | biostudies-literature
S-EPMC2258652 | biostudies-literature
S-EPMC153413 | biostudies-literature
S-EPMC4442028 | biostudies-literature
S-EPMC7671308 | biostudies-literature
S-EPMC4896370 | biostudies-literature