Dataset Information

MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data.


ABSTRACT:

Motivation

Reliable estimation of the mean fragment length for next-generation short-read sequencing data is an important step in next-generation sequencing analysis pipelines, most notably because of its impact on the accuracy of the enriched regions identified by peak-calling algorithms. Although many peak-calling algorithms include a fragment-length estimation subroutine, the problem has not been adequately solved, as demonstrated by the variability of the estimates returned by different algorithms.

Results

In this article, we investigate the use of strand cross-correlation to estimate mean fragment length of single-end data and show that traditional estimation approaches have mixed reliability. We observe that the mappability of different parts of the genome can introduce an artificial bias into cross-correlation computations, resulting in incorrect fragment-length estimates. We propose a new approach, called mappability-sensitive cross-correlation (MaSC), which removes this bias and allows for accurate and reliable fragment-length estimation. We analyze the computational complexity of this approach, and evaluate its performance on a test suite of NGS datasets, demonstrating its superiority to traditional cross-correlation analysis.
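The idea in the Results paragraph can be illustrated with a minimal sketch. It is not the authors' Perl implementation. It assumes per-position read-start counts for each strand (`fwd`, `rev`) and binary per-position mappability masks (`map_fwd`, `map_rev`); all function and variable names here are hypothetical. The naive version correlates the two strands over every position, while the mappability-sensitive version restricts each shift `d` to positions where both the forward position and the shifted reverse position are mappable. Coordinate conventions for reverse-strand reads vary between tools, so the exact relation between the best shift and the fragment length is also an assumption.

```python
import numpy as np


def _pearson(a, b):
    """Pearson correlation of two equal-length arrays (NaN if undefined)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else float("nan")


def naive_cross_correlation(fwd, rev, d):
    """Correlate fwd[x] with rev[x + d] over every position x."""
    n = len(fwd) - d
    return _pearson(fwd[:n], rev[d:d + n])


def masked_cross_correlation(fwd, rev, map_fwd, map_rev, d):
    """Same correlation, restricted to positions mappable on both strands.

    Assumption: map_fwd[x] and map_rev[x] are booleans saying whether a read
    starting at x on that strand could be uniquely aligned.
    """
    n = len(fwd) - d
    keep = map_fwd[:n] & map_rev[d:d + n]
    if keep.sum() < 2:
        return float("nan")
    return _pearson(fwd[:n][keep], rev[d:d + n][keep])


def estimate_fragment_length(fwd, rev, map_fwd, map_rev, max_shift):
    """Return the shift in [0, max_shift] with the highest masked correlation."""
    scores = [
        masked_cross_correlation(fwd, rev, map_fwd, map_rev, d)
        for d in range(max_shift + 1)
    ]
    # np.nanargmax raises ValueError if every score is NaN.
    return int(np.nanargmax(scores))
```

When every position is marked mappable, the masked and naive correlations coincide; they differ only where unmappable positions would otherwise contribute to the sum.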

Availability

An open-source Perl implementation of our approach is available at http://www.perkinslab.ca/Software.html.

SUBMITTER: Ramachandran P 

PROVIDER: S-EPMC3570216 | biostudies-literature | 2013 Feb

REPOSITORIES: biostudies-literature

Publications

MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data.

Parameswaran Ramachandran, Gareth A. Palidwor, Christopher J. Porter, Theodore J. Perkins

Bioinformatics (Oxford, England), 2013-01-07, issue 4


Similar Datasets

| S-EPMC5657049 | biostudies-literature
| S-EPMC9482146 | biostudies-literature
| S-EPMC3307109 | biostudies-literature
| S-EPMC6617613 | biostudies-literature
| S-EPMC3413383 | biostudies-literature
| S-EPMC2258652 | biostudies-literature
| S-EPMC153413 | biostudies-literature
| S-EPMC4442028 | biostudies-literature
| S-EPMC7671308 | biostudies-literature
| S-EPMC4896370 | biostudies-literature