Dataset Information

Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments.


ABSTRACT: The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these data into standard mapping and peak-calling pipelines we demonstrate a 10-fold reduction in signals in regions common to the blacklisted region and identify a comprehensive set of regions that exhibit mapping sensitivity with the presence of the repeat-rich targets.
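The comparison described in the abstract (signal in blacklisted regions with and without the added mapping targets) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the interval representation, the function names, the `decoy_alpha_sat` contig name, and the toy read placements are hypothetical and are not taken from the authors' pipeline or data.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end), 0-based and half-open. Hypothetical representation.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals share at least one base on the same contig."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_in_regions(reads: Iterable[Interval], regions: List[Interval]) -> int:
    """Count aligned reads that overlap any of the given regions."""
    return sum(1 for read in reads if any(overlaps(read, reg) for reg in regions))


def fold_reduction(before: int, after: int) -> float:
    """Ratio of read counts before vs. after adding the extra mapping targets."""
    if after == 0:
        return float("inf")
    return before / after


# Toy blacklisted region on an assembled chromosome (illustrative only).
blacklist: List[Interval] = [("chr1", 1000, 2000)]

# Alignment against the reference alone: all 20 toy reads pile up in the region.
reads_reference_only: List[Interval] = [
    ("chr1", 1000 + 10 * i, 1050 + 10 * i) for i in range(20)
]

# Alignment with an added repeat-rich target: most toy reads now map to the
# hypothetical decoy contig instead of the assembled homologous region.
reads_with_targets: List[Interval] = [
    ("chr1", 1100, 1150),
    ("chr1", 1500, 1550),
] + [("decoy_alpha_sat", 0, 50)] * 18

before = count_in_regions(reads_reference_only, blacklist)
after = count_in_regions(reads_with_targets, blacklist)
reduction = fold_reduction(before, after)
```

In a real analysis the read intervals would come from alignment files and the regions from a published blacklist; the toy numbers here only show how the ratio is formed and say nothing about the reported result.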

SUBMITTER: Miga KH 

PROVIDER: S-EPMC4787761 | biostudies-literature | 2015 Nov

REPOSITORIES: biostudies-literature

Publications

Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments.

Karen H. Miga, Christopher Eisenhart, W. James Kent

Nucleic Acids Research, 2015-07-10, issue 20

Similar Datasets

S-EPMC3310402 | biostudies-literature
S-EPMC10510034 | biostudies-literature
S-EPMC4333381 | biostudies-literature
S-EPMC5961299 | biostudies-literature
S-EPMC9561269 | biostudies-literature
S-EPMC1182351 | biostudies-literature
S-EPMC6544187 | biostudies-literature
S-EPMC10107899 | biostudies-literature
S-EPMC4937194 | biostudies-literature
S-EPMC5734385 | biostudies-literature