
Dataset Information


Discovery and genotyping of novel sequence insertions in many sequenced individuals.


ABSTRACT: Despite recent advances in algorithm design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e., heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. Pamir is available at https://github.com/vpc-ccg/pamir. Contact: fhach@{sfu.ca, prostatecentre.com} or calkan@cs.bilkent.edu.tr. Supplementary data are available at Bioinformatics online.
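
The "one-end-anchored" signature mentioned in the abstract refers to read pairs in which one mate maps to the reference and the other does not. The following is a minimal sketch of how such pairs could be picked out of SAM-formatted alignments using the standard SAM flag bits. It is not taken from Pamir; the function names `classify_oea` and `oea_records` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: not Pamir's implementation.
# Flag bit values below are those defined by the SAM format specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def classify_oea(flag):
    """Classify one alignment record by its SAM flag (hypothetical helper).

    Returns "anchor" for a mapped read whose mate is unmapped,
    "orphan" for an unmapped read whose mate is mapped,
    and "other" for everything else.
    """
    if not flag & FLAG_PAIRED:
        return "other"
    # Ignore secondary and supplementary alignments.
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return "other"
    read_unmapped = bool(flag & FLAG_UNMAPPED)
    mate_unmapped = bool(flag & FLAG_MATE_UNMAPPED)
    if not read_unmapped and mate_unmapped:
        return "anchor"
    if read_unmapped and not mate_unmapped:
        return "orphan"
    return "other"


def oea_records(sam_lines):
    """Yield (read_name, role, reference_name, position) for one-end-anchored records."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        role = classify_oea(int(fields[1]))
        if role != "other":
            yield fields[0], role, fields[2], int(fields[3])
```

Under these assumptions, the "orphan" reads are the candidates that might carry inserted sequence, and the position of the matching "anchor" read gives a rough indication of where in the reference the insertion could lie.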

SUBMITTER: Kavak P 

PROVIDER: S-EPMC5870608 | biostudies-literature | 2017 Jul

REPOSITORIES: biostudies-literature


Publications

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Pinar Kavak, Yen-Yi Lin, Ibrahim Numanagic, Hossein Asghari, Tunga Güngör, Can Alkan, Faraz Hach

Bioinformatics (Oxford, England), 2017-07-01, issue 14



Similar Datasets

| S-EPMC4417122 | biostudies-literature
| S-EPMC3824118 | biostudies-literature
| S-EPMC10225079 | biostudies-literature
| S-EPMC4782575 | biostudies-literature
| S-EPMC10381067 | biostudies-literature
| S-EPMC5411763 | biostudies-literature
| S-EPMC2865866 | biostudies-literature
| S-EPMC3360789 | biostudies-literature
| S-EPMC8906366 | biostudies-literature
| S-EPMC4980031 | biostudies-literature