Dataset Information

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

ABSTRACT: Despite recent advances in algorithms designed to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. Pamir is available at https://github.com/vpc-ccg/pamir. Contact: fhach@{sfu.ca, prostatecentre.com} or calkan@cs.bilkent.edu.tr. Supplementary data are available at Bioinformatics online.

SUBMITTER: Kavak P

PROVIDER: S-EPMC5870608 | biostudies-literature | 2017 Jul

REPOSITORIES: biostudies-literature

Publications

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Kavak P, Lin YY, Numanagic I, Asghari H, Güngör T, Alkan C, Hach F

Bioinformatics (Oxford, England), 2017-07-01, issue 14

PMID: 28881988

Similar Datasets

Project description:

Background: Across species, diversity at the Major Histocompatibility Complex (MHC) is critical to disease resistance and population health; however, use of MHC diversity to quantify the genetic health of populations has been hampered by the extreme variation found in MHC genes. Next generation sequencing (NGS) technology generates sufficient data to genotype even the most diverse species, but workflows for distinguishing artifacts from alleles are still under development. We used NGS to evaluate the MHC diversity of over 300 captive and wild ring-tailed lemurs (Lemur catta: Primates: Mammalia). We modified a published workflow to address errors that arise from deep sequencing individuals and tested for evidence of selection at the most diverse MHC genes.

Results: In addition to evaluating the accuracy of 454 Titanium and Ion Torrent PGM for genotyping large populations at hypervariable genes, we suggested modifications to improve current methods of allele calling. Using these modifications, we genotyped 302 out of 319 individuals, obtaining an average sequencing depth of over 1000 reads per amplicon. We identified 55 MHC-DRB alleles, 51 of which were previously undescribed, and provide the first sequences of five additional MHC genes: DOA, DOB, DPA, DQA, and DRA. The additional five MHC genes had one or two alleles each with little sequence variation; however, the 55 MHC-DRB alleles showed a high dN/dS ratio and trans-species polymorphism, indicating a history of positive selection. Because each individual possessed 1-7 MHC-DRB alleles, we suggest that ring-tailed lemurs have four, putatively functional, MHC-DRB copies.

Conclusions: In the future, accurate genotyping methods for NGS data will be critical to assessing genetic variation in non-model species. We recommend that future NGS studies increase the proportion of replicated samples, both within and across platforms, particularly for hypervariable genes like the MHC. Quantifying MHC diversity within non-model species is the first step to assessing the relationship of genetic diversity at functional loci to individual fitness and population viability. Owing to MHC-DRB diversity and copy number, ring-tailed lemurs may serve as an ideal model for estimating the interaction between genetic diversity, fitness, and environment, especially regarding endangered species.
