
Dataset Information


An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data.


ABSTRACT: Next-generation sequencing is a powerful approach for discovering genetic variation. Sensitive variant calling and haplotype inference from population sequencing data remain challenging. We describe methods for high-quality discovery, genotyping, and phasing of SNPs for low-coverage (approximately 5×) sequencing of populations, implemented in a pipeline called SNPTools. Our pipeline contains several innovations that specifically address challenges caused by low-coverage population sequencing: (1) effective base depth (EBD), a nonparametric statistic that enables more accurate statistical modeling of sequencing data; (2) variance ratio scoring, a variance-based statistic that discovers polymorphic loci with high sensitivity and specificity; and (3) BAM-specific binomial mixture modeling (BBMM), a clustering algorithm that generates robust genotype likelihoods from heterogeneous sequencing data. Last, we develop an imputation engine that refines raw genotype likelihoods to produce high-quality phased genotypes/haplotypes. Designed for large population studies, SNPTools' input/output (I/O) and storage aware design leads to improved computing performance on large sequencing data sets. We apply SNPTools to the International 1000 Genomes Project (1000G) Phase 1 low-coverage data set and obtain genotyping accuracy comparable to that of SNP microarray.
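To make the notion of a genotype likelihood concrete, the following is a minimal sketch of a plain binomial read-count model at a single biallelic site. It is not the BBMM algorithm from the abstract: BBMM fits a mixture per BAM file, whereas this sketch assumes fixed expected alternate-allele fractions. The function names, the `error_rate` value, and the flat prior used in `normalize` are all illustrative assumptions.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Return P(read counts | genotype) for genotypes RR, RA and AA.

    Assumption: each read independently supports the alternate allele with
    a fixed probability that depends only on the genotype. error_rate is a
    hypothetical per-read error probability, not a value from the paper.
    """
    n = ref_reads + alt_reads
    # Expected fraction of alt-supporting reads under each genotype.
    alt_fraction = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    return {
        genotype: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for genotype, p in alt_fraction.items()
    }


def normalize(likelihoods):
    """Rescale likelihoods to sum to 1 (posterior under a flat prior)."""
    total = sum(likelihoods.values())
    return {genotype: value / total for genotype, value in likelihoods.items()}


if __name__ == "__main__":
    # Roughly 5x coverage: 3 reference reads and 2 alternate reads.
    print(normalize(genotype_likelihoods(3, 2)))
```

With only a handful of reads per sample, likelihoods of this kind are often far from decisive on their own, which is the gap that population-level refinement such as the imputation step described above is meant to close.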

SUBMITTER: Wang Y 

PROVIDER: S-EPMC3638139 | biostudies-literature | 2013 May

REPOSITORIES: biostudies-literature


Publications

An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data.

Yi Wang, James Lu, Jin Yu, Richard A. Gibbs, Fuli Yu

Genome Research, 2013-01-07, issue 5



Similar Datasets

| S-EPMC3266881 | biostudies-literature
| S-EPMC6882857 | biostudies-literature
| S-EPMC6461034 | biostudies-literature
| S-EPMC385088 | biostudies-other
| S-EPMC4456389 | biostudies-literature
| S-EPMC2265661 | biostudies-other
| S-EPMC6698887 | biostudies-literature
| S-EPMC6756534 | biostudies-literature
| S-EPMC2835450 | biostudies-literature
| S-EPMC2647951 | biostudies-literature