Dataset Information

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

ABSTRACT: Population stratification is one of the major sources of confounding in genetic association studies, potentially causing false-positive and false-negative results. Here, we present a novel approach for the identification of population substructure in high-density genotyping data and next-generation sequencing data. The approach exploits the co-appearances of rare genetic variants in individuals. The method can be applied to all available genetic loci and is computationally fast. Using sequencing data from the 1000 Genomes Project, the features of the approach are illustrated and compared to existing methodology (i.e. EIGENSTRAT). We examine the effects of different cutoffs for the minor allele frequency on the performance of the approach. We find that our approach works particularly well for genetic loci with very small minor allele frequencies. The results suggest that the inclusion of rare-variant data and sequencing data in our approach provides a much higher-resolution picture of population substructure than can be obtained with existing methodology. Furthermore, in simulation studies, we find scenarios where our method was able to control the type 1 error more precisely and showed higher power. Contact: dmitry.prokopenko@uni-bonn.de. Supplementary data are available at Bioinformatics online.
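
The abstract names the Jaccard index and a minor allele frequency cutoff but gives no formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes genotypes are coded as minor-allele counts (0, 1, 2) per locus; the function names, the toy data, and the cutoff of 0.3 are hypothetical choices for illustration.

```python
def rare_variant_sets(genotypes, maf_cutoff):
    """For each individual, return the set of rare loci at which they carry the minor allele.

    genotypes: list of per-individual lists of minor-allele counts (0, 1 or 2).
    A locus counts as rare when 0 < MAF <= maf_cutoff (assumed definition).
    """
    n_individuals = len(genotypes)
    n_loci = len(genotypes[0])
    rare_loci = []
    for j in range(n_loci):
        maf = sum(g[j] for g in genotypes) / (2 * n_individuals)
        if 0 < maf <= maf_cutoff:
            rare_loci.append(j)
    return [{j for j in rare_loci if g[j] > 0} for g in genotypes]


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_matrix(genotypes, maf_cutoff):
    """Symmetric matrix of pairwise Jaccard similarities between individuals."""
    sets = rare_variant_sets(genotypes, maf_cutoff)
    n = len(sets)
    sim = [[1.0] * n for _ in range(n)]  # diagonal set to 1.0 by convention
    for i in range(n):
        for k in range(i + 1, n):
            sim[i][k] = sim[k][i] = jaccard(sets[i], sets[k])
    return sim


# Hypothetical toy data: 4 individuals, 6 loci.
toy = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
sim = jaccard_matrix(toy, maf_cutoff=0.3)
for row in sim:
    print(row)
```

In this toy example the first two individuals share one rare variant and the last two share another, so the matrix separates them into two groups, while the common locus (index 4) and the monomorphic locus (index 5) are ignored.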

SUBMITTER: Prokopenko D

PROVIDER: S-EPMC5860507 | biostudies-other | 2016 May

REPOSITORIES: biostudies-other

Publications

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

Dmitry Prokopenko, Julian Hecker, Edwin K Silverman, Marcello Pagano, Markus M Nöthen, Christian Dina, Christoph Lange, Heide Loehlein Fier

Bioinformatics (Oxford, England), 2015-12-31, issue 9

PMID: 26722118

Similar Datasets

Population Stratification and Underrepresentation of Indian Subcontinent Genetic Diversity in the 1000 Genomes Project Dataset. | S-EPMC5203783 | biostudies-literature

Copy Number Variation detection from 1000 Genomes Project exon capture sequencing data. | S-EPMC3563612 | biostudies-literature

Mycoplasma contamination in the 1000 Genomes Project. | S-EPMC4022254 | biostudies-literature

Evaluation of MC1R high-throughput nucleotide sequencing data generated by the 1000 Genomes Project. | S-EPMC5488459 | biostudies-literature

Low frequency variants, collapsed based on biological knowledge, uncover complexity of population stratification in 1000 genomes project data. | S-EPMC3873241 | biostudies-literature

High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. | S-EPMC9439720 | biostudies-literature

The 1000 Genomes Project: data management and community access. | S-EPMC3340611 | biostudies-literature

A pharmacogene database enhanced by the 1000 Genomes Project. | S-EPMC2935084 | biostudies-literature