Dataset Information

Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data.

ABSTRACT: Inference of haplotypes is important for many genetic approaches, including the process of assigning a phenotype to a genetic region. Usually, the population frequencies of haplotypes, as well as the diplotype configuration of each subject, are estimated from a set of genotypes of the subjects in a sample from the population. We have developed an algorithm to infer haplotype frequencies and the combination of haplotype copies in each pool by using pooled DNA data. The input data are the genotypes in pooled DNA samples, each of which contains the quantitative genotype data from one to six subjects. The algorithm infers by the maximum-likelihood method both frequencies of the haplotypes in the population and the combination of haplotype copies in each pool by an expectation-maximization algorithm. The algorithm was implemented in the computer program LDPooled. We also used the bootstrap method to calculate the standard errors of the estimated haplotype frequencies. Using this program, we analyzed the published genotype data for the SAA (n=156), MTHFR (n=80), and NAT2 (n=116) genes, as well as the smoothelin gene (n=102). Our study has shown that the frequencies of major (frequency >0.1 in a population) haplotypes can be inferred rather accurately from the pooled DNA data by the maximum-likelihood method, although with some limitations. The estimated D and D' values had large variations except when the |D| values were >0.1. The estimated linkage-disequilibrium measure ρ² for 36 linked loci of the smoothelin gene when one- and two-subject pool protocols were used suggested that the gross pattern of the distribution of the measure can be reproduced using the two-subject pool data.
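
The abstract reports D and D' values derived from estimated haplotype frequencies. As a rough illustration only, the sketch below computes the textbook two-locus measures D, D' and r² from one haplotype frequency and two allele frequencies. The function name and arguments are hypothetical and are not taken from LDPooled, and it is an assumption that the rho-squared measure named in the abstract corresponds to the standard r² computed here.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    These names are illustrative assumptions, not LDPooled's interface.
    """
    # Raw disequilibrium coefficient: observed minus expected under independence.
    d = p_ab - p_a * p_b

    # Largest magnitude D can take given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    # Normalised measure D'; defined as 0 here when a locus is monomorphic.
    d_prime = d / d_max if d_max > 0 else 0.0

    # Squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0

    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies chosen only to exercise the function.
    print(ld_measures(0.4, 0.5, 0.5))
```

Because D is bounded by the allele frequencies, D' rescales it to the range from -1 to 1, while r² also depends on how closely the allele frequencies at the two loci match.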

SUBMITTER: Ito T

PROVIDER: S-EPMC379231 | biostudies-other | 2003 Feb

REPOSITORIES: biostudies-other

Similar Datasets

Project description: Background: In addition to heterogeneity and artificial selection, natural selection is one of the forces used to combat climate change and improve agrobiodiversity in evolutionary plant breeding. Accurate identification of the specific genomic effects of natural selection will likely accelerate transfer between populations. Thus, insights into changes in allele frequency, adequate population size, gene flow and drift are essential. However, observing such effects often involves a trade-off between costs and resolution when a large sample of genotypes for many loci is analysed. Pool genotyping approaches achieve high resolution and precision in estimating allele frequency when sequence coverage is high. Nevertheless, high-coverage pool sequencing of large genomes is expensive. Results: Three pool samples (n = 300, 300, 288) from a barley backcross population were generated to assess the population's allele frequency. The tested population (BC2F21) has undergone 18 generations of natural adaptation to conventional farming practice. The accuracies of estimated pool-based allele frequencies and genome coverage yields were compared using three next-generation sequencing genotyping methods. To achieve accurate allele frequency estimates with low sequence coverage, we employed a haplotyping approach. Low coverage allele frequencies of closely located single polymorphisms were aggregated into a single haplotype allele frequency, yielding 2 to 271 times higher depth and increased precision. When we combined different haplotyping tactics, we found that gene and chip marker-based haplotype analyses performed equivalently or better compared with simple contig haplotype windows. Comparing multiple pool samples and referencing against an individual sequencing approach revealed that whole-genome pool re-sequencing (WGS) achieved the highest correlation with individual genotyping (≥ 0.97). In contrast, transcriptome-based genotyping (MACE) and genotyping by sequencing (GBS) pool replicates were significantly associated with higher error rates and lower correlations, but are still valuable to detect large allele frequency variations. Conclusions: The proposed strategy identified the allele frequency of populations with high accuracy at low cost. This is particularly relevant to evolutionary plant breeding of crops with very large genomes, such as barley. Whole-genome low coverage re-sequencing at 0.03× coverage per genotype accurately estimated the allele frequency when a loci-based haplotyping approach was applied. The implementation of annotated haplotypes capitalises on the biological background and statistical robustness.
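
The description above says that low-coverage allele frequencies of neighbouring polymorphisms were aggregated into one haplotype-level frequency with higher effective depth. The sketch below shows one simple way such an aggregation could work, as a depth-weighted pooling of read counts within a window. It is an assumption for illustration, not the study's published procedure: the function name is hypothetical, and it presumes every SNP in the window has already been polarised so that the counted allele comes from the same parent.

```python
def window_allele_frequency(snp_counts):
    """Pool read counts across the SNPs of one haplotype window.

    snp_counts : list of (allele_reads, total_reads) tuples, one per SNP,
                 with the counted allele polarised to the same parental
                 origin at every SNP (an assumption of this sketch).
    Returns (frequency, pooled_depth); frequency is None if no reads exist.
    """
    allele_reads = sum(a for a, _ in snp_counts)
    pooled_depth = sum(n for _, n in snp_counts)
    if pooled_depth == 0:
        return None, 0
    # Summing counts before dividing weights each SNP by its read depth.
    return allele_reads / pooled_depth, pooled_depth


if __name__ == "__main__":
    # Hypothetical counts for three closely located SNPs at very low coverage.
    print(window_allele_frequency([(1, 2), (0, 1), (2, 3)]))
```

Pooling counts rather than averaging per-SNP frequencies keeps a SNP covered by a single read from carrying the same weight as one covered by many reads.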

Project description: Linkage disequilibrium (LD) analysis provides information on the evolutionary aspects of populations. Recently, haplotype blocks have been used to increase the power of quantitative trait loci detection in genome-wide association studies and the prediction accuracy of genomic selection. Our objectives were as follows: to compare the degree of LD, LD decay, and LD decay extent in popcorn populations; to characterize the number and length of haplotype blocks in the populations; and to determine whether maize chromosomes also have a pattern of interspaced regions of high and low rates of recombination. We used a biparental population, a synthetic, and a breeding population, genotyped for approximately 75,000 single nucleotide polymorphisms (SNPs). The sample size ranged from 190 to 192 plants. For the whole-genome LD and haplotype block analyses, we assumed a window of 500 kb. To characterize the block and step patterns of LD in the populations, we constructed LD maps by chromosome, defining a cold spot as a chromosome segment including SNPs with the same LDU position. The LD and haplotype block analyses were also performed at the intragenic level, selecting 12 genes related to zein, starch, cellulose, and fatty acid biosynthesis. The populations with the highest and lowest frequencies of |D'| values greater than 0.75 were the biparental (65-74%) and the breeding population (26-58%), respectively. There were slight differences between the populations regarding the average distance for SNPs with |D'| values greater than 0.75 (in the range of approximately 207 to 229 kb). The level of LD expressed by the r² values was low in the populations (0.02, 0.04, and 0.04, on average) but comparable to some non-isolated human populations. The frequency of r² values greater than 0.75 was lower in the biparental population (0.2-0.5%) and higher in the other populations (0.2-1.6%). The average distance for SNPs with r² values greater than 0.75 was much greater in the biparental population (approximately 80 to 126 kb). In the other populations, the ranges were approximately 6 to 19 and 6 to 35 kb. The heatmaps for the regions covered by the first 100 SNPs in each chromosome, in each population (1 to 3.3 Mb, approximately), provided evidence that the comparatively few high r² values (close to 1.0) occurred only for SNPs in close proximity, especially in the synthetic and breeding populations. Because the haplotype blocks in the populations contain few SNPs (2 to 3), no advantage is expected from a haplotype-based association study or from genomic selection across generations. The results concerning LD decay (rapid decay after 5-10 kb) and LD decay extent (along up to 300 kb) are in the range observed with maize inbred line panels. The LD maps indicate that maize chromosomes had a pattern of regions of extensive LD interspaced with regions of low LD. However, our simulated LD map provides evidence that this pattern can reflect regions with differences in allele frequencies and LD levels (expressed by |D'|) and not regions with high and low rates of recombination.
