Dataset Information

A model-based approach to selection of tag SNPs.

ABSTRACT:

Background

Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides the machinery to predict tagged SNPs, so that tag sets can be assessed through their ability to predict larger SNP sets.
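The entropy-maximization idea can be illustrated with a minimal sketch. It assumes phased, biallelic haplotypes coded as 0/1 tuples and uses the empirical joint distribution with a greedy search. This is a simplification for illustration, not the model-based method of the paper, and the names `joint_entropy` and `greedy_tag_snps` are hypothetical.

```python
from collections import Counter
from math import log2

def joint_entropy(haplotypes, columns):
    """Empirical Shannon entropy (bits) of the haplotypes restricted to `columns`."""
    n = len(haplotypes)
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def greedy_tag_snps(haplotypes, num_tags):
    """Greedily add the SNP that most increases the joint entropy of the tag set."""
    num_snps = len(haplotypes[0])
    tags = []
    for _ in range(min(num_tags, num_snps)):
        remaining = [c for c in range(num_snps) if c not in tags]
        # Ties are broken in favour of the lowest SNP index.
        best = max(remaining, key=lambda c: joint_entropy(haplotypes, tags + [c]))
        tags.append(best)
    return tags

# Toy data (assumed): SNPs 0 and 1 are identical, SNP 3 is their complement.
haplotypes = [
    (0, 0, 0, 1),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (1, 1, 1, 0),
]
tags = greedy_tag_snps(haplotypes, 2)
```

On this toy data the two selected SNPs carry the same entropy as all four SNPs together, because the remaining SNPs are fully determined by the tags. Greedy search is a heuristic and does not guarantee the entropy-optimal set in general.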

Results

Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.
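As a rough illustration of a description code-length, the sketch below computes the number of bits, minus the base-2 logarithm of the model probability, needed to encode a set of haplotypes under a site-independent model. The independent model, the Laplace smoothing, and the function names are assumptions made for this example; the cost of encoding the model parameters is ignored, and the Li and Stephens hidden Markov model is not implemented here.

```python
from math import log2

def fit_independent_sites(haplotypes):
    """Return Laplace-smoothed allele-1 frequencies, one per SNP."""
    n = len(haplotypes)
    return [(sum(h[j] for h in haplotypes) + 1) / (n + 2)
            for j in range(len(haplotypes[0]))]

def code_length_bits(haplotypes, freqs):
    """Bits needed to encode the haplotypes: -sum log2 P(h) under the model."""
    total = 0.0
    for h in haplotypes:
        for allele, p1 in zip(h, freqs):
            total -= log2(p1 if allele == 1 else 1.0 - p1)
    return total

# Toy data (assumed): four haplotypes over four biallelic SNPs.
haplotypes = [
    (0, 0, 0, 1),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (1, 1, 1, 0),
]
freqs = fit_independent_sites(haplotypes)
bits = code_length_bits(haplotypes, freqs)
```

Here the independent model spends one bit per allele even though SNPs 0, 1 and 3 are perfectly correlated. A model that captures this LD could encode the same data in fewer bits, which is the sense in which a shorter code-length indicates a better model.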

Conclusion

Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is a more sensitive measure of the quality of a tag set than the correct prediction rate of tagged SNPs. We also show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software that implements our approach is available.

SUBMITTER: Nicolas P

PROVIDER: S-EPMC1525207 | biostudies-literature | 2006 Jun

REPOSITORIES: biostudies-literature


Publications

A model-based approach to selection of tag SNPs.

Nicolas P, Sun F, Li LM

BMC Bioinformatics, 2006 Jun 15


PMID: 16776821

Similar Datasets

Project description: Background: Studies on genome-wide associations help to determine the cause of many genetic diseases. Genome-wide associations typically focus on associations between single-nucleotide polymorphisms (SNPs). Genotyping every SNP in a chromosomal region to identify genetic variation is computationally very expensive. A representative subset of SNPs, called tag SNPs, can be used to identify genetic variation. Small tag SNP sets save computation time on the genotyping platform; however, there could be missing data or genotyping errors in small tag SNP sets. This study aims to solve the tag SNP selection problem using many-objective evolutionary algorithms. Methods: Tag SNP selection can be viewed as an optimization problem with trade-offs between objectives, e.g. minimizing the number of tag SNPs and maximizing tolerance for missing data. In this study, the tag SNP selection problem is formulated as a many-objective problem. Nondominated Sorting based Genetic Algorithm (NSGA-III) and Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D), which are many-objective evolutionary algorithms, have been applied and investigated for optimal tag SNP selection. This study also investigates different initialization methods, such as greedy and random initialization. Results: The evaluation measures used for comparing results across algorithms are Hypervolume, Range, SumMin, MinSum, Tolerance rate, and Average Hamming distance. Overall, the MOEA/D algorithm gives superior results compared to the other algorithms in most cases. NSGA-III outperforms NSGA-II and other compared algorithms on maximum tolerance rate, and SPEA2 outperforms all algorithms on average Hamming distance. Conclusion: Experimental results show that the performance of our proposed many-objective algorithms is much superior to the results of existing methods. The outcomes show the advantages of greedy initialization over random initialization when using NSGA-III, SPEA2, and MOEA/D to solve tag SNP selection as a many-objective optimization problem.

Project description: Lepidium campestre has been targeted for domestication as a future oilseed and catch crop. Three hundred eighty plants comprising genotypes of L. campestre, Lepidium heterophyllum, and their interspecific F2 mapping population were genotyped using genotyping by sequencing (GBS), and the generated polymorphic markers were used for the construction of a high-density genetic linkage map. TASSEL-GBS, a reference genome-based pipeline, was used for this analysis using a draft L. campestre whole genome sequence. The analysis resulted in 120,438 biallelic single-nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) above 0.01. The construction of the genetic linkage map was conducted using MSTMap based on phased SNPs segregating in a 1:2:1 ratio for the F2 individuals, followed by genetic mapping of segregating contig tag haplotypes as dominant markers against the linkage map. The final linkage map consisted of eight linkage groups (LGs) containing 2,330 SNP markers and spanned 881 Kosambi cM. Contigs (10,302) were genetically mapped to the eight LGs, which were assembled into pseudomolecules that covered a total of ∼120.6 Mbp. The final size of the pseudomolecules ranged from 9.4 Mbp (LG-4) to 20.4 Mbp (LG-7). The following major correspondence between the eight Lepidium LGs (LG-1 to LG-8) and the five Arabidopsis thaliana (At) chromosomes (Atx-1-Atx-5) was revealed through comparative genomics analysis: LG-1&2_Atx-1, LG-3_Atx-2&3, LG-4_Atx-2, LG-5_Atx-2&Atx-3, LG-6_Atx-4&5, LG-7_Atx-4, and LG-8_Atx-5. This analysis revealed that at least 66% of the sequences of the LGs showed high collinearity with At chromosomes. The sequence identity between the corresponding regions of the LGs and At chromosomes ranged from 80.6% (LG-6) to 86.4% (LG-8) with an overall mean of 82.9%. The map positions on Lepidium LGs of the homologs of 24 genes that regulate various traits in A. thaliana were also identified. 
The eight LGs revealed in this study confirm the previously reported (1) haploid chromosome number of eight in L. campestre and L. heterophyllum and (2) chromosomal fusion, translocation, and inversion events during the evolution of n = 8 karyotype in ancestral species shared by Lepidium and Arabidopsis to n = 5 karyotype in A. thaliana. This study generated highly useful genomic tools and resources for Lepidium that can be used to accelerate its domestication.

