Dataset Information

Blockwise HMM computation for large-scale population genomic inference.

ABSTRACT:

Motivation

A promising class of methods for large-scale population genomic inference uses the conditional sampling distribution (CSD), which approximates the probability of sampling an individual with a particular DNA sequence, given that a collection of sequences from the population has already been observed. The CSD has a wide range of applications, including imputing missing sequence data, estimating recombination rates, inferring human colonization history and identifying tracts of distinct ancestry in admixed populations. Most widely used CSDs are based on hidden Markov models (HMMs). Although computationally efficient in principle, methods resulting from the common implementation of the relevant HMM techniques remain intractable for large genomic datasets.
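As a rough illustration of the kind of HMM computation involved, the sketch below runs a forward recursion for a simple haplotype-copying model. It is not the CSD or the algorithm described in this record: the model (uniform initial state, a single switch probability `r`, a single mismatch probability `mu`, binary alleles) and all function names are assumptions made for the example. It compares a quadratic-time update per site with an equivalent linear-time update that relies on the switch probability being the same for every target haplotype.

```python
# Hypothetical copying HMM: the hidden state at each site is the panel
# haplotype being copied. All names and parameters here are illustrative.

def emission(ref_allele, query_allele, mu):
    """Probability of the query allele given the copied allele."""
    return 1.0 - mu if ref_allele == query_allele else mu


def forward_naive(panel, query, r, mu):
    """Forward algorithm with an explicit transition sum: O(n^2) per site."""
    n = len(panel)
    f = [emission(h[0], query[0], mu) / n for h in panel]
    for site in range(1, len(query)):
        new_f = []
        for j in range(n):
            s = 0.0
            for k in range(n):
                # Stay on the same haplotype with probability 1 - r,
                # otherwise jump to a uniformly chosen haplotype.
                t = (1.0 - r if k == j else 0.0) + r / n
                s += f[k] * t
            new_f.append(emission(panel[j][site], query[site], mu) * s)
        f = new_f
    return sum(f)


def forward_linear(panel, query, r, mu):
    """Same recursion in O(n) per site by reusing the sum of forward values."""
    n = len(panel)
    f = [emission(h[0], query[0], mu) / n for h in panel]
    for site in range(1, len(query)):
        total = sum(f)  # shared by every target state
        f = [
            emission(h[site], query[site], mu) * ((1.0 - r) * f_h + r * total / n)
            for h, f_h in zip(panel, f)
        ]
    return sum(f)


if __name__ == "__main__":
    panel = ["0101", "0111", "1100"]  # toy reference haplotypes
    query = "0110"
    print(forward_naive(panel, query, r=0.1, mu=0.01))
    print(forward_linear(panel, query, r=0.1, mu=0.01))
```

Both functions return the same probability up to floating-point rounding. The sketch omits the rescaling or log-space arithmetic that sequences of realistic length would need to avoid numerical underflow.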

Results

To address this issue, a set of algorithmic improvements for performing the exact HMM computation is introduced here, by exploiting the particular structure of the CSD and typical characteristics of genomic data. It is empirically demonstrated that these improvements result in a speedup of several orders of magnitude for large datasets and that the speedup continues to increase with the number of sequences. The optimized algorithms can be adopted in methods for various applications, including the ones mentioned above, and make previously impracticable analyses possible.

Availability

Software available upon request.

Supplementary information

Supplementary data are available at Bioinformatics online.

Contact

yss@eecs.berkeley.edu.

SUBMITTER: Paul JS

PROVIDER: S-EPMC3400961 | biostudies-literature | 2012 Aug

REPOSITORIES: biostudies-literature

Publications

Blockwise HMM computation for large-scale population genomic inference.

Paul JS, Song YS

Bioinformatics (Oxford, England), 2012-05-28, issue 15

PMID: 22641715

Similar Datasets

Statistical inference with large-scale trait imputation. | S-EPMC10848238 | biostudies-literature

Smoothed Quantile Regression with Large-Scale Inference. | S-EPMC9912996 | biostudies-literature

Evolutionary Insights from a Large-scale Survey of Population-genomic Variation. | S-EPMC10187179 | biostudies-literature

Evolutionary Insights from a Large-Scale Survey of Population-Genomic Variation. | S-EPMC10630549 | biostudies-literature

Synaptic computation underlying probabilistic inference. | S-EPMC2921378 | biostudies-literature

Efficient gene orthology inference via large-scale rearrangements. | S-EPMC10540461 | biostudies-literature

Powerful large scale inference in high dimensional mediation analysis. | S-EPMC12829953 | biostudies-literature

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. | S-EPMC4315300 | biostudies-literature

Population genomic inference of recombination rates and hotspots. | S-EPMC2669376 | biostudies-literature

Computation and resource efficient genome-wide association analysis for large-scale imaging studies. | S-EPMC12642728 | biostudies-literature