Dataset Information

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

ABSTRACT: K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
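
To make the abstract's description concrete, the following is a minimal sketch of counting k-mers with a Count-Min Sketch. It is not khmer's implementation or API: the class name, the `kmers` helper, the table sizes, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration. It shows the two properties the abstract names: counts can be updated and queried online, and a reported count can exceed the true count (through hash collisions) but never falls below it. Only counters are stored, not the k-mers themselves.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical, not khmer's code)."""

    def __init__(self, width=1024, depth=4):
        # depth independent counter rows, each with width counters.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _positions(self, kmer):
        # One salted hash per row; the row number serves as the salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every row.
        for row, col in self._positions(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The minimum over rows is the estimate. Collisions can only
        # inflate counters, so the estimate never undercounts.
        return min(self.tables[row][col] for row, col in self._positions(kmer))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGT", 4):
    sketch.add(kmer)

# "ACGT" occurs twice in the read, so the estimate is at least 2.
print(sketch.count("ACGT"))
```

This sketch treats each k-mer as a plain string and ignores reverse complements; a smaller `width` saves memory at the cost of more collisions and therefore larger overcounts.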

SUBMITTER: Zhang Q

PROVIDER: S-EPMC4111482 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature


Publications

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown

PLoS ONE, 2014-07-25, issue 7


PMID: 25062443

Similar Datasets

Project description: Joint alignment and secondary structure prediction of two RNA sequences can significantly improve the accuracy of the structural predictions. Methods addressing this problem, however, are forced to employ constraints that reduce computation by restricting the alignments and/or structures (i.e. folds) that are permissible. In this paper, a new methodology is presented for the purpose of establishing alignment constraints based on nucleotide alignment and insertion posterior probabilities. Using a hidden Markov model, posterior probabilities of alignment and insertion are computed for all possible pairings of nucleotide positions from the two sequences. These alignment and insertion posterior probabilities are additively combined to obtain probabilities of co-incidence for nucleotide position pairs. A suitable alignment constraint is obtained by thresholding the co-incidence probabilities. The constraint is integrated with Dynalign, a free energy minimization algorithm for joint alignment and secondary structure prediction. The resulting method is benchmarked against the previous version of Dynalign and against other programs for pairwise RNA structure prediction. The proposed technique eliminates manual parameter selection in Dynalign and provides significant computational time savings in comparison to prior constraints in Dynalign while simultaneously providing a small improvement in the structural prediction accuracy. Savings are also realized in memory. In experiments over a 5S RNA dataset with average sequence length of approximately 120 nucleotides, the method reduces computation by a factor of 2. The method performs favorably in comparison to other programs for pairwise RNA structure prediction, yielding better accuracy, on average, and requiring significantly fewer computational resources. Probabilistic analysis can be utilized in order to automate the determination of alignment constraints for pairwise RNA structure prediction methods in a principled fashion. These constraints can reduce the computational and memory requirements of these methods while maintaining or improving their accuracy of structural prediction. This extends the practical reach of these methods to longer sequences. The revised Dynalign code is freely available for download.

Project description: Additive and multiplicative regression models of habituation were compared regarding the fit to looking times from a habituation experiment with infants aged between 3 and 11 months. In contrast to earlier studies, the current study considered multiple probability distributions, namely the Weibull, gamma, lognormal, and normal distributions. In the habituation experiment the type of contrast between the habituation and the test trial was varied (luminance, color, or orientation contrast), crossed with the number of habituation trials (1, 3, 5, or 7 habituation trials) and crossed with three age cohorts (4, 7, 10 months). The initial mean looking time (LT) to dark stimuli (around 3.7 s) was considerably shorter than the mean LT to green and gray stimuli (around 5 s). Infants showed the strongest dishabituation to changes from dark to bright (luminance contrast) and weak-to-no dishabituation to a 90-degree rotation of the gray stimuli (orientation contrast). The dishabituation was stronger after five and seven habituation trials, but the result was not statistically robust. The gamma distribution showed the best fit in terms of log-likelihood and mean absolute error and the best predictive performance. Furthermore, the gamma distribution showed small correlations between parameters relative to other models. The normal additive model showed an inferior fit and medium correlations between the parameters. In particular, the positive correlation between the initial LT and the habituation rate was likely responsible for a different interpretation relative to the multiplicative models of the main effect of age on the habituation rate. Otherwise, the additive and multiplicative models provided similar statistical conclusions. The performance of the model versions without pooling and with partial pooling across participants (also called random-effects, multi-level, or hierarchical models) was compared. The latter type of model showed worse data fit but more precise predictions and reduced correlations between the parameters. The performance of model variants with auto-regressive time structures was explored but showed considerably worse fit. The performance of quadratic models that allowed non-monotonic changes in LTs was investigated as well. However, when fitted with LT data, these models did not produce non-monotonic change in LTs. The study underscores the utility of partial-pooling models in terms of providing more accurate predictions. Further, it agrees with previous research in that a multiplicative LT model is preferable. Nevertheless, the current results suggest that the impact of the choice of an additive model on the statistical inference is less dramatic than previously assumed.
