Dataset Information

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

ABSTRACT:

Motivation

Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.
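To make the problem statement concrete, here is a minimal single-threaded sketch of k-mer counting in Python. It is not how Jellyfish works; the function name `count_kmers` and the example sequence are illustrative assumptions, and the dictionary-based approach shown is the kind of baseline whose time and memory cost becomes a problem at sequencing scale.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring of `sequence` (naive, single-threaded).

    Hypothetical helper for illustration only; not part of Jellyfish.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A string of length n contains n - k + 1 k-mers (none if n < k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Each k-mer is stored here as a separate Python string, so memory grows quickly with the number of distinct k-mers.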

Results

We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability

The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

SUBMITTER: Marcais G

PROVIDER: S-EPMC3051319 | biostudies-literature | 2011 Mar

REPOSITORIES: biostudies-literature

Publications

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Marçais G, Kingsford C

Bioinformatics (Oxford, England), 2011 Jan 7, issue 6

PMID: 21217122

Similar Datasets

Project description: Background: Expression and purification of correctly folded proteins typically require screening of different parameters such as protein variants, solubility-enhancing tags or expression hosts. Parallel vector series that cover all variations are available, but not without compromise. We have established a fast, efficient and absolutely background-free cloning approach that can be applied to any selected vector. Results: Here we describe a method to tailor selected expression vectors for parallel Sequence and Ligation Independent Cloning (SLIC). SLIC cloning enables precise and sequence-independent engineering and is based on joining vector and insert with 15-25 bp homologies on both DNA ends by homologous recombination. We modified expression vectors based on pET, pFastBac and pTT backbones for parallel PCR-based cloning and screening in E. coli, insect cells and HEK293E cells, respectively. We introduced the toxic ccdB gene under control of a strong constitutive promoter for counterselection of insert-less vector. In contrast to DpnI treatment commonly used to reduce vector background, ccdB used in our vector series is 100% efficient in killing cells carrying the parental vector and reduces vector background to zero. In addition, the 3' end of ccdB functions as a primer binding site common to all vectors. The second shared primer binding site is provided by an HRV 3C protease cleavage site located downstream of purification and solubility-enhancing tags for tag removal. We have so far generated more than 30 different parallel expression vectors, and successfully cloned and expressed more than 250 genes with this vector series. There is no size restriction for gene insertion, and cloning efficiency is > 95% with clone numbers up to 200. The procedure is simple, fast, efficient and cost-effective. All expression vectors showed efficient expression of eGFP and of the different target proteins requested for production and purification at our Core Facility services. Conclusion: This new expression vector series allows efficient and cost-effective parallel cloning and thus screening of different protein constructs, tags and expression hosts.

Project description: Thousands of long intergenic noncoding RNAs (lincRNAs) are encoded by the mammalian genome and have been reported to have multiple biological functions as transcriptional activators acting in cis [1] or trans [2], transcriptional repressors [3,4] or miRNA decoys [5,6]. However, the function of most lincRNAs has not yet been identified in vivo. Here, we demonstrate a role for linc-MYH, a novel long intergenic noncoding RNA, in adult fast-type myofibre specialization. Skeletal myofibre fast and slow phenotypes are established through differential expression of numerous fibre-specific genes [7]. We show that linc-MYH and the fast MYH genes share a common enhancer located in the fast MYH gene locus and regulated by the Six1 homeoproteins. Muscle-specific Six1 mutant mice show increased expression of slow-type genes, and downregulation of linc-MYH and fast-type genes. The function of linc-MYH, revealed by in vivo knockdown and wide transcriptomic analysis, is ultimately to prevent expression of genes ensuring slow muscle contractile properties, and to increase fast-type muscle gene expression in fast-type myofibres. Thus, formation of efficient fast sarcomeric units and appropriate Ca++ cycling and excitation/contraction/relaxation coupling in fast myofibres is achieved through the coordinate control of fast MYH and linc-MYH expression by a Six-bound enhancer. Ten μg of shRNA-expressing vector were introduced into TA muscles of 8-week-old mice by electroporation. Two weeks following electroporation, TA myofibres expressing GFP were dissected under a Nikon SMZ1500 stereo microscope and frozen in liquid nitrogen before processing. The efficiency of each shRNA was established by determination of linc-MYH transcript levels in TA muscles transfected by each shlincMYH. The shRNA against the 5'-TTCTGCTCACCACCTACAATT-3' sequence was selected for the knockdown experiment. 
After validation of RNA quality with the Bioanalyzer 2100 (using Agilent RNA6000 nano chip kit), 50 ng of total RNA were reverse transcribed following the Ovation PicoSL WTA System (Nugen). Briefly, the resulting double-strand cDNA was used for amplification based on SPIA technology. After purification according to Nugen protocol, 5 μg of single strand DNA was used for generation of Sens Target DNA using Ovation Exon Module kit (Nugen). 2.5 μg of Sens Target DNA were fragmented and labelled with biotin using Encore Biotin Module kit (Nugen). After control of fragmentation using Bioanalyzer 2100, the cDNA was then hybridized to GeneChip® Mouse Gene 1.0 ST (Affymetrix) at 45°C for 17 hours. After overnight hybridization, the chips were washed using the fluidic station FS450 following specific protocols (Affymetrix) and scanned using the GCS3000 7G. The scanned images were then analyzed with Expression Console software (Affymetrix) to obtain raw data (cel files) and metrics for Quality Controls. The analysis of some of these metrics and the study of the distribution of raw data show no outlier experiment. Gastrocnemius muscles were collected from cSix1 KO and control mice. Total RNAs were extracted by Trizol Reagent (Invitrogen) according to manufacturer's instructions. After validation of RNA quality with the Bioanalyzer 2100 (using Agilent RNA6000 nano chip kit), 50 ng of total RNA were reverse transcribed following the Ovation PicoSL WTA System (Nugen). Briefly, the resulting double-strand cDNA was used for amplification based on SPIA technology. After purification according to Nugen protocol, 5 μg of single strand DNA was used for generation of Sens Target DNA using Ovation Exon Module kit (Nugen). 2.5 μg of Sens Target DNA were fragmented and labelled with biotin using Encore Biotin Module kit (Nugen). After control of fragmentation using Bioanalyzer 2100, the cDNA was then hybridized to GeneChip® Mouse Gene 1.0 ST (Affymetrix) at 45°C for 17 hours. After overnight hybridization, the chips were washed using the fluidic station FS450 following specific protocols (Affymetrix) and scanned using the GCS3000 7G. The scanned images were then analyzed with Expression Console software (Affymetrix) to obtain raw data (cel files) and metrics for Quality Controls. The analysis of some of these metrics and the study of the distribution of raw data show no outlier experiment.
