Dataset Information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching.

ABSTRACT: With the advent of Next-Generation (NG) sequencing, it has become possible to sequence a entire genomes quickly and inexpensively. However, in some experiments one only needs to extract and assembly a portion of the sequence reads, for example when performing transcriptome studies, sequencing mitochondrial genomes, or characterizing exomes. With the raw DNA-library of a complete genome it would appear to be a trivial problem to identify reads of interest. But it is not always easy to incorporate well-known tools such as BLAST, BLAT, Bowtie, and SOAP directly into a bioinformatics pipelines before the assembly stage, either due to incompatibility with the assembler's file inputs, or because it is desirable to incorporate information that must be extracted separately. For example, in order to incorporate flowgrams from a Roche 454 sequencer into the Newbler assembler it is necessary to first extract them from the original SFF files. We present SlopMap, a bioinformatics software utility that allows quickly identification similar to the provided reference reads from either Roche 454 or Illumnia DNA library. With simple and intuitive command-line interface along with file output formats compatible to assembly programs, SlopMap can be directly embedded to biological data processing pipeline without any additional programming work. In addition, SlopMap preserves flowgram information needed for Roche 454 assembler.

SUBMITTER: Zhbannikov IY

PROVIDER: S-EPMC3882202 | biostudies-literature | 2013 Aug

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching.

Zhbannikov Ilya Y IY Hunter Samuel S SS Settles Matthew L ML Foster James A JA

Journal of data mining in genomics & proteomics 20130801 3

With the advent of Next-Generation (NG) sequencing, it has become possible to sequence a entire genomes quickly and inexpensively. However, in some experiments one only needs to extract and assembly a portion of the sequence reads, for example when performing transcriptome studies, sequencing mitochondrial genomes, or characterizing exomes. With the raw DNA-library of a complete genome it would appear to be a trivial problem to identify reads of interest. But it is not always easy to incorporate ...[more]

PMID: 24404406

Similar Datasets

Project description:This article presents efficient quantum solutions for exact multiple pattern matching to process the biological sequences. The classical solution takes Ο(mN) time for matching m patterns over N sized text database. The quantum search mechanism is a core for pattern matching, as this reduces time complexity and achieves computational speedup. Few quantum methods are available for multiple pattern matching, which executes search oracle for each pattern in successive iterations. Such solutions are likely acceptable because of classical equivalent quantum designs. However, these methods are constrained with the inclusion of multiplicative factor m in their complexities. An optimal quantum design is to execute multiple search oracle in parallel on the quantum processing unit with a single-core that completely removes the multiplicative factor m, however, this method is impractical to design. We have no effective quantum solutions to process multiple patterns at present. Therefore, we propose quantum algorithms using quantum processing unit with C quantum cores working on shared quantum memory. This quantum parallel design would be effective for searching all t exact occurrences of each pattern. To our knowledge, no attempts have been made to design multiple pattern matching algorithms on quantum multicore processor. Thus, some quantum remarkable exact single pattern matching algorithms are enhanced here with their equivalent versions, namely enhanced quantum memory processing based exact algorithm and enhanced quantum-based combined exact algorithm for multiple pattern matching. Our quantum solutions find all t exact occurrences of each pattern inside the biological sequence in O((m/C)N) and O((m/C)t) time complexities. This article shows the hybrid simulation of quantum algorithms to validate quantum solutions. Our theoretical-experimental results justify the significant improvements that these algorithms outperform over the existing classical solutions and are proven effective in quantum counterparts.

Project description:BackgroundIdentifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses.MethodsWe have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder's performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014.ResultsVirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively), thus VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder's potential advantage in identifying novel viral sequences. Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients reveals higher viral diversity in healthy individuals than cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a putative Veillonella genus prophage associated with cirrhosis patients.ConclusionsThis innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.

Project description:The string-matching paradigm is applied in every computer science and science branch in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution could be determined, and for maximum efficiency, the goal is to minimize resource usage. Guided by the fact that standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions. Without touching the problematics of computer power consumption or memory, which also depends on the algorithm type and the techniques used in algorithm development, we have developed a methodology which enables the researchers to choose an efficient algorithm for a specific domain. String searching algorithms efficiency is usually observed independently from the domain texts being searched. This research paper aims to present the idea that algorithm efficiency depends on the properties of searched string and properties of the texts being searched, accompanied by the theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through character comparison count metrics. The character comparison count metrics is a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and provides for a specific domain the ranking of algorithms according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems based on information entropy for a search pattern. Meticulous empirical testing depicts the methodology implementation and purports soundness of the methodology.

Project description:BackgroundCarbohydrates play a critical role in human diseases and their potential utility as biomarkers for pathological conditions is a major driver for characterization of the glycome. However, the additional complexity of glycans compared to proteins and nucleic acids has slowed the advancement of glycomics in comparison to genomics and proteomics. The branched nature of carbohydrates, the great diversity of their constituents and the numerous alternative symbolic notations, make the input and display of glycans not as straightforward as for example the amino-acid sequence of a protein. Every glycoinformatic tool providing a user interface would benefit from a fast, intuitive, appealing mechanism for input and output of glycan structures in a computer readable format.ResultsA software tool for building and displaying glycan structures using a chosen symbolic notation is described here. The "GlycanBuilder" uses an automatic rendering algorithm to draw the saccharide symbols and to place them on the drawing board. The information about the symbolic notation is derived from a configurable graphical model as a set of rules governing the aspect and placement of residues and linkages. The algorithm is able to represent a structure using only few traversals of the tree and is inherently fast. The tool uses an XML format for import and export of encoded structures.ConclusionThe rendering algorithm described here is able to produce high-quality representations of glycan structures in a chosen symbolic notation. The automated rendering process enables the "GlycanBuilder" to be used both as a user-independent component for displaying glycans and as an easy-to-use drawing tool. The "GlycanBuilder" can be integrated in web pages as a Java applet for the visual editing of glycans. The same component is available as a web service to render an encoded structure into a graphical format. Finally, the "GlycanBuilder" can be integrated into other applications to create intuitive and appealing user interfaces: an example is the "GlycoWorkbench", a software tool for assisted annotation of glycan mass spectra. The "GlycanBuilder" represent a flexible, reliable and efficient solution to the problem of input and output of glycan structures in any glycomic tool or database.

Project description:BackgroundThe identification of patterns in biological sequences is a key challenge in genome analysis and in proteomics. Frequently such patterns are complex and highly variable, especially in protein sequences. They are frequently described using terms of regular expressions (RegEx) because of the user-friendly terminology. Limitations arise for queries with the increasing complexity of patterns and are accompanied by requirements for enhanced capabilities. This is especially true for patterns containing ambiguous characters and positions and/or length ambiguities.ResultsWe have implemented the 3of5 web application in order to enable complex pattern matching in protein sequences. 3of5 is named after a special use of its main feature, the novel n-of-m pattern type. This feature allows for an extensive specification of variable patterns where the individual elements may vary in their position, order, and content within a defined stretch of sequence. The number of distinct elements can be constrained by operators, and individual characters may be excluded. The n-of-m pattern type can be combined with common regular expression terms and thus also allows for a comprehensive description of complex patterns. 3of5 increases the fidelity of pattern matching and finds ALL possible solutions in protein sequences in cases of length-ambiguous patterns instead of simply reporting the longest or shortest hits. Grouping and combined search for patterns provides a hierarchical arrangement of larger patterns sets. The algorithm is implemented as internet application and freely accessible. The application is available at http://dkfz.de/mga2/3of5/3of5.html.ConclusionThe 3of5 application offers an extended vocabulary for the definition of search patterns and thus allows the user to comprehensively specify and identify peptide patterns with variable elements. The n-of-m pattern type offers an improved accuracy for pattern matching in combination with the ability to find all solutions, without compromising the user friendliness of regular expression terms.

Dataset Information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching.

Publications

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets