Dataset Information

Suffix tree searcher: exploration of common substrings in large DNA sequence sets.

ABSTRACT: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow.

Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset.

Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.
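
To make the idea of searching an index "with wild cards or mismatches allowed" and traversing it for common substrings more concrete, here is a minimal Python sketch. It is not the STS implementation: STS uses disk-based suffix trees built by C modules, whereas this toy builds a naive in-memory suffix trie that is only practical for very short sequences. The function names (`build_suffix_trie`, `search`, `common_substrings`) and the use of `N` as a wild card character in the query are assumptions made for illustration.

```python
def _new_node():
    # Each node keeps its children and the (sequence id, start offset)
    # of every suffix whose path passes through it.
    return {"children": {}, "hits": []}


def build_suffix_trie(seqs):
    """Insert every suffix of every sequence (naive, quadratic in size)."""
    root = _new_node()
    for sid, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, _new_node())
                node["hits"].append((sid, start))
    return root


def search(root, pattern, max_mismatches=0):
    """Return sorted (sequence id, offset) pairs where pattern occurs.

    'N' in the pattern matches any nucleotide (hypothetical convention);
    up to max_mismatches other positions may differ.
    """
    results = set()
    stack = [(root, 0, 0)]  # (node, depth in pattern, mismatches so far)
    while stack:
        node, depth, mismatches = stack.pop()
        if depth == len(pattern):
            results.update(node["hits"])
            continue
        for ch, child in node["children"].items():
            matches = pattern[depth] == ch or pattern[depth] == "N"
            cost = 0 if matches else 1
            if mismatches + cost <= max_mismatches:
                stack.append((child, depth + 1, mismatches + cost))
    return sorted(results)


def common_substrings(root, n_seqs, length):
    """Traverse the trie and list substrings of a given length that
    occur in all n_seqs sequences."""
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) == length:
            if len({sid for sid, _ in node["hits"]}) == n_seqs:
                found.append(prefix)
            continue
        for ch, child in node["children"].items():
            stack.append((child, prefix + ch))
    return sorted(found)


if __name__ == "__main__":
    sequences = ["ACGTACGT", "TTACGA", "GACGTT"]
    trie = build_suffix_trie(sequences)
    print(search(trie, "ACG"))                      # exact search
    print(search(trie, "ACGA", max_mismatches=1))   # one mismatch allowed
    print(common_substrings(trie, len(sequences), 3))
```

Storing the hit list at every node trades memory for simplicity; a real suffix tree compresses single-child paths and keeps positions only at the leaves, which is part of what makes indexes of the scale described above feasible.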

SUBMITTER: Minkley D

PROVIDER: S-EPMC4118789 | biostudies-literature | 2014 Jul

REPOSITORIES: biostudies-literature

Publications

Suffix tree searcher: exploration of common substrings in large DNA sequence sets.

Minkley D, Whitney MJ, Lin SH, Barsky MG, Kelly C, Upton C

BMC Research Notes, 2014 Jul 23

PMID: 25053142

Similar Datasets

Sequence comparison alignment-free approach based on suffix tree and L-words frequency.
| S-EPMC3444837 | biostudies-literature

Parallel Generalized Suffix Tree Construction for Genomic Data
| S-EPMC7197101 | biostudies-literature

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.
| S-EPMC2732316 | biostudies-literature

Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree.
| S-EPMC5754272 | biostudies-literature

CellProfiler Analyst: interactive data exploration, analysis and classification of large biological image sets.
| S-EPMC5048071 | biostudies-literature

GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array.
| S-EPMC4123905 | biostudies-literature

Parallel and private generalized suffix tree construction and query on genomic data
| S-EPMC9206251 | biostudies-literature

FastSKAT: Sequence kernel association tests for very large sets of markers.
| S-EPMC6129408 | biostudies-literature

Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets.
| S-EPMC4426842 | biostudies-literature

Confirming the phylogeny of mammals by use of large comparative sequence data sets.
| S-EPMC2515873 | biostudies-literature