Dataset Information

Genome comparison without alignment using shortest unique substrings.

ABSTRACT:

BACKGROUND: Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees.

RESULTS: We find that the shortest unique substrings in Caenorhabditis elegans, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of Staphylococcus aureus strain MSSA476 compared to four other staphylococcal genomes.

CONCLUSION: We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at http://adenine.biz.fh-weihenstephan.de/shustring/.
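To make the definition of a shortest unique substring concrete, here is a minimal Python sketch. It is not the authors' shustring program, which is written in C and, per the abstract, relies on generalized suffix trees. This version simply counts every window of increasing length, which is only practical for short toy sequences. The function name `shortest_unique_substrings` and the example sequences are hypothetical, chosen for illustration.

```python
from collections import Counter


def shortest_unique_substrings(seq):
    """Return (length, hits) for the globally shortest unique substrings.

    hits is a list of (start, substring) pairs, one for each substring of
    the minimal length that occurs exactly once in seq (overlapping
    occurrences are counted). Returns (0, []) for an empty sequence.

    Naive window counting is used here for clarity; it is an illustrative
    stand-in for the suffix-tree approach named in the abstract.
    """
    n = len(seq)
    for length in range(1, n + 1):
        # Count every window of the current length, overlaps included.
        counts = Counter(seq[i:i + length] for i in range(n - length + 1))
        hits = [
            (i, seq[i:i + length])
            for i in range(n - length + 1)
            if counts[seq[i:i + length]] == 1
        ]
        if hits:
            # No shorter substring was unique, so these cannot be
            # shortened without losing uniqueness.
            return length, hits
    return 0, []


if __name__ == "__main__":
    print(shortest_unique_substrings("ACGTACGA"))
    print(shortest_unique_substrings("AACCAACC"))
```

In the first toy sequence the single base `T` already occurs only once, so the minimal length is 1. In the second, every single base is repeated and the only dinucleotide that occurs once is `CA` at position 3.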

SUBMITTER: Haubold B

PROVIDER: S-EPMC1166540 | biostudies-literature | 2005

REPOSITORIES: biostudies-literature

Publications

Genome comparison without alignment using shortest unique substrings.

Haubold B, Pierstorff N, Möller F, Wiehe T

BMC Bioinformatics, 2005-05-23

PMID: 15910684

Similar Datasets

Comparison of alignment software for genome-wide bisulphite sequence data. | S-EPMC3378906 | biostudies-other

Cell shape characterization, alignment, and comparison using FlowShape. | S-EPMC10307944 | biostudies-literature

Improving pan-genome annotation using whole genome multiple alignment. | S-EPMC3142524 | biostudies-literature

Fast alignment-free sequence comparison using spaced-word frequencies. | S-EPMC4080745 | biostudies-literature

2011 German Escherichia coli O104:H4 outbreak: whole-genome phylogeny without alignment. | S-EPMC3280199 | biostudies-literature

Matching sensor ontologies through siamese neural networks without using reference alignment. | S-EPMC8237319 | biostudies-literature

Quasi-prime peptides: identification of the shortest peptide sequences unique to a species. | S-EPMC10124967 | biostudies-literature

Alignment-free genomic sequence comparison using FCGR and signal processing. | S-EPMC6937637 | biostudies-literature

Alignment-free genome comparison enables accurate geographic sourcing of white oak DNA. | S-EPMC6288960 | biostudies-literature

iPBA: a tool for protein structure comparison using sequence alignment strategies. | S-EPMC3125758 | biostudies-literature