Dataset Information

Identification of repeat structure in large genomes using repeat probability clouds.

ABSTRACT: The identification of repeat structure in eukaryotic genomes can be time-consuming and difficult because of the large amount of information (approximately 3 × 10^9 bp) that needs to be processed and compared. We introduce a new approach based on exact word counts to evaluate, de novo, the repeat structure present within large eukaryotic genomes. This approach avoids sequence alignment and similarity search, two of the most time-consuming components of traditional methods for repeat identification. Algorithms were implemented to efficiently calculate exact counts for any length oligonucleotide in large genomes. Based on these oligonucleotide counts, oligonucleotide excess probability clouds, or "P-clouds," were constructed. P-clouds are composed of clusters of related oligonucleotides that occur, as a group, more often than expected by chance. After construction, P-clouds were mapped back onto the genome, and regions of high P-cloud density were identified as repetitive regions based on a sliding window approach. This efficient method is capable of analyzing the repeat content of the entire human genome on a single desktop computer in less than half a day, at least 10-fold faster than current approaches. The predicted repetitive regions strongly overlap with known repeat elements as well as other repetitive regions such as gene families, pseudogenes, and segmental duplicons. This method should be extremely useful as a tool for use in de novo identification of repeat structure in large newly sequenced genomes.
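
The pipeline outlined in the abstract (exact word counts, clusters of related high-count words, then a sliding-window density scan) can be illustrated with a small sketch. This is not the authors' P-clouds implementation: the function names, the one-substitution neighbour rule, and the thresholds `core_min`, `outer_min`, `window`, and `min_hits` are all assumptions chosen for illustration, and the in-memory `Counter` would not scale to a whole genome as the published algorithms do.

```python
from collections import Counter


def count_kmers(seq, k):
    """Exact counts of every length-k word in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbours(word, alphabet="ACGT"):
    """Yield every word that differs from `word` by exactly one substitution."""
    for i, base in enumerate(word):
        for alt in alphabet:
            if alt != base:
                yield word[:i] + alt + word[i + 1:]


def build_cloud_words(counts, core_min, outer_min):
    """Collect core words (count >= core_min) plus their one-substitution
    neighbours whose own count is >= outer_min. Hypothetical simplification
    of cloud construction, not the published procedure."""
    cores = {w for w, c in counts.items() if c >= core_min}
    cloud = set(cores)
    for w in cores:
        for n in neighbours(w):
            if counts.get(n, 0) >= outer_min:
                cloud.add(n)
    return cloud


def repetitive_windows(seq, k, cloud, window, min_hits):
    """Return start positions of windows (spanning `window` consecutive word
    starts) that contain at least `min_hits` cloud words."""
    hits = [1 if seq[i:i + k] in cloud else 0 for i in range(len(seq) - k + 1)]
    flagged = []
    if len(hits) < window:
        return flagged
    running = sum(hits[:window])
    for start in range(len(hits) - window + 1):
        if start > 0:
            # Slide the window one position: add the new word, drop the old one.
            running += hits[start + window - 1] - hits[start - 1]
        if running >= min_hits:
            flagged.append(start)
    return flagged
```

A possible use on a toy sequence: `counts = count_kmers(seq, 4)`, then `cloud = build_cloud_words(counts, core_min=5, outer_min=2)`, then `repetitive_windows(seq, 4, cloud, window=10, min_hits=8)`. Keeping a running sum makes the scan linear in sequence length, which matches the abstract's emphasis on avoiding alignment and similarity search.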

SUBMITTER: Gu W

PROVIDER: S-EPMC2533575 | biostudies-literature | 2008 Sep

REPOSITORIES: biostudies-literature


Publications

Identification of repeat structure in large genomes using repeat probability clouds.

Gu W, Castoe TA, Hedges DJ, Batzer MA, Pollock DD

Analytical biochemistry, 2008-05-20, 1


PMID: 18541131

Similar Datasets

Identifying repeat domains in large genomes.
| S-EPMC1431705 | biostudies-literature

Repeat characterization in large plant genomes
| PRJEB34435 | ENA

Probability-based protein secondary structure identification using combined NMR chemical-shift data.
| S-EPMC2373532 | biostudies-literature

Large scale in silico characterization of repeat expansion variation in human genomes.
| S-EPMC7479135 | biostudies-literature

Identification of large-scale genomic variation in cancer genomes using in silico reference models.
| S-EPMC4705683 | biostudies-literature

MITE Tracker: an accurate approach to identify miniature inverted-repeat transposable elements in large genomes.
| S-EPMC6171319 | biostudies-literature

Divergent copies of the large inverted repeat in the chloroplast genomes of ulvophycean green algae.
| S-EPMC5430533 | biostudies-literature

Large Differences in the Haptophyte Phaeocystis globosa Mitochondrial Genomes Driven by Repeat Amplifications.
| S-EPMC8283788 | biostudies-literature

Enrichment of G4DNA and a Large Inverted Repeat Coincide in the Mitochondrial Genomes of Termitomyces.
| S-EPMC6609731 | biostudies-literature

OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees.
| S-EPMC4864936 | biostudies-literature