Dataset Information

Dictionary-driven prokaryotic gene finding.

ABSTRACT: Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
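The idea in the abstract, keeping only those ORFs whose translations are supported by known protein patterns, can be illustrated with a toy sketch. This is not the BDGF algorithm or its scoring scheme: the regular expressions below merely stand in for Bio-Dictionary patterns, the names `find_orfs`, `pattern_support`, and `predict_genes` and the `min_hits` threshold are hypothetical, and only the three forward reading frames of a single strand are scanned.

```python
import re

# Standard genetic code, bases ordered T, C, A, G (amino-acid assignments
# are the same in the bacterial/archaeal translation table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Yield (start, end, protein) for ATG-to-stop ORFs in the 3 forward frames."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon after the previous stop
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    protein = "".join(
                        CODON_TABLE.get(dna[i:i + 3], "X")
                        for i in range(start, pos, 3)
                    )
                    yield start, pos + 3, protein
                start = None


def pattern_support(protein, patterns):
    """Count how many patterns occur at least once in the protein."""
    return sum(1 for p in patterns if re.search(p, protein))


def predict_genes(dna, patterns, min_hits=2):
    """Keep ORFs supported by at least min_hits patterns (hypothetical rule)."""
    return [
        (start, end, protein)
        for start, end, protein in find_orfs(dna)
        if pattern_support(protein, patterns) >= min_hits
    ]


# Made-up input: one short ORF and three illustrative patterns
# ('.' plays the role of a wildcard position).
toy_dna = "CCATGAAAGGTCTGGAATAACC"
toy_patterns = ["K.L", "MK", "WW"]
candidates = predict_genes(toy_dna, toy_patterns)
```

In this sketch an ORF is accepted purely on pattern evidence, mirroring the abstract's statement that no additional signals such as ribosome-binding sites are used; how the real method weighs overlapping patterns and selects start sites is not specified here.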

SUBMITTER: Shibuya T

PROVIDER: S-EPMC117281 | biostudies-literature | 2002 Jun

REPOSITORIES: biostudies-literature

Publications

Dictionary-driven prokaryotic gene finding.

Shibuya Tetsuo, Rigoutsos Isidore

Nucleic Acids Research, 2002-06-01, issue 12

PMID: 12060689

Similar Datasets

MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. | S-EPMC1636498 | biostudies-literature

Prokaryotic gene finding based on physicochemical characteristics of codons calculated from molecular dynamics simulations. | S-EPMC2480686 | biostudies-literature

LeadMine: a grammar and dictionary driven approach to entity recognition. | S-EPMC4331695 | biostudies-literature

Finding functional associations between prokaryotic virus orthologous groups: a proof of concept. | S-EPMC8442406 | biostudies-literature

Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods. | S-EPMC7229753 | biostudies-literature

The Glycan Structure Dictionary-a dictionary describing commonly used glycan structure terms. | S-EPMC10243773 | biostudies-literature

Gene finding in novel genomes. | S-EPMC421630 | biostudies-literature

Gene finding in metatranscriptomic sequences. | S-EPMC4168707 | biostudies-literature

Dictionary-enhanced imaging cytometry. | S-EPMC5320489 | biostudies-literature

Reconstitution of DNA segregation driven by assembly of a prokaryotic actin homolog. | S-EPMC2851738 | biostudies-literature