Dataset Information

Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes.

ABSTRACT: BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remains difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach. RESULTS: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular update and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).

SUBMITTER: Bocs S

PROVIDER: S-EPMC77393 | biostudies-literature | 2002

REPOSITORIES: biostudies-literature

Publications

Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes.

Bocs S, Danchin A, Médigue C

BMC Bioinformatics, 2002-02-05

PMID: 11879526

Similar Datasets

An integrative method for identifying the over-annotated protein-coding genes in microbial genomes.
| S-EPMC3223076 | biostudies-literature

Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.
| S-EPMC4591122 | biostudies-literature

Accurate annotation of protein coding sequences with IDTAXA.
| S-EPMC8445202 | biostudies-literature

Finding protein-coding genes through human polymorphisms.
| S-EPMC3551959 | biostudies-literature

Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes.
| S-EPMC4216110 | biostudies-literature

Finding new proteins encoded by long non-coding RNAs in human cells
2019-07-20 | GSE79539 | GEO

Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes.
| S-EPMC6481552 | biostudies-literature

Finding new human minisatellite sequences in the vicinity of long CA-rich sequences.
| S-EPMC310796 | biostudies-literature

Re-annotation of protein-coding genes in 10 complete genomes of Neisseriaceae family by combining similarity-based and composition-based methods.
| S-EPMC3686433 | biostudies-literature

Genome-wide annotation of protein-coding genes in pig.
| S-EPMC8788080 | biostudies-literature