Dataset Information

Full-length messenger RNA sequences greatly improve genome annotation.

ABSTRACT:

BACKGROUND: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism.

RESULTS: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation.

CONCLUSIONS: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.

SUBMITTER: Haas BJ

PROVIDER: S-EPMC116726 | biostudies-literature | 2002

REPOSITORIES: biostudies-literature

Publications

Full-length messenger RNA sequences greatly improve genome annotation.

Haas BJ, Volfovsky N, Town CD, Troukhan M, Alexandrov N, Feldmann KA, Flavell RB, White O, Salzberg SL

Genome Biology, 2002-05-30, issue 6

PMID: 12093376

Similar Datasets

Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. | S-EPMC353228 | biostudies-literature

Full-length HLA class II sequences | PRJEB42469 | ENA

Full-length transcriptome assembly from RNA-Seq data without a reference genome. | S-EPMC3571712 | biostudies-literature

Full-Length Genome Sequences of Two Chinese Porcine Circovirus Type 3 Strains, NWHEB21 and NWHUN2. | S-EPMC5814483 | biostudies-literature

High molecular diversity of full-length genome sequences of zucchini yellow fleck virus from Europe. | S-EPMC9556397 | biostudies-literature

Functional annotation of 19,841 Populus nigra full-length enriched cDNA clones. | S-EPMC2222646 | biostudies-literature

Leveraging histone modifications to improve genome annotation | 2021-04-26 | GSE160944 | GEO

Ribosome reinitiation can explain length-dependent translation of messenger RNA. | S-EPMC5482490 | biostudies-literature

Genome-wide characterization of the biggest grass, bamboo, based on 10,608 putative full-length cDNA sequences. | S-EPMC3017805 | biostudies-literature

Full-Length Genome Sequences of Senecavirus A from Recent Idiopathic Vesicular Disease Outbreaks in U.S. Swine. | S-EPMC4999945 | biostudies-literature