Dataset Information

Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data.

ABSTRACT: Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene oriented databases, but mostly give indirect access to non-sequence data via navigational links. Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species. This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes.
Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic scale data resources.
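
The retrieval step described in the abstract (query sequence in, associated non-sequence records out, ordered by similarity) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not name the search tool, so a simple k-mer Jaccard score stands in for a real sequence similarity search, and the names Entry, search, DATABASE, the sequences, and the image file names are all hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    sequence: str        # sequence associated with the target data
    records: list[str]   # non-sequence data, e.g. image or reference identifiers


def kmers(seq: str, k: int = 4) -> set[str]:
    """Return the set of overlapping k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of k-mer sets: a crude stand-in for an alignment score."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def search(query: str, database: dict[str, Entry], min_score: float = 0.2):
    """Return (score, entry_id, records) tuples above min_score, best match first."""
    hits = []
    for entry_id, entry in database.items():
        score = similarity(query, entry.sequence)
        if score >= min_score:
            hits.append((score, entry_id, entry.records))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits


# Hypothetical toy database: made-up sequences and record names.
DATABASE = {
    "seqA": Entry("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", ["image_001.png"]),
    "seqB": Entry("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGTTAG",
                  ["image_002.png", "image_003.png"]),
    "seqC": Entry("TTTTTTCCCCCCAAAAAAGGGGGGTTTTTTCCCCCCAAA", ["image_004.png"]),
}

if __name__ == "__main__":
    query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    for score, entry_id, records in search(query, DATABASE):
        print(f"{score:.2f}\t{entry_id}\t{', '.join(records)}")
```

Note that no gene name or annotation is consulted at any point: the only key linking the query to the returned records is the sequence itself, which is the property the abstract emphasises. A production system would presumably replace the k-mer score with an alignment-based search.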

SUBMITTER: Gilchrist MJ

PROVIDER: S-EPMC2587480 | biostudies-literature | 2008 Oct

REPOSITORIES: biostudies-literature


Publications

Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data.

Gilchrist MJ, Christensen MB, Harland R, Pollet N, Smith JC, Ueno N, Papalopulu N

BMC Bioinformatics, 2008-10-17


PMID: 18928517

Similar Datasets

Project description: Trypanosomatids are the causative agents of deadly diseases in humans and livestock. Because trypanosomatids are phylogenetically distant from model organisms, they have many unannotated genes. Manual functional annotation is time-consuming, which makes automated functional annotation tools important, and multiple such tools have been developed. PANNZER2 is an automated functional annotation tool that relies solely on the sequence similarity of the query to annotated proteins. We applied PANNZER2 to Trypanosoma brucei, the most studied trypanosomatid, to see whether it could improve our knowledge of gene function. Even though automated annotation tools such as InterPro2GO are already available in databases such as TriTrypDB, PANNZER2 made surprisingly confident predictions for some hypothetical proteins in T. brucei. In this study, we identify gaps in such annotations that arise because TriTrypDB's automated annotation process does not employ pairwise sequence alignment tools. Our findings demonstrate that a significant number of proteins can be annotated successfully even with stringent cutoffs. Additionally, we discovered that adjusting the open reading frames in certain genes leads to sequences with increased sequence signature coverage (the length covered by at least one sequence signature) compared to the original sequences. This enhanced sequence signature coverage suggests these genomic fragments could be pseudogenes. To facilitate further exploration, we developed a script to help identify potential pseudogenes within an organism's genome, offering researchers a new tool for genomic analysis and understanding. We extended all our analysis to Trypanosoma cruzi and Leishmania major to assess the impact of this approach across different species.
Our study demonstrates that by utilizing pairwise sequence similarity alignment, even with stringent cutoffs, we can attribute 2986, 3953, and 3798 new GO terms to the genomes of T. brucei, T. cruzi, and L. major, respectively. Additionally, we found that 210, 239, and 29 genes, respectively, exhibit increased sequence signature coverage following frame correction, suggesting the presence of pseudogenes.

Project description: Background: Microsporidia are a large taxon of intracellular pathogens characterized by extraordinarily streamlined genomes with unusually high sequence divergence and many species-specific adaptations. These unique factors pose challenges for traditional genome annotation methods based on sequence similarity. As a result, many of the microsporidian genomes sequenced to date contain numerous genes of unknown function. Recent innovations in rapid and accurate structure prediction and comparison, together with the growing amount of data in structural databases, provide new opportunities to assist in the functional annotation of newly sequenced genomes. Results: In this study, we established a workflow that combines sequence and structure-based functional gene annotation approaches employing a ChimeraX plugin named ANNOTEX (Annotation Extension for ChimeraX), allowing for visual inspection and manual curation. We employed this workflow on a high-quality telomere-to-telomere sequenced tetraploid genome of Vairimorpha necatrix. First, the 3080 predicted protein-coding DNA sequences, of which 89% were confirmed with RNA sequencing data, were used as input. Next, ColabFold was used to create protein structure predictions, followed by a Foldseek search for structural matching to the PDB and AlphaFold databases. The subsequent manual curation, using sequence and structure-based hits, increased the accuracy and quality of the functional genome annotation compared to results using only traditional annotation tools. Our workflow resulted in a comprehensive description of the V. necatrix genome, along with a structural summary of the most prevalent protein groups, such as the ricin B lectin family.
In addition, and to test our tool, we identified the functions of several previously uncharacterized Encephalitozoon cuniculi genes. Conclusion: We provide a new functional annotation tool for divergent organisms and employ it on a newly sequenced, high-quality microsporidian genome to shed light on this uncharacterized intracellular pathogen of Lepidoptera. The addition of a structure-based annotation approach can serve as a valuable template for studying other microsporidian or similarly divergent species.

