Dataset Information

Next generation models for storage and representation of microbial biological annotation.

ABSTRACT: BACKGROUND: Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way.
RESULTS: Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files.
CONCLUSIONS: The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.
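
To make the idea of a Turtle equivalent to a GenBank/EMBL feature entry concrete, the following is a minimal sketch only, not the framework described in the paper. The `ex:` namespace URI, the terms `ex:Gene`, `ex:start`, `ex:end` and `ex:strand`, and the sample gene record are all hypothetical and chosen purely for illustration; a real pipeline would presumably use terms from a published OWL ontology.

```python
# Minimal sketch: serialize one gene annotation as Turtle (Terse RDF Triple Language).
# ASSUMPTIONS: the "ex:" namespace, its terms, and the sample record below are
# made up for illustration; none of them come from the paper.

PREFIXES = {
    "ex": "http://example.org/annotation#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def literal(value):
    """Render a Python int or string as a Turtle literal."""
    if isinstance(value, int):
        return f'"{value}"^^xsd:integer'
    # Escape backslashes and double quotes inside string literals.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def gene_to_turtle(gene):
    """Return a small Turtle document describing one gene feature."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    subject = f"ex:{gene['locus_tag']}"
    predicates = [
        ("a", "ex:Gene"),
        ("rdfs:label", literal(gene["product"])),
        ("ex:start", literal(gene["start"])),
        ("ex:end", literal(gene["end"])),
        ("ex:strand", literal(gene["strand"])),
    ]
    # Predicate-object pairs for one subject are separated by ";" in Turtle.
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in predicates)
    lines.append(f"{subject}\n{body} .")
    return "\n".join(lines)


# Hypothetical record, analogous to a single "gene" feature in a flat file.
gene = {
    "locus_tag": "GENE_0001",
    "product": "hypothetical protein",
    "start": 190,
    "end": 255,
    "strand": "+",
}

turtle_text = gene_to_turtle(gene)
print(turtle_text)
```

Each predicate line plays the role of a qualifier in a flat-file feature entry, but because every term is a URI it can in principle be linked to an ontology definition rather than interpreted by convention.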

SUBMITTER: Quest DJ

PROVIDER: S-EPMC3026362 | biostudies-literature | 2010

REPOSITORIES: biostudies-literature

Publications

Next generation models for storage and representation of microbial biological annotation.

Quest DJ, Land ML, Brettin TS, Cottingham RW

BMC Bioinformatics, 2010-10-07

PMID: 20946598

Similar Datasets

Project description: BACKGROUND: The glucosyltransferase UGT76G1 from Stevia rebaudiana is a chameleon enzyme in the targeted biosynthesis of the next-generation premium stevia sweeteners, rebaudioside D (Reb D) and rebaudioside M (Reb M). These steviol glucosides carry five and six glucose units, respectively, and have low sweetness thresholds, high maximum sweet intensities and exhibit a greatly reduced lingering bitter taste compared to stevioside and rebaudioside A, the most abundant steviol glucosides in the leaves of Stevia rebaudiana.
RESULTS: In the metabolic glycosylation grid leading to production of Reb D and Reb M, UGT76G1 was found to catalyze eight different reactions all involving 1,3-glucosylation of steviol C13- and C19-bound glucoses. Four of these reactions lead to Reb D and Reb M while the other four result in formation of side-products unwanted for production. In this work, side-product formation was reduced by targeted optimization of UGT76G1 towards 1,3-glucosylation of steviol glucosides that are already 1,2-diglucosylated. The optimization of UGT76G1 was based on homology modelling, which enabled identification of key target amino acids present in the substrate-binding pocket. These residues were then subjected to site-saturation mutagenesis and a mutant library containing a total of 1748 UGT76G1 variants was screened for increased accumulation of Reb D or M, as well as for decreased accumulation of side-products. This screen was performed in a Saccharomyces cerevisiae strain expressing all enzymes in the rebaudioside biosynthesis pathway except for UGT76G1.
CONCLUSIONS: Screening of the mutant library identified mutations with positive impact on the accumulation of Reb D and Reb M. The effect of the introduced mutations on other reactions in the metabolic grid was characterized. This screen made it possible to identify variants, such as UGT76G1Thr146Gly and UGT76G1His155Leu, which diminished accumulation of unwanted side-products and gave increased specific accumulation of the desired Reb D or Reb M sweeteners. This improvement in a key enzyme of the Stevia sweetener biosynthesis pathway represents a significant step towards the commercial production of next-generation stevia sweeteners.

Project description: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses.
We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies.
The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates.
This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
