Dataset Information

Identification of prokaryotic small proteins using a comparative genomic approach.

ABSTRACT:

Motivation

Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or fewer) remains an elusive open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein coding sequence. These methods often fail for small proteins, however, either because few small protein coding genes have been experimentally verified or because statistics computed on such short sequences have limited significance. Our approach is based upon the hypothesis that true small proteins will be under selective pressure for encoding the particular amino acid sequence, for ease of translation by the ribosome and for structural stability. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties much like larger proteins. Our method incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach, using a genomic alignment of 22 closely related bacterial genomes to predict whether or not a given open reading frame (ORF) encodes a small protein.

Results

We applied this method to the complete genome of Escherichia coli strain K12 and evaluated how well it performed on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11,407 possible ORFs, 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins, and 35 of the true small proteins fell within the top 200 predictions. We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but made no prediction for 34 of them because of size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins, but placed only 9 of the 60 true small proteins within the top 200 predictions.
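The ranking figures above can be read as precision and recall at a cutoff k. The following is a minimal sketch, not the authors' evaluation code: the function names `hits_at_k` and `precision_recall_at_k` are hypothetical, and only the counts (60 verified proteins; 6, 27 and 35 hits in the top 10, 100 and 200) come from the abstract.

```python
from typing import Dict, Iterable, Sequence, Tuple


def hits_at_k(ranked_orfs: Sequence[str], verified: Iterable[str], k: int) -> int:
    """Count how many of the top-k ranked ORF identifiers are in the verified set."""
    verified_set = set(verified)
    return sum(1 for orf in ranked_orfs[:k] if orf in verified_set)


def precision_recall_at_k(hits: int, k: int, n_verified: int) -> Tuple[float, float]:
    """Turn a hit count into (precision@k, recall@k)."""
    return hits / k, hits / n_verified


# Counts taken from the abstract; everything else here is illustrative.
N_VERIFIED = 60
reported_hits: Dict[int, int] = {10: 6, 100: 27, 200: 35}

summary = {
    k: precision_recall_at_k(hits, k, N_VERIFIED)
    for k, hits in reported_hits.items()
}

for k, (precision, recall) in summary.items():
    print(f"top {k}: precision={precision:.3f}, recall={recall:.3f}")
```

With the reported counts, precision falls from 0.600 at k = 10 to 0.175 at k = 200, while recall rises from 0.100 to about 0.583. Given a ranked list of ORF identifiers, `hits_at_k` would supply the hit counts directly.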

Contact

jsamayoa@jhu.edu

SUBMITTER: Samayoa J

PROVIDER: S-EPMC3117347 | biostudies-literature | 2011 Jul

REPOSITORIES: biostudies-literature


Publications

Identification of prokaryotic small proteins using a comparative genomic approach.

Samayoa J, Yildiz FH, Karplus K

Bioinformatics (Oxford, England), 2011 May 5, issue 13


PMID: 21551138

Similar Datasets

Project description: With mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been put forward to analyze prokaryotic phenotypes, current computational technologies either lack high throughput capacity for genomic scale analysis, or are limited in their capability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is prohibited. Here, we developed a high throughput computational approach, and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes along with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method over 59 fully sequenced prokaryotic species, we identified genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes, with 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%-42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%-80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes). We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.

Project description: Background: Innate immune genes tend to be highly conserved in metazoans, even in early divergent lineages such as Cnidaria (jellyfish, corals, hydroids and sea anemones) and Porifera (sponges). However, constant and diverse selection pressures on the immune system have driven the expansion and diversification of different immune gene families in a lineage-specific manner. To investigate how the innate immune system has evolved in a subset of sea anemone species (Order: Actiniaria), we performed a comprehensive and comparative study using 10 newly sequenced transcriptomes, as well as three publicly available transcriptomes, to identify the origins, expansions and contractions of candidate and novel immune gene families. Results: We characterised five conserved genes and gene families, as well as multiple novel innate immune genes, including the newly recognised putative pattern recognition receptor CniFL. Single copies of TLR, MyD88 and NF-κB were found in most species, and several copies of IL-1R-like, NLR and CniFL were found in almost all species. Multiple novel immune genes were identified with domain architectures including the Toll/interleukin-1 receptor (TIR) homology domain, which is well documented as functioning in protein-protein interactions and signal transduction in immune pathways. We hypothesise that these genes may interact as novel proteins in immune pathways of cnidarian species. Novelty in the actiniarian immunome is not restricted to only TIR-domain-containing proteins, as we identify a subset of NLRs which have undergone neofunctionalisation and contain 3-5 N-terminal transmembrane domains, which have so far only been identified in two anthozoan species. Conclusions: This research has significance in understanding the evolution and origin of the core eumetazoan gene set, including how novel innate immune genes evolve. For example, the evolution of transmembrane domain containing NLRs indicates that these NLRs may be membrane-bound, while all other metazoan and plant NLRs are exclusively cytosolic receptors. This is one example of how species without an adaptive immune system may evolve innovative solutions to detect pathogens or interact with native microbiota. Overall, these results provide an insight into the evolution of the innate immune system, and show that early divergent lineages, such as actiniarians, have a diverse repertoire of conserved and novel innate immune genes.

