Dataset Information

GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.

ABSTRACT:

Background

Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene "functional similarity" clustering and the related problem of functional ontology structure inference, but it is not known how different similarity measures or clustering methods perform on this task, and how the clusters are affected by annotation completeness.

Results

We developed representations of both "complete" and "incomplete" GO annotation datasets based on experimentally-supported annotations from the GO database-specifically designed to model the incompleteness of human gene annotations-and computed semantic similarities for each set using a variety of different published measures. We then assessed the clusters derived from these measures using two different clustering methods: hierarchical clustering, and the CliXO algorithm. We find the CliXO algorithm, combined with appropriate measures, performs better than hierarchical clustering in reconstructing GO both when the data are complete, and incomplete. Some measures, particularly those that create a pairwise gene similarity by averaging over all pairwise annotation similarities, had consistently poor performance, and a few measures, such as Lin best-matched average and Relevance maximum, were generally among the best performers for a broad range in annotation completeness and types of GO classes. Finally, we show that for semantic similarity-based clustering, the multicellular organism process branch of the GO biological process ontology is more challenging to represent than the cellular process branch.

Conclusions

We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores. Our results suggest combinations of semantic similarity measures, gene-level scoring methods and clustering method that perform best for functional gene clustering using annotation sets of varying completeness. Overall, our results underscore the importance of increasing the completeness of GO annotations to for supporting computational analyses of gene function.

SUBMITTER: Liu M

PROVIDER: S-EPMC6437941 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.

Liu Meng M Thomas Paul D PD

BMC bioinformatics 20190327 1

<h4>Background</h4>Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene "functional similarity" clustering and the related problem of functional ...[more]

PMID: 30917779

Similar Datasets

Project description:BackgroundThe Gene Ontology (GO) is a well known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes).ResultsWe present here a new semantic similarity measure called IntelliGO which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The IntelliGO similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO biological process and molecular function terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the IntelliGO similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the IntelliGO similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures.ConclusionsThe IntelliGO similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering.AvailabilityAn on-line version of the IntelliGO similarity measure is available at: http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/

Project description:BackgroundThe rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".ResultsTo show the effectiveness of CLUSS, we performed an extensive clustering on COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.ConclusionWe have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for the alignment-dependent algorithms.

Project description:Trypanosomatids are the causative agents of deadly diseases in humans and livestock. Given the high phylogenetic distance of trypanosomatids from model organisms, these organisms have ample unannotated genes. Manual functional annotation is time-consuming, highlighting the importance of automated functional annotation tools. The development of automated functional tools is a hot research topic, and multiple tools have been developed for the task. PANNZER2 is an automated functional annotation tool that merely relies on the sequence similarity of the query to the annotated proteins. We tried PANNZER2 on Trypanosoma brucei, the most studied organism among trypanosomatids, to see if it could improve our knowledge of the functions of the genes. Even with the availability of automated annotation tools like InterPro2GO in databases such as TriTrypDB, PANNZER2 has made surprisingly confident predictions for some hypothetical proteins in T. brucei. In this study, we identify gaps in such annotations because of not employing pairwise sequence alignment tools in TriTrypDB's automated annotation process. Our findings demonstrate that even the use of stringent cutoffs can successfully annotate a significant number of proteins. Additionally, we discovered that adjusting the open reading frames in certain genes leads to sequences with increased sequence signature coverage-characterized by the length covered by at least one sequence signature-compared to the original sequences. This enhanced sequence signature coverage suggests these genomic fragments could be pseudogenes. To facilitate further exploration, we developed a script to help identify potential pseudogenes within an organism's genome, offering researchers a new tool for genomic analysis and understanding. We extended all our analysis to Trypanosoma cruzi and Leishmania major to assess the impact of this approach across different species. Our study demonstrates that by utilizing pairwise sequence similarity alignment, even with stringent cutoffs, we can attribute 2986, 3953, and 3798 new GO terms to the genomes of T. brucei, T. cruzi, and L. major. Additionally, we found that 210, 239, and 29 genes exhibit increased sequence signature coverage following frame correction, suggesting the presence of pseudogenes.

Dataset Information

GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.

Background

Results

Conclusions

Publications

GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets