Dataset Information

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

ABSTRACT:

Background

The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.

Results

Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related--a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.

Conclusions

This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

SUBMITTER: Neuwald AF

PROVIDER: S-EPMC3599474 | biostudies-literature | 2012 Jun

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

Neuwald Andrew F AF Lanczycki Christopher J CJ Marchler-Bauer Aron A

BMC bioinformatics 20120622

<h4>Background</h4>The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.<h4>Results</h4>Here we automate the generation of conserved domain ...[more]

PMID: 22726767

Similar Datasets

Project description:Osteoarthritis (OA) is a global healthcare problem. The increasing population of OA patients demands a greater bandwidth of imaging and diagnostics. It is important to provide automatic and objective diagnostic techniques to address this challenge. This study demonstrates the utility of unsupervised domain adaptation (UDA) for automated OA phenotype classification. We collected 318 and 960 three-dimensional double-echo steady-state magnetic resonance images from the Osteoarthritis Initiative (OAI) dataset as the source dataset for phenotype cartilage/meniscus and subchondral bone, respectively. Fifty three-dimensional turbo spin echo (TSE)/fast spin echo (FSE) MR images from our institute were collected as the target datasets. For each patient, the degree of knee OA was initially graded according to the MRI Knee Osteoarthritis Knee Score before being converted to binary OA phenotype labels. The proposed four-step UDA pipeline included (I) pre-processing, which involved automatic segmentation and region-of-interest cropping; (II) source classifier training, which involved pre-training a convolutional neural network (CNN) encoder for phenotype classification using the source dataset; (III) target encoder adaptation, which involved unsupervised adjustment of the source encoder to the target encoder using both the source and target datasets; and (IV) target classifier validation, which involved statistical analysis of the classification performance evaluated by the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity and accuracy. We compared our model on the target data with the source pre-trained model and the model trained with the target data from scratch. For phenotype cartilage/meniscus, our model has the best performance out of the three models, giving 0.90 [95% confidence interval (CI): 0.79-1.02] of the AUROC score, while the other two model show 0.52 (95% CI: 0.13-0.90) and 0.76 (95% CI: 0.53-0.98). For phenotype subchondral bone, our model gave 0.75 (95% CI: 0.56-0.94) at AUROC, which has a close performance of the source pre-trained model (0.76, 95% CI: 0.55-0.98), and better than the model trained from scratch on the target dataset only (0.53, 95% CI: 0.33-0.73). By utilising a large, high-quality source dataset for training, the proposed UDA approach enhances the performance of automated OA phenotype classification for small target datasets. As a result, our technique enables improved downstream analysis of locally collected datasets with a small sample size.

Project description:Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.

Project description:Members of the WXG100 protein superfamily form homo- or heterodimeric complexes. The most studied proteins among them are the secreted T-cell antigens CFP-10 (10 kDa culture filtrate protein, EsxB) and ESAT-6 (6 kDa early secreted antigen target, EsxA) from Mycobacterium tuberculosis. They are encoded on an operon within a gene cluster, named as ESX-1, that encodes for the Type VII secretion system (T7SS). WXG100 proteins are secreted in a full-length form and it is known that they adopt a four-helix bundle structure. In the current work we discuss the evolutionary relationship between the homo- and heterodimeric WXG100 proteins, the basis of the oligomeric state and the key structural features of the conserved sequence pattern of WXG100 proteins. We performed an iterative bioinformatics analysis of the WXG100 protein superfamily and correlated this with the atomic structures of the representative WXG100 proteins. We find, firstly, that the WXG100 protein superfamily consists of three subfamilies: CFP-10-, ESAT-6- and sagEsxA-like proteins (EsxA proteins similar to that of Streptococcus agalactiae). Secondly, that the heterodimeric complexes probably evolved from a homodimeric precursor. Thirdly, that the genes of hetero-dimeric WXG100 proteins are always encoded in bi-cistronic operons and finally, by combining the sequence alignments with the X-ray data we identify a conserved C-terminal sequence pattern. The side chains of these conserved residues decorate the same side of the C-terminal α-helix and therefore form a distinct surface. Our results lead to a putatively extended T7SS secretion signal which combines two reported T7SS recognition characteristics: Firstly that the T7SS secretion signal is localized at the C-terminus of T7SS substrates and secondly that the conserved residues YxxxD/E are essential for T7SS activity. Furthermore, we propose that the specific α-helical surface formed by the conserved sequence pattern including YxxxD/E motif is a key component of T7SS-substrate recognition.

Project description:BACKGROUND:The experience with running various types of classification on the CAMDA neuroblastoma dataset have led us to the conclusion that the results are not always obvious and may differ depending on type of analysis and selection of genes used for classification. This paper aims in pointing out several factors that may influence the downstream machine learning analysis. In particular those factors are: type of the primary analysis, type of the classifier and increased correlation between the genes sharing a protein domain. They influence the analysis directly, but also interplay between them may be important. We have compiled the gene-domain database and used it for analysis to see the differences between the genes that share a domain versus the rest of the genes in the datasets. RESULTS:The major findings are: pairs of genes that share a domain have an increased Spearman's correlation coefficients of counts; genes sharing a domain are expected to have a lower predictive power due to increased correlation. For most of the cases it can be seen with the higher number of misclassified samples; classifiers performance may vary depending on a method, still in most cases using genes sharing a domain in the training set results in a higher misclassification rate; increased correlation in genes sharing a domain results most often in worse performance of the classifiers regardless of the primary analysis tools used, even if the primary analysis alignment yield varies. CONCLUSIONS:The effect of sharing a domain is likely more a results of real biological co-expression than just sequence similarity and artifacts of mapping and counting. Still, this is more difficult to conclude and needs further research. The effect is interesting itself, but we also point out some practical aspects in which it may influence the RNA sequencing analysis and RNA biomarker use. In particular it means that a gene signature biomarker set build out of RNA-sequencing results should be depleted for genes sharing common domains. It may cause to perform better when applying classification. REVIEWERS:This article was reviewed by Dimitar Vassiliev and Susmita Datta.

Dataset Information

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

Background

Results

Conclusions

Publications

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets