Dataset Information

SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.

ABSTRACT: With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
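
The indexing and graph-construction steps described in the abstract can be illustrated with a minimal sketch. It assumes contiguous k-mers as a stand-in for hash seeds; the actual SECOM hash seed function and its backward graph percolation step are not reproduced here. The names `seed_index`, `similarity_graph`, and the toy sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=4):
    """Map each k-mer (assumed stand-in for a hash seed) to the proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def similarity_graph(index):
    """Build weighted edges: weight = number of distinct seeds two proteins share."""
    weights = defaultdict(int)
    for names in index.values():
        # Every pair of proteins sharing this seed gains one unit of edge weight.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy sequences (hypothetical): P1 and P2 share the segment "TAYIAK".
proteins = {
    "P1": "MKTAYIAKQR",
    "P2": "GGTAYIAKLL",
    "P3": "WWWWWWWWWW",
}
graph = similarity_graph(seed_index(proteins, k=4))
print(graph)
```

Because the index is built in one pass over the sequences, candidate pairs are found without all-against-all alignment; a community-finding step would then operate on `graph`.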

SUBMITTER: Fan M

PROVIDER: S-EPMC3386278 | biostudies-literature | 2012

REPOSITORIES: biostudies-literature

Publications

SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.

Fan M, Wong KC, Ryu T, Ravasi T, Gao X

PLoS ONE, 2012-06-28, issue 6

PMID: 22761802

Similar Datasets

Project description: A large number of highly pathogenic bacteria utilize secretion systems to translocate effector proteins into host cells. Using these effectors, the bacteria subvert host cell processes during infection. Legionella pneumophila translocates effectors via the Icm/Dot type-IV secretion system and to date, approximately 100 effectors have been identified by various experimental and computational techniques. Effector identification is a critical first step towards the understanding of the pathogenesis system in L. pneumophila as well as in other bacterial pathogens. Here, we formulate the task of effector identification as a classification problem: each L. pneumophila open reading frame (ORF) was classified as either effector or not. We computationally defined a set of features that best distinguish effectors from non-effectors. These features cover a wide range of characteristics including taxonomical dispersion, regulatory data, genomic organization, similarity to eukaryotic proteomes and more. Machine learning algorithms utilizing these features were then applied to classify all the ORFs within the L. pneumophila genome. Using this approach we were able to predict and experimentally validate 40 new effectors, reaching a success rate of above 90%. Increasing the number of validated effectors to around 140, we were able to gain novel insights into their characteristics. Effectors were found to have low G+C content, supporting the hypothesis that a large number of effectors originate via horizontal gene transfer, probably from their protozoan host. In addition, effectors were found to cluster in specific genomic regions. Finally, we were able to provide a novel description of the C-terminal translocation signal required for effector translocation by the Icm/Dot secretion system. To conclude, we have discovered 40 novel L. pneumophila effectors, predicted over a hundred additional highly probable effectors, and shown the applicability of machine learning algorithms for the identification and characterization of bacterial pathogenesis determinants.

Project description: BACKGROUND: Assessing protein modularity is important to understand protein evolution. Still the question of the existence of a sub-domain modular architecture remains. We propose a graph-theory approach with significance and power testing to identify modules in protein structures. In the first step, clusters are determined by optimizing the partition that maximizes the modularity score. Second, each cluster is tested for significance. Significant clusters are referred to as modules. Evolutionary modules are identified by analyzing homologous structures. Dynamic modules are inferred from sets of snapshots of molecular simulations. We present here a methodology to identify sub-domain architecture robustly, biologically meaningful, and statistically supported. RESULTS: The robustness of this new method is tested using simulated data with known modularity. Modules are correctly identified even when there is a low correlation between landmarks within a module. We also analyzed the evolutionary modularity of a data set of α-amylase catalytic domain homologs, and the dynamic modularity of the Niemann-Pick C1 (NPC1) protein N-terminal domain. The α-amylase contains an (α/β)8 barrel (TIM barrel) with the polysaccharides cleavage site and a calcium-binding domain. In this data set we identified four robust evolutionary modules, one of which forms the minimal functional TIM barrel topology. The NPC1 protein is involved in the intracellular lipid metabolism coordinating sterol trafficking. NPC1 N-terminus is the first luminal domain which binds to cholesterol and its oxygenated derivatives. Our inferred dynamic modules in the protein NPC1 are also shown to match functional components of the protein related to the NPC1 disease. CONCLUSIONS: A domain compartmentalization can be found and described in correlation space. To our knowledge, there is no other method attempting to identify sub-domain architecture from the correlation among residues. Most attempts made focus on sequence motifs of protein-protein interactions, binding sites, or sequence conservancy. We were able to describe functional/structural sub-domain architecture related to key residues for starch cleavage, calcium, and chloride binding sites in the α-amylase, and sterol opening-defining modules and disease-related residues in the NPC1. We also described the evolutionary sub-domain architecture of the α-amylase catalytic domain, identifying the already reported minimum functional TIM barrel.

Project description: BACKGROUND: Multiple proteins containing BURP domain have been identified in many different plant species, but not in any other organisms. To date, the molecular function of the BURP domain is still unknown, and no systematic analysis and expression profiling of the gene family in soybean (Glycine max) has been reported. RESULTS: In this study, multiple bioinformatics approaches were employed to identify all the members of BURP family genes in soybean. A total of 23 BURP gene types were identified. These genes had diverse structures and were distributed on chromosome 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, and 18. Phylogenetic analysis suggested that these BURP family genes could be classified into 5 subfamilies, and one of which defines a new subfamily, BURPV. Quantitative real-time PCR (qRT-PCR) analysis of transcript levels showed that 15 of the 23 genes had no expression specificity; 7 of them were specifically expressed in some of the tissues; and one of them was not expressed in any of the tissues or organs studied. The results of stress treatments showed that 17 of the 23 identified BURP family genes responded to at least one of the three stress treatments; 6 of them were not influenced by stress treatments even though a stress related cis-element was identified in the promoter region. No stress related cis-elements were found in promoter region of any BURPV member. However, qRT-PCR results indicated that all members from BURPV responded to at least one of the three stress treatments. More significantly, the members from the RD22-like subfamily showed no tissue-specific expression and they all responded to each of the three stress treatments. CONCLUSIONS: We have identified and classified all the BURP domain-containing genes in soybean. Their expression patterns in different tissues and under different stress treatments were detected using qRT-PCR. 15 out of 23 BURP genes in soybean had no tissue-specific expression, while 17 out of them were stress-responsive. The data provided an insight into the evolution of the gene family and suggested that many BURP family genes may be important for plants responding to stress conditions.
