Dataset Information

A bacterial phyla dataset for protein function prediction.

ABSTRACT: Protein function prediction has long been one of the most studied and most challenging problems for computational biologists. The vast majority of known proteins have not yet been characterised experimentally, and a significant gap remains between their structures and functions. New un-annotated sequences are being added to the public protein databases (e.g. UniprotKB) at an enormous pace [1]. Such proteins with unknown functions might play a key role in metabolism, growth, and the regulation of development. If their functions remain undiscovered, researchers may miss important information. Computational biology tools can provide insights into the function of proteins based on their sequence, structure, evolutionary history, and association with other proteins [2]. For proteins with well characterised close relatives, inferring function is trivial. Orphan proteins without discernible sequence relatives present a greater challenge [3]: for these, experimental characterisation proceeds blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be completely characterised experimentally [4]. Thus, there is an urgent need to develop fast and accurate computational approaches to fulfil this requirement.

Towards this end, we prepared a dataset for protein function prediction by extracting the protein sequences and annotations of reviewed prokaryotic proteins (323,719 in total, as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those annotated with the 1739 most frequent Gene Ontology (Molecular Function) terms, and 171,212 proteins were retained for feature generation. The dataset was generated by calculating sequence, sub-sequence, physicochemical and annotation-based features for each of the 171,212 reviewed proteins using the method in [10]. These features constitute a total of 9890 attributes for each protein sequence, alongside the 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target label. The dataset also contains the Entry and Entry name of each sequence as given in the UniprotKB database. Because the dataset is large (171,212 samples × 9890 features, with 1739 classes and multiple labels per sample) and contains a sufficient number of positive and negative samples for each of the 1739 classes, it is well suited to testing the efficiency of new deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in the ratio 3:1 to form Train/Test dataset 1: a train dataset with 128,409 samples and a test dataset with 42,803 samples, to facilitate training of a deep learning model. The train and test datasets are stratified so that each of the 1739 classes is well represented in both. We then prepared dataset 2, consisting of unreviewed pathogenic proteins from the 9 bacterial phyla, each with the same 9890 features as the train/test dataset of reviewed proteins but without target labels, so that their functions can be predicted using the deep learning model proposed in [5].
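The filtering, labelling and 3:1 split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`top_k_terms`, `multi_hot`, `split_3_to_1`), the toy entry names and their GO assignments are all hypothetical, and the split shown is a plain seeded random split rather than the stratified split the abstract describes.

```python
from collections import Counter
import random


def top_k_terms(annotations, k):
    """Return the k most frequent GO terms across all proteins."""
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    return [term for term, _ in counts.most_common(k)]


def multi_hot(annotations, terms):
    """Encode each protein's labels as a 0/1 vector over the retained terms.

    Proteins with none of the retained terms are dropped, mirroring the
    filtering step described in the abstract.
    """
    index = {term: i for i, term in enumerate(terms)}
    encoded = {}
    for entry, go_terms in annotations.items():
        row = [0] * len(terms)
        for term in go_terms:
            if term in index:
                row[index[term]] = 1
        if any(row):
            encoded[entry] = row
    return encoded


def split_3_to_1(entries, seed=0):
    """Plain random 3:1 split (NOT stratified; an assumption of this sketch)."""
    entries = sorted(entries)
    random.Random(seed).shuffle(entries)
    cut = (3 * len(entries)) // 4
    return entries[:cut], entries[cut:]


if __name__ == "__main__":
    # Toy annotations: entry names and term assignments are made up.
    toy = {
        "toy1": ["GO:0005524", "GO:0003677"],
        "toy2": ["GO:0005524"],
        "toy3": ["GO:0046872"],
        "toy4": ["GO:0005524", "GO:0046872"],
    }
    kept = top_k_terms(toy, k=2)
    print(kept, multi_hot(toy, kept))

    # The sample counts quoted in the abstract follow from a 3:1 split.
    train, test = split_3_to_1(range(171212))
    print(len(train), len(test))  # 128409 42803
```

With 171,212 samples, the three-quarter cut falls at 128,409, leaving 42,803 for testing, which matches the sizes reported for Train/Test dataset 1. A faithful reproduction would additionally need a multi-label stratification step so that each of the 1739 classes appears in both partitions.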

SUBMITTER: Mishra S

PROVIDER: S-EPMC6950771 | biostudies-literature | 2020 Feb

REPOSITORIES: biostudies-literature

Publications

A bacterial phyla dataset for protein function prediction.

Sarthak Mishra, Yash Pratap Rastogi, Suraiya Jabin, Punit Kaur, Mohammad Amir, Shabanam Khatoon

Data in Brief, 2019-12-18

PMID: 31921945

Similar Datasets

Project description: A fundamental requirement for life is the replication of an organism's DNA. Studies in Escherichia coli and Bacillus subtilis have set the paradigm for DNA replication in bacteria. During replication initiation in E. coli and B. subtilis, the replicative helicase is loaded onto the DNA at the origin of replication by an ATPase helicase loader. However, most bacteria do not encode homologs to the helicase loaders in E. coli and B. subtilis. Recent work has identified the DciA protein as a predicted helicase operator that may perform a function analogous to the helicase loaders in E. coli and B. subtilis. DciA proteins, which are defined by the presence of a DUF721 domain (termed the DciA domain herein), are conserved in most bacteria but have only been studied in mycobacteria and gammaproteobacteria (Pseudomonas aeruginosa and Vibrio cholerae). Sequences outside the DciA domain in Mycobacterium tuberculosis DciA are essential for protein function but are not conserved in the P. aeruginosa and V. cholerae homologs, raising questions regarding the conservation and evolution of DciA proteins across bacterial phyla. To comprehensively define the DciA protein family, we took a computational evolutionary approach and analyzed the domain architectures and sequence properties of DciA domain-containing proteins across the tree of life. These analyses identified lineage-specific domain architectures among DciA homologs, as well as broadly conserved sequence-structural motifs. The diversity of DciA proteins represents the evolution of helicase operation in bacterial DNA replication and highlights the need for phylum-specific analyses of this fundamental biological process. IMPORTANCE: Despite the fundamental importance of DNA replication for life, this process remains understudied in bacteria outside Escherichia coli and Bacillus subtilis. In particular, most bacteria do not encode the helicase-loading proteins that are essential in E. coli and B. subtilis for DNA replication. Instead, most bacteria encode a DciA homolog that likely constitutes the predominant mechanism of helicase operation in bacteria. However, it is still unknown how DciA structure and function compare across diverse phyla that encode DciA proteins. In this study, we performed computational evolutionary analyses to uncover tremendous diversity among DciA homologs. These studies provide a significant advance in our understanding of an essential component of the bacterial DNA replication machinery.

Project description:BACKGROUND: Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation. RESULTS: We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1. CONCLUSIONS: BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers. 
The method and a web-server are available at http://ekhidna.biocenter.helsinki.fi/poxo/blannotator/.
