Dataset Information

Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation.

ABSTRACT:

AIM AND OBJECTIVE: The rapid increase in the amount of available protein sequence data creates an urgent need for novel computational algorithms to analyze and compare these sequences. This study was undertaken to develop an efficient computational approach for the timely encoding of protein sequences and the extraction of hidden information.

METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops or multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically.

RESULTS: Using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among the β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that the method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods, with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M.

CONCLUSION: These results suggest that the generalized PseAAC model is very efficient for the comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
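The abstract reports its comparisons in terms of ACC, MCC and F1M. As a reading aid, the sketch below computes the standard textbook forms of accuracy, the Matthews correlation coefficient and the F1-measure from a binary confusion matrix. It assumes that F1M denotes the usual F1-measure. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (ACC, MCC, F1) for a binary confusion matrix.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives. Undefined ratios (zero
    denominators) are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F1 is the harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return acc, mcc, f1


# Hypothetical counts, for illustration only (not results from the paper).
acc, mcc, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"ACC={acc:.4f}  MCC={mcc:.4f}  F1={f1:.4f}")
```

ACC and F1 lie in [0, 1] and are often quoted as percentages, while MCC lies in [-1, 1], which is consistent with the abstract giving MCC improvements as plain decimals rather than percentages.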

SUBMITTER: Li C

PROVIDER: S-EPMC5930480 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature


Publications

Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation.

Li Chun, Zhao Jialing, Wang Changzhong, Yao Yuhua

Combinatorial Chemistry & High Throughput Screening, 2018-01-01, issue 2


PMID: 29380690

Similar Datasets

Project description:

Background: The structure and function of the bacterial nucleoid are controlled by histone-like proteins of the HU/IHF family, which are omnipresent in bacteria and also found in archaea and some eukaryotes. HU protein binds dsDNA without sequence specificity and avidly binds DNA structures with a propensity to bend, such as forks, three/four-way junctions, nicks, overhangs and DNA bulges. Sequence comparison of thousands of known histone-like proteins from diverse bacterial phyla reveals the relation between HU/IHF sequence, DNA-binding properties and other protein features.

Methodology and principal findings: Alignment and clustering of the protein sequences show that HU/IHF family proteins can be unambiguously divided into three groups: HU, IHF_A and IHF_B proteins. These groups are further partitioned into several clades, and this subdivision is in good agreement with bacterial taxonomy. We also analyzed a hundred comparative 3D fold models built for HU sequences from all identified HU clades. The HU fold appears to remain similar in spite of the HU sequence variations. We studied the DNA-binding properties of HU from N. gonorrhoeae, whose sequence is similar to that of E. coli HU, and of HU from M. gallisepticum and S. melliferum, whose sequences are distant from the E. coli protein. With respect to dsDNA binding, only S. melliferum HU differs essentially from E. coli HU. With respect to binding of distorted DNA structures, S. melliferum HU and E. coli HU have similar properties, which differ essentially from those of M. gallisepticum HU and N. gonorrhoeae HU. With respect to dsDNA binding, only S. melliferum HU binds DNA in a non-cooperative manner, and both mycoplasma HU proteins bend dsDNA more strongly than the E. coli and N. gonorrhoeae proteins. With respect to binding to distorted DNA structures, each HU protein has its own profile of affinities for various DNA structures, with increased specificity for DNA junctions.

Conclusions and significance: The sequence alignment and classification of HU/IHF family proteins are updated. Comparative modeling demonstrates that the HU protein 3D fold is even more conserved than the HU sequence. For the first time, the DNA-binding characteristics of HU from N. gonorrhoeae, M. gallisepticum and S. melliferum are studied. Here we provide a detailed analysis of the similarity and variability of DNA recognition and bending by four HU proteins from closely and distantly related HU clades.

