Dataset Information

Automatic structure classification of small proteins using random forest.

ABSTRACT: BACKGROUND: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. RESULTS: Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to the different structural levels decreases. CONCLUSIONS: The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as the introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB that are yet to be assigned to the SCOP classification hierarchy.
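
The abstract reports model quality as an MCC but does not spell out how it was computed. As a rough illustration only, the sketch below uses the standard binary definition, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), applied one-vs-rest to a single label. The function names and the toy labels are assumptions made for this example and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one label treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when any marginal total is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical class labels, invented purely for illustration.
    y_true = ["a", "a", "b", "b", "c"]
    y_pred = ["a", "b", "b", "b", "c"]
    counts = confusion_counts(y_true, y_pred, positive="b")
    print(counts, round(matthews_corrcoef(*counts), 3))
```

An MCC of 1 indicates perfect prediction, 0 is no better than chance, and -1 is total disagreement, which is why it is often preferred over raw accuracy when class sizes are unbalanced.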

SUBMITTER: Jain P

PROVIDER: S-EPMC2916923 | biostudies-literature | 2010

REPOSITORIES: biostudies-literature

Publications

Automatic structure classification of small proteins using random forest.

Jain P, Hirst JD

BMC Bioinformatics, 2010-07-01

PMID: 20594334

Similar Datasets

Project description: Introduction: Group A rotaviruses are major pathogens causing severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow both efficient and accurate classification of circulating rotavirus genotypes through the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and cross-validated using either three repeats of 10-fold cross-validation or leave-one-out cross-validation. Models were then validated on unseen data from the testing datasets to observe real-world performance. Results: All models performed strongly in classifying VP7 and VP4 genotypes, with high overall accuracy and kappa values during model training (0.975-0.992, 0.970-0.989) and during model testing (0.972-0.996, 0.969-0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when the models did not need to be retrained. Models that used three repeats of 10-fold cross-validation were also much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy or kappa values between the two cross-validation methods. Discussion: Overall, random forest models showed strong performance in classifying both VP7 and VP4 genotypes of group A rotavirus. Applying these models as classifiers will allow rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available.
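
The description above reports kappa values and a speed gap between the two cross-validation schemes without giving formulas. The sketch below is a minimal illustration under stated assumptions: it computes Cohen's kappa (observed agreement corrected for chance agreement) and counts model fits per scheme, since one plausible contributor to the speed difference is that leave-one-out needs one fit per sample while three repeats of 10-fold need only 30. The function names, genotype labels, and sample size are hypothetical and not taken from the study.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case: both labelings use one identical class throughout.
        # Treated as perfect agreement here by convention (an assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


def fits_repeated_kfold(k, repeats):
    """Number of model fits for repeated k-fold cross-validation."""
    return k * repeats


def fits_leave_one_out(n_samples):
    """Number of model fits for leave-one-out cross-validation."""
    return n_samples


if __name__ == "__main__":
    # Hypothetical genotype labels, invented for illustration.
    truth = ["G1", "G1", "G2", "G2"]
    preds = ["G1", "G1", "G2", "G1"]
    print("kappa:", cohen_kappa(truth, preds))
    # Hypothetical data set size of 2000 sequences.
    print("fits:", fits_repeated_kfold(10, 3), "vs", fits_leave_one_out(2000))
```

Under this counting, the cost of leave-one-out grows linearly with the number of sequences, whereas repeated k-fold stays fixed at k times the number of repeats.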

Project description: Background: The majority of influenza A viruses reside and circulate among animal populations, seldom infecting humans because of host range restriction. Yet when some avian strains acquire the ability to overcome the species barrier, they may become adapted to humans, replicating efficiently and causing disease, potentially leading to a pandemic. Given the huge influenza A virus reservoir in wild birds, the emergence of a new influenza strain able to cross the host species barrier is a cause for concern, as shown by the recent H7N9 outbreak in China. Several influenza proteins have been shown to be major determinants of host tropism. Further understanding and determining host tropism would be important in identifying zoonotic influenza virus strains capable of crossing the species barrier and infecting humans. Results: In this study, computational models for 11 influenza proteins were constructed using the machine learning algorithm random forest to predict host tropism. The prediction models were trained on influenza protein sequences isolated from both avian and human samples, which were transformed into feature vectors of amino acid physicochemical properties. The resulting prediction models were highly accurate (ACC>96.57; AUC>0.980; MCC>0.916) and capable of determining the host tropism of individual influenza proteins. In addition, features from all 11 proteins were used to construct a combined model to predict the host tropism of influenza virus strains. This would help assess a novel influenza strain's host range capability. Conclusions: All of the prediction models constructed achieved high prediction performance, indicating clear distinctions between avian and human proteins. When the models are used together as a host tropism prediction system, zoonotic strains could potentially be identified based on different protein prediction results. Understanding and predicting host tropism of influenza proteins lays an important foundation for future work in constructing computational models capable of directly predicting interspecies transmission of influenza viruses. The models are available for prediction at http://fluleap.bic.nus.edu.sg.

Project description: Redox conditions in groundwater may markedly affect the fate and transport of nutrients, volatile organic compounds, and trace metals, with significant implications for human health. While many local assessments of redox conditions have been made, the spatial variability of redox reaction rates makes the determination of redox conditions at regional or national scales problematic. In this study, redox conditions in groundwater were predicted for the contiguous United States using random forest classification by relating measured water quality data from over 30,000 wells to natural and anthropogenic factors. The model correctly predicted the oxic/suboxic classification for 78 and 79% of the samples in the out-of-bag and hold-out data sets, respectively. Variables describing geology, hydrology, soil properties, and hydrologic position were among the most important factors affecting the likelihood of oxic conditions in groundwater. Important model variables tended to relate to aquifer recharge, groundwater travel time, or prevalence of electron donors, which are key drivers of redox conditions in groundwater. Partial dependence plots suggested that the likelihood of oxic conditions in groundwater decreased sharply as streams were approached and gradually as the depth below the water table increased. The probability of oxic groundwater increased as base flow index values increased, likely due to the prevalence of well-drained soils and geologic materials in high base flow index areas. The likelihood of oxic conditions increased as topographic wetness index (TWI) values decreased. High topographic wetness index values occur in areas with a propensity for standing water and overland flow, conditions that limit the delivery of dissolved oxygen to groundwater by recharge; higher TWI values also tend to occur in discharge areas, which may contain groundwater with long travel times. A second model was developed to predict the probability of elevated manganese (Mn) concentrations in groundwater (i.e., ≥50 μg/L). The Mn model relied on many of the same variables as the oxic/suboxic model and may be used to identify areas where Mn-reducing conditions occur and where there is an increased risk to domestic water supplies due to high Mn concentrations. Model predictions of redox conditions in groundwater produced in this study may help identify regions of the country with elevated groundwater vulnerability and stream vulnerability to groundwater-derived contaminants.
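
The description above distinguishes out-of-bag from hold-out accuracy. As a generic illustration of where out-of-bag samples come from (not the study's code), the sketch below draws one bootstrap sample of the kind each tree in a random forest is typically trained on and lists the rows that were never drawn; those rows can then score that tree without a separate test set. The function name and the sample size are assumptions made for this example.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    # Sample n_samples row indices with replacement, as for a single tree.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Rows never drawn are "out of bag" for this tree.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    n = 30000  # hypothetical row count, loosely echoing "over 30,000 wells"
    in_bag, oob = bootstrap_split(n, rng)
    print("out-of-bag fraction:", len(oob) / n)
```

For large samples the out-of-bag fraction approaches 1/e, about 37%, so each tree leaves a sizeable share of the data available for an internal accuracy estimate, while a hold-out set remains fully independent of training.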
