Dataset Information

Testing statistical significance scores of sequence comparison methods with structure similarity.

ABSTRACT: BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
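As a rough illustration of the Monte-Carlo Z-score mentioned in the abstract, the sketch below scores a pair of sequences with a simplified Smith-Waterman routine, rescoring the query against shuffled copies of the subject and expressing the observed score in standard deviations above the shuffled mean. Everything here is an assumption made for illustration: the function names, the match/mismatch/linear-gap scores (real implementations typically use a substitution matrix and affine gaps), the number of shuffles, and the choice to shuffle only the subject. It is not the procedure used by CluSTr, Protein World, or the study itself.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    The scoring values are illustrative placeholders, not a real
    substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_score(query, subject, n_shuffles=100, seed=0):
    """Monte-Carlo Z-score: (observed - mean of shuffled) / stdev of shuffled."""
    rng = random.Random(seed)
    observed = sw_score(query, subject)
    residues = list(subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        # Shuffling keeps the composition but destroys the residue order.
        rng.shuffle(residues)
        shuffled_scores.append(sw_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd
```

Each Z-score costs one alignment per shuffle in addition to the original alignment, which is why the abstract describes the method as compute-intensive compared with an e-value derived from a single alignment score.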

SUBMITTER: Hulsen T

PROVIDER: S-EPMC1618413 | biostudies-other | 2006

REPOSITORIES: biostudies-other

Publications

Testing statistical significance scores of sequence comparison methods with structure similarity.

Tim Hulsen, Jacob de Vlieg, Jack A M Leunissen, Peter M A Groenen

BMC Bioinformatics, 2006-10-12

PMID: 17038163

Similar Datasets

Project description: Background: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, current approaches for predicting function must be improved upon. One problem in particular is overly-specific function predictions which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. Methodology: Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. Significance: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value >1e-62, non-redundant NCBI protein database (NRDB)) and only small likelihood of specific function match for proteins with low sequence similarity (bit score <54.6, e-value <1e-05, NRDB). For sequence similarity ranges in between our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function, and predicted functions for thousands of these proteins ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to predictions from our model that is based on proteins with experimentally determined function.

Project description: The prediction of biological targets of bioactive molecules from machine-readable materials can be routinely performed by computational target prediction tools (CTPTs). However, the prediction of biological targets of bioactive molecules from non-digital materials (e.g., printed or handwritten documents) has not been possible due to the complex nature of bioactive molecules and the impossibility of employing computations. Improving the target prediction accuracy is the most important challenge for computational target prediction. In the proposed method, a minimum structure is identified for each group of neighbor molecules. Each group of neighbor molecules represents a distinct structural class of molecules with the same function in relation to the target. The minimum structure is employed as a query to search for molecules that perfectly satisfy the minimum structure of what is presumed to be crucial for the targeted activity. The proposed method is based on chemical similarity, but only molecules that perfectly satisfy the minimum structure are considered. Structurally related bioactive molecules found with the same minimum structure were considered as neighbor molecules of the query molecule. The known target of the neighbor molecule is used as a reference for predicting the target of the neighbor molecule with an unknown target. A lot of information is needed to identify the minimum structure, because it is necessary to know which part(s) of the bioactive molecule determines the precise target or targets responsible for the observed phenotype. Therefore, the predicted target based on the minimum structure without employing the statistical significance is considered as a reliable prediction. Since only molecules that perfectly (and not partly) satisfy the minimum structure are considered, the minimum structure can be used without similarity calculations in non-digital materials and with similarity calculations (perfect similarity) in machine-readable materials. Nine tools that can be used for computational target prediction (PASS online, PPB, SEA, TargetHunter, PharmMapper, ChemProt, HitPick, SuperPred, and SPiDER) are compared with the proposed method for 550 target predictions. The proposed method, SEA, PPB, and PASS online showed the best quality and quantity for the accurate predictions.

