
Dataset Information

Testing statistical significance scores of sequence comparison methods with structure similarity.


ABSTRACT:

BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance test applied to each alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.

RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular achieves very high scores.

CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
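To make the Monte-Carlo Z-score mentioned in the abstract concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes a simplified scoring scheme (fixed match/mismatch values and a linear gap penalty instead of a substitution matrix with affine gaps), shuffles only the subject sequence, and uses 100 shuffles by default. The function names `smith_waterman_score` and `z_score` and all parameter values are illustrative choices.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (simplified: linear gap penalty,
    fixed match/mismatch scores; these values are assumptions)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(query, subject, n_shuffles=100, seed=0):
    """Z = (observed - mean of shuffled scores) / stdev of shuffled scores.

    The null distribution is estimated by shuffling the subject sequence,
    which preserves its length and composition (an assumed protocol).
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(query, subject)
    letters = list(subject)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(smith_waterman_score(query, "".join(letters)))
    mean = statistics.mean(null_scores)
    stdev = statistics.stdev(null_scores)
    if stdev == 0:
        # Degenerate null distribution: no spread to normalise by.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / stdev


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    print(z_score(seq, seq))
```

Each Z-score requires one alignment per shuffle in addition to the original alignment, which illustrates why the abstract calls this statistic compute-intensive compared with an analytically estimated e-value.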

SUBMITTER: Hulsen T 

PROVIDER: S-EPMC1618413 | biostudies-other | 2006

REPOSITORIES: biostudies-other

Publications

Testing statistical significance scores of sequence comparison methods with structure similarity.

Hulsen T, de Vlieg J, Leunissen JAM, Groenen PMA

BMC Bioinformatics, 2006-10-12



Similar Datasets

| S-EPMC34495 | biostudies-literature
| S-EPMC516024 | biostudies-literature
| S-EPMC6348690 | biostudies-literature
| S-EPMC2760442 | biostudies-literature
| S-EPMC6472439 | biostudies-literature
| S-EPMC2151544 | biostudies-literature
| S-EPMC4990825 | biostudies-other
| S-EPMC1569881 | biostudies-literature
| S-EPMC4304842 | biostudies-literature
| S-EPMC6676798 | biostudies-literature