
Dataset Information

Testing statistical significance scores of sequence comparison methods with structure similarity.


ABSTRACT:

Background

In recent years, the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing of each alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database were created using an alternative statistical significance test: a Z-score based on Monte Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.
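To make the idea of a Monte Carlo Z-score concrete, the following is a minimal sketch, not the procedure used by CluSTr, Protein World, or the study itself. It assumes the common formulation in which one sequence is shuffled repeatedly and the observed alignment score is compared with the mean and standard deviation of the shuffled scores. The function names, the match/mismatch/gap values, and the number of shuffles are all illustrative assumptions; a real search would use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    The scoring values are illustrative assumptions, not those of any
    particular implementation.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def z_score(a, b, n_shuffles=100, seed=0):
    """Monte Carlo Z-score: (observed - mean of shuffled) / sd of shuffled."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case (no spread among shuffles); treated as
        # non-significant in this sketch.
        return 0.0
    return (observed - mean) / sd
```

Each Z-score requires `n_shuffles` additional alignments, which illustrates why the Z-score is considerably more expensive to compute than an e-value derived from a single alignment score.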

Results

All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as references. We find that two out of three Smith-Waterman implementations with e-values are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-scores. SSEARCH in particular scores very highly.
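As an illustration of how ranking measures of this kind work, the sketch below computes two of them for a single query, reading ROC and AP as receiver operating characteristic and average precision. It uses the standard textbook definitions, which may differ in detail from those applied in the study. The input is assumed to be a list of booleans for the hits of one query, ordered from most to least significant, where `True` marks a hit that is structurally related to the query; the function names are hypothetical.

```python
def average_precision(labels):
    """Mean of the precision values at each rank where a related hit occurs.

    labels: booleans for hits ranked best-first (True = structurally related).
    """
    hits = 0
    total = 0.0
    for rank, related in enumerate(labels, start=1):
        if related:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def roc_auc(labels):
    """Area under the ROC curve for a best-first ranked list of booleans."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return 0.0  # undefined without both classes; 0.0 is a placeholder
    true_positives = 0
    area = 0
    for related in labels:
        if related:
            true_positives += 1
        else:
            # Each unrelated hit adds a column whose height is the number
            # of related hits ranked above it.
            area += true_positives
    return area / (pos * neg)
```

For example, `average_precision([True, False, True])` averages a precision of 1/1 at rank 1 and 2/3 at rank 3. A statistic that ranks related proteins ahead of unrelated ones more consistently yields higher values on both measures.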

Conclusion

The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.

SUBMITTER: Hulsen T 

PROVIDER: S-EPMC1618413 | biostudies-literature | 2006 Oct

REPOSITORIES: biostudies-literature

Publications

Testing statistical significance scores of sequence comparison methods with structure similarity.

Tim Hulsen, Jacob de Vlieg, Jack A M Leunissen, Peter M A Groenen

BMC Bioinformatics, 2006-10-12


Similar Datasets

S-EPMC34495 | biostudies-literature
S-EPMC516024 | biostudies-literature
S-EPMC6348690 | biostudies-literature
S-EPMC2760442 | biostudies-literature
S-EPMC6472439 | biostudies-literature
S-EPMC2151544 | biostudies-literature
S-EPMC4990825 | biostudies-other
S-EPMC1569881 | biostudies-literature
S-EPMC4304842 | biostudies-literature
S-EPMC6676798 | biostudies-literature