Dataset Information

A benchmark study of sequence alignment methods for protein clustering.


ABSTRACT:

Background

Protein sequence alignment analyses have become a crucial step in many bioinformatics studies over the past decades. Multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) are the two major approaches to sequence alignment. Previous benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also affect protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and it evaluates alignment quality by how well the biological goal is achieved rather than by sequence-level comparison alone, which avoids the biases introduced by alignment scores or manual alignment templates. Unlike previous studies, we calculate the cluster validity score from sequence distances instead of clustering results. This strategy can avoid the influence of different clustering methods and thus makes the results more dependable.
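To illustrate what scoring cluster validity from sequence distances can look like, the following minimal sketch computes a mean silhouette width directly from a pairwise distance matrix and known family labels, with no clustering step. The abstract does not name the validity index used, so the silhouette index, the function name `silhouette_from_distances`, and the toy distances are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def silhouette_from_distances(
    dist: Sequence[Sequence[float]], labels: Sequence[str]
) -> float:
    """Mean silhouette width from a symmetric distance matrix and reference labels.

    Assumption: the silhouette index stands in for the (unnamed) cluster
    validity score; labels are the ground-truth protein families.
    """
    groups: Dict[str, List[int]] = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    if len(groups) < 2:
        raise ValueError("at least two reference groups are required")

    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # common convention: a singleton contributes 0
        # a: mean distance to the other members of the same family
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other family
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy alignment-derived distances for four sequences (hypothetical values).
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
score_true = silhouette_from_distances(toy_dist, ["A", "A", "B", "B"])
score_mixed = silhouette_from_distances(toy_dist, ["A", "B", "A", "B"])
```

Under this sketch, an alignment method whose distances separate the true families well yields a higher score than one whose distances mix them, and the score does not depend on the choice of clustering algorithm.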

Results

Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses of the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results.
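The re-sampling step described above can be sketched as follows. This is a minimal illustration, assuming sampling without replacement and rounding to the nearest whole sequence; the function name `resample_subsets`, the `seed` parameter, and the placeholder sequence identifiers are hypothetical and not taken from the study.

```python
import random
from typing import List, Sequence


def resample_subsets(
    items: Sequence[str], fraction: float = 0.9, repeats: int = 10, seed: int = 0
) -> List[List[str]]:
    """Draw `repeats` random subsets, each holding `fraction` of `items`.

    Assumption: sampling is without replacement within each subset.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = round(len(items) * fraction)
    return [rng.sample(list(items), k) for _ in range(repeats)]


# Placeholder identifiers standing in for the sequences of one benchmark dataset.
sequences = [f"seq{i}" for i in range(20)]
subsets = resample_subsets(sequences)
```

Applying a routine like this to each benchmark dataset gives 10 re-sampled datasets per original, which is consistent with the total of 80 if eight source datasets were used; the abstract itself does not state that count.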

Conclusions

These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of the results.

SUBMITTER: Wang Y 

PROVIDER: S-EPMC6311937 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC3371027 | biostudies-literature
| S-EPMC1635699 | biostudies-literature
| S-EPMC2374782 | biostudies-literature
| S-EPMC280650 | biostudies-literature
| S-EPMC8289385 | biostudies-literature
| S-EPMC1087786 | biostudies-literature
| S-EPMC6472439 | biostudies-literature
| S-EPMC11312151 | biostudies-literature
2014-02-04 | E-GEOD-53422 | biostudies-arrayexpress
| S-EPMC6918801 | biostudies-literature