
Dataset Information


Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.


ABSTRACT: Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is to calculate the identity scores (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2-80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When constructing phylogenetic trees from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to the trees built from andi, FSWM and Mash). Identity is capable of producing pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
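To make the quantity being predicted concrete, here is a minimal Python sketch of the quadratic baseline: it computes an identity score from a Needleman-Wunsch global alignment. It is not the Identity tool and not the authors' code. The function name `global_identity` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration. When several alignments are equally optimal, the traceback picks one of them, so the reported identity can depend on that choice.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identity score of one optimal global alignment of a and b.

    Identity = identical columns / total alignment columns (gaps included).
    Time and memory are O(len(a) * len(b)), i.e. quadratic.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        raise ValueError("at least one sequence must be non-empty")

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting matches and alignment columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return matches / columns


if __name__ == "__main__":
    # One gap column out of four: 3 identical columns / 4 columns.
    print(global_identity("ACGT", "ACT"))
```

The nested loops over both sequences are the quadratic cost the abstract refers to; an alignment-free predictor avoids filling this table at all.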

SUBMITTER: Girgis HZ 

PROVIDER: S-EPMC7850047 | biostudies-literature | 2021 Mar

REPOSITORIES: biostudies-literature


Publications

Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Girgis HZ, James BT, Luczak BB

NAR Genomics and Bioinformatics, 2021-02-01, issue 1



Similar Datasets

| S-EPMC7031350 | biostudies-literature
| S-EPMC9171953 | biostudies-literature
| S-EPMC3849074 | biostudies-literature
| S-EPMC7320598 | biostudies-literature
| S-EPMC9069697 | biostudies-literature
| S-EPMC2735674 | biostudies-literature
2019-02-26 | GSE120584 | GEO
| S-EPMC6916346 | biostudies-literature
| S-EPMC5528226 | biostudies-literature
| S-EPMC3066692 | biostudies-literature