Dataset Information

OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

ABSTRACT:

Motivation

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
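The hemoglobin example can be made concrete with a small sketch. Everything in it is an illustrative assumption: the sequences are made up (they are not real hemoglobins), the tree is hand-written, and the query's position in the tree is taken as known rather than inferred. It only shows how the closest sequence and the tree can disagree about the subfamily.

```python
# Made-up, equal-length toy sequences (not real hemoglobin sequences).
REFERENCES = {"HBA_toy": "AAAAKKKKLL", "HBB_toy": "AAAATTTTLL"}
QUERY = "AAAAKKGGLL"

# Hypothetical gene tree stored as child -> parent links.
# The query lineage branches off before the alpha/beta duplication.
PARENT = {
    "hemoglobin": None,
    "alpha/beta duplication": "hemoglobin",
    "alpha": "alpha/beta duplication",
    "beta": "alpha/beta duplication",
    "HBA_toy": "alpha",
    "HBB_toy": "beta",
    "query": "hemoglobin",
}
# Nodes that count as named (sub)families in this toy example.
SUBFAMILIES = {"hemoglobin", "alpha", "beta"}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def subfamily_of(leaf):
    """Walk up from a leaf to the first named (sub)family above it."""
    node = PARENT[leaf]
    while node not in SUBFAMILIES:
        node = PARENT[node]
    return node


# Closest-sequence approach: copy the subfamily of the most similar reference.
closest = min(REFERENCES, key=lambda name: hamming(QUERY, REFERENCES[name]))
closest_subfamily = subfamily_of(closest)

# Tree-aware approach: use the query's own position in the gene tree.
tree_subfamily = subfamily_of("query")

print("closest sequence:", closest, "->", closest_subfamily)
print("tree placement  :", tree_subfamily)
```

In this toy setting the closest-sequence label is over-specific: the query is most similar to the alpha reference, yet in the tree it sits outside both the alpha and the beta clade.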

Results

Here, we first show that in multiple animal and plant datasets, 18 to 62% of closest-sequence assignments are incorrect, typically because the query is assigned to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer uses evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, while able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
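The general idea of mapping k-mers to ancestral subfamilies can be illustrated with a minimal sketch. This is not the OMAmer implementation or its API: the hierarchy, the sequences, the function names (`build_index`, `assign`) and the `min_child_hits` threshold are all hypothetical. Each k-mer is stored at the most specific subfamily that contains all of its occurrences, and a query is pushed down the hierarchy only while one child subfamily clearly has enough k-mer support.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: subfamily -> parent (None marks the family root).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}

# Made-up reference sequences labelled with their most specific subfamily.
REFERENCES = {
    "alpha_1": ("alpha", "MVLSPADKTNVKAAW"),
    "alpha_2": ("alpha", "MVLSGEDKSNIKAAW"),
    "beta_1": ("beta", "MVHLTPEEKSAVTALW"),
    "beta_2": ("beta", "MVHLTDAEKAAVSCLW"),
}


def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lineage(node):
    """Path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two subfamilies in the toy hierarchy."""
    ancestors = set(lineage(a))
    return next(n for n in lineage(b) if n in ancestors)


def build_index(refs, k=3):
    """Map each k-mer to the most specific subfamily covering all its occurrences."""
    index = {}
    for subfamily, seq in refs.values():
        for km in kmers(seq, k):
            index[km] = lca(index[km], subfamily) if km in index else subfamily
    return index


def assign(query, index, k=3, min_child_hits=2):
    """Return a subfamily for the query, or None if no k-mer matches at all."""
    hits = defaultdict(int)
    for km in kmers(query, k):
        if km in index:
            hits[index[km]] += 1
    if not hits:
        return None  # nothing in common with the family: reject the query
    node = next(n for n, p in PARENT.items() if p is None)  # start at the root
    while True:
        children = sorted((c for c, p in PARENT.items() if p == node),
                          key=lambda c: hits[c], reverse=True)
        if not children or hits[children[0]] < min_child_hits:
            return node  # too little evidence to be more specific
        if len(children) > 1 and hits[children[1]] == hits[children[0]]:
            return node  # ambiguous between children: stay at the ancestor
        node = children[0]


INDEX = build_index(REFERENCES)
```

With this toy index, `assign("MVLSPADKTNVKAAW", INDEX)` descends to `"alpha"`, a query sharing only the k-mer `KAA` (found in both alpha and beta references) stays at the ancestral `"globin"` level, and a query with no indexed k-mer is rejected with `None`.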

Availability

OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Rossier V

PROVIDER: S-EPMC8479680 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Sequence comparison alignment-free approach based on suffix tree and L-words frequency. | S-EPMC3444837 | biostudies-literature

Instant spectral assignment for advanced decision tree-driven mass spectrometry. | S-EPMC3365209 | biostudies-literature

Multiple alignment-free sequence comparison. | S-EPMC3799466 | biostudies-literature

CAFE: aCcelerated Alignment-FrEe sequence analysis. | S-EPMC5793812 | biostudies-literature

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. | S-EPMC4410667 | biostudies-literature

Benchmarking of alignment-free sequence comparison methods. | S-EPMC6659240 | biostudies-literature

Alignment-free genome tree inference by learning group-specific distance metrics. | S-EPMC3762195 | biostudies-literature

Alignment-free sequence comparison (I): statistics and power. | S-EPMC2818754 | biostudies-literature

Alignment-free sequence comparison: benefits, applications, and tools. | S-EPMC5627421 | biostudies-literature

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction. | S-EPMC10990684 | biostudies-literature