Dataset Information

TreeCluster: Clustering biological sequences using phylogenetic trees.

ABSTRACT: Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
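
To make the first constraint concrete, here is a minimal Python sketch of a greedy bottom-up pass for the diameter criterion: whenever the two longest root-to-leaf arms meeting at a node together exceed the threshold, the longest arm is cut off. The `Node` class and the function name `max_diameter_clusters` are hypothetical names chosen for this illustration and are not taken from TreeCluster; the code is written for readability (recursion, per-node sorting) and should not be read as the tool's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""                 # leaf label (unused for internal nodes)
    length: float = 0.0            # length of the edge to the parent
    children: list = field(default_factory=list)


def max_diameter_clusters(root, threshold):
    """Greedily cut edges so that, in every resulting cluster, the
    tree distance between any two leaves is at most `threshold`."""
    cut = set()  # ids of nodes whose edge to the parent is removed

    def farthest(node):
        # Distance from `node` to the farthest leaf still attached below it.
        if not node.children:
            return 0.0
        # Length of the longest attached path through each child edge.
        arms = sorted(((farthest(c) + c.length, id(c)) for c in node.children),
                      reverse=True)
        i = 0
        # While the two longest arms violate the threshold, cut the longest.
        while i + 1 < len(arms) and arms[i][0] + arms[i + 1][0] > threshold:
            cut.add(arms[i][1])
            i += 1
        return arms[i][0]

    farthest(root)

    clusters = []

    def collect(node, current):
        # Second pass: every cut edge starts a new cluster of leaves.
        if not node.children:
            current.append(node.name)
        for child in node.children:
            if id(child) in cut:
                fresh = []
                clusters.append(fresh)
                collect(child, fresh)
            else:
                collect(child, current)

    top = []
    clusters.append(top)
    collect(root, top)
    return clusters


# Example tree, in Newick notation: ((A:1,B:1):1,(C:1,D:1):5);
tree = Node(children=[
    Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node(length=5.0, children=[Node("C", 1.0), Node("D", 1.0)]),
])
print(max_diameter_clusters(tree, 3.0))
```

With a threshold of 3.0, the long branch above C and D is cut, so the leaves fall into two clusters, {A, B} and {C, D}. The number of clusters is always the number of cut edges plus one.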

SUBMITTER: Balaban M

PROVIDER: S-EPMC6705769 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

TreeCluster: Clustering biological sequences using phylogenetic trees.

Balaban M, Moshiri N, Mai U, Jia X, Mirarab S

PLoS One. 2019-08-22; issue 8

PMID: 31437182

Similar Datasets

AncestralClust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees. | S-EPMC8756197 | biostudies-literature

Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees. | S-EPMC7518125 | biostudies-literature

Building Phylogenetic Trees From Genome Sequences With kSNP4. | S-EPMC10640685 | biostudies-literature

FAVITES: simultaneous simulation of transmission networks, phylogenetic trees and sequences. | S-EPMC6931354 | biostudies-literature

Constructing phylogenetic trees using interacting pathways. | S-EPMC3669789 | biostudies-literature

Clustering biological sequences with dynamic sequence similarity threshold. | S-EPMC8969259 | biostudies-literature

Accurately clustering biological sequences in linear time by relatedness sorting. | S-EPMC11001989 | biostudies-literature

Visualizing incompatibilities in phylogenetic trees using consensus outlines. | S-EPMC10267324 | biostudies-literature

An Integrative Database of β-Lactamase Enzymes: Sequences, Structures, Functions, and Phylogenetic Trees. | S-EPMC6496087 | biostudies-literature

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences. | S-EPMC1162494 | biostudies-other