Dataset Information

Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.

ABSTRACT: Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
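The two quantities the abstract relies on, pairwise k-mer Jaccard similarity and sequence space coverage, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `coverage`) are hypothetical, k-mers are taken from the forward strand only (many tools use canonical k-mers instead), and windows containing characters other than A, C, G, T are skipped as an assumption.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq (forward strand only)."""
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption: skip windows with ambiguous bases such as N.
        if set(word) <= valid:
            kmers.add(word)
    return kmers


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def coverage(kmers: set[str], k: int) -> float:
    """Fraction of the 4**k possible k-mers that are present."""
    return len(kmers) / 4 ** k


# Toy sequences, far shorter than real genomes.
a = kmer_set("ACGTAC", 3)
b = kmer_set("ACGTTT", 3)
similarity = jaccard(a, b)
space_covered = coverage(a, 3)
```

For real genomes at k = 21 or 31, exact sets become memory-intensive, so sketching approaches such as MinHash are commonly used to approximate the same Jaccard value.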

SUBMITTER: Bussi Y

PROVIDER: S-EPMC8516232 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature


Publications

Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.

Yuval Bussi, Ruti Kapon, Ziv Reich

PLoS One, 2021-10-14, issue 10

PMID: 34648558

Similar Datasets

Project description: Most species of Papilionidae are large and beautiful ornamental butterflies. They are recognized as model organisms in ecology, evolutionary biology, genetics, and conservation biology but present numerous unresolved phylogenetic problems. Complete mitochondrial genomes (mitogenomes) have been widely used in phylogenetic studies of butterflies, but mitogenome knowledge within the family Papilionidae is limited, and its phylogeny is far from resolved. In this study, we first report the mitogenome of Byasa confusa from the subfamily Papilioninae of Papilionidae. The mitogenome of B. confusa is 15,135 bp in length and contains 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes, and an AT-rich control region (CR), closely mirroring the genomic structure observed in related butterfly species. Comparative analysis of 77 Papilionidae mitogenomes shows gene composition and order to be identical to those of an ancestral insect, and the AT bias, Ka/Ks, and relative synonymous codon usage (RSCU) are all consistent with those of other reported butterfly mitogenomes. We conducted phylogenetic analyses using maximum-likelihood (ML) and Bayesian-inference (BI) methods, with 77 Papilionidae species as ingroups and two species of Nymphalidae and Lycaenidae as outgroups. The phylogenetic analysis indicated that B. confusa clustered within Byasa. The phylogenetic trees show the monophyly of the subfamily Papilioninae and the tribes Leptocircini, Papilionini, and Troidini. The data supported the following tribe-level relationships within Papilioninae: (((Troidini + Papilionini) + Teinopalpini) + Leptocircini). The divergence time analysis suggests that Papilionidae originated in the Late Cretaceous. Overall, using the largest number of Papilionidae mitogenomes sequenced to date, and providing the first phylogenetic analysis of Papilionidae to include four subfamilies, this study comprehensively reveals the mitogenome characteristics and mitogenome-based phylogeny, providing information for further studies on the mitogenome, phylogeny, evolution, and taxonomic revision of the Papilionidae family.
