Dataset Information

A domain sequence approach to pangenomics: applications to Escherichia coli.

ABSTRACT: The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtaining such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli, we find on average around 4500 proteins with hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates that this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates that the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed to keep pace with the expected rapid growth in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
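
The idea of a domain sequence family can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input layout (a mapping from protein identifier to a list of (start position, domain accession) hits) and the toy data are all assumptions made for illustration, and a real analysis would first obtain the hits by searching each protein against Pfam-A. Proteins are grouped whenever their domains, ordered by position along the protein, form the same sequence.

```python
from collections import defaultdict


def domain_sequence(hits):
    """Return the domain accessions of one protein, ordered by start position.

    `hits` is a list of (start, accession) pairs (a hypothetical layout).
    """
    return tuple(accession for _, accession in sorted(hits))


def cluster_by_domain_sequence(protein_hits):
    """Group protein identifiers by their ordered domain sequence.

    Proteins without any domain hit are skipped, since they have no
    domain sequence to cluster on.
    """
    families = defaultdict(list)
    for protein_id, hits in protein_hits.items():
        if not hits:
            continue
        families[domain_sequence(hits)].append(protein_id)
    return dict(families)


# Toy input: identifiers and hit positions are made up for illustration.
toy_hits = {
    "genomeA|p1": [(120, "PF00005"), (10, "PF00664")],
    "genomeA|p2": [(5, "PF00005")],
    "genomeB|p1": [(15, "PF00664"), (130, "PF00005")],
    "genomeB|p2": [],
}

families = cluster_by_domain_sequence(toy_hits)
for sequence, members in sorted(families.items()):
    print(",".join(sequence), "->", sorted(members))
```

Each protein is visited once and placed by a dictionary lookup, so the work in this sketch grows in proportion to the number of proteins, in line with the linear scaling in the number of genomes described in the abstract.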

SUBMITTER: Snipen LG

PROVIDER: S-EPMC3901455 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

happi: a hierarchical approach to pangenomics inference.
| S-EPMC10540326 | biostudies-literature

Draft Genome Sequence of Escherichia coli KL53.
| S-EPMC5876489 | biostudies-literature

Complete Genome Sequence of Escherichia coli ML35.
| S-EPMC5814495 | biostudies-literature

Complete Genome Sequence of Escherichia coli BW25113.
| S-EPMC4200154 | biostudies-literature

Complete Genome Sequence of Escherichia coli NCM3722.
| S-EPMC4541272 | biostudies-literature

Production and applications of fluorobody from redox-engineered Escherichia coli.
| S-EPMC10050041 | biostudies-literature

Complete Genome Sequence of Escherichia coli Siphophage BRET.
| S-EPMC6357644 | biostudies-literature

Complete Genome Sequence of Escherichia coli Phage Pisces.
| S-EPMC6763662 | biostudies-literature

Complete Genome Sequence of Escherichia coli Siphophage Schulenberg.
| S-EPMC6763661 | biostudies-literature

Complete Genome Sequence of Escherichia coli Podophage Peacock.
| S-EPMC6763663 | biostudies-literature