Dataset Information

Human contamination in bacterial genomes has created thousands of spurious proteins.

ABSTRACT: Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
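The abstract's closing suggestion, dropping small contigs from draft assemblies, can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`parse_fasta`, `filter_small_contigs`), the toy sequences, and the `MIN_LENGTH` cutoff are all assumptions made for illustration, and the abstract itself states no specific length threshold.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    records: Iterable[Record], min_length: int
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, dropped) by sequence length."""
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            kept.append((header, seq))
        else:
            dropped.append((header, seq))
    return kept, dropped


# Hypothetical toy assembly; real contigs are far longer.
example = """>contig_1
ACGTACGTACGTACGTACGT
>contig_2
ACGT
>contig_3
ACGTACGTAC
""".splitlines()

MIN_LENGTH = 10  # assumed cutoff for illustration only; not from the abstract

kept, dropped = filter_small_contigs(parse_fasta(example), MIN_LENGTH)
kept_bases = sum(len(seq) for _, seq in kept)
total_bases = kept_bases + sum(len(seq) for _, seq in dropped)

print("kept:", [header for header, _ in kept])
print("dropped:", [header for header, _ in dropped])
print("fraction of bases retained:", kept_bases / total_bases)
```

Reporting the fraction of bases retained alongside the dropped contigs gives a quick check on the trade-off the abstract describes: removing likely contaminants while keeping nearly all of the genuine sequence.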

SUBMITTER: Breitwieser FP

PROVIDER: S-EPMC6581058 | biostudies-literature | 2019 Jun

REPOSITORIES: biostudies-literature

Publications

Human contamination in bacterial genomes has created thousands of spurious proteins.

Breitwieser FP, Pertea M, Zimin AV, Salzberg SL

Genome Research, 2019-05-07, issue 6

PMID: 31064768

Similar Datasets

Thousands of missed genes found in bacterial genomes and their analysis with COMBREX. | S-EPMC3534567 | biostudies-literature

Butler enables rapid cloud-based analysis of thousands of human genomes. | S-EPMC7062635 | biostudies-literature

Genotype imputation with thousands of genomes. | S-EPMC3276165 | biostudies-literature

Transposable elements have contributed to thousands of human proteins. | S-EPMC1413650 | biostudies-literature

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. | S-EPMC10311346 | biostudies-literature

Thousands of Qatari genomes inform human migration history and improve imputation of Arab haplotypes. | S-EPMC8511259 | biostudies-literature

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. | S-EPMC5287235 | biostudies-literature

Sequencing thousands of single-cell genomes with combinatorial indexing. | S-EPMC5908213 | biostudies-literature

Thousands of Novel Endolysins Discovered in Uncultured Phage Genomes. | S-EPMC5968864 | biostudies-literature

IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes. | S-EPMC5210574 | biostudies-literature