Dataset Information

Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes.

ABSTRACT: Unbiased high-throughput sequencing of whole metagenome shotgun DNA libraries is a promising new approach to identifying microbes in clinical specimens, which, unlike other techniques, is not limited to known sequences. Unlike most sequencing applications, it is highly sensitive to laboratory contaminants as these will appear to originate from the clinical specimens. To assess the extent and diversity of sequence contaminants, we aligned 57 "1000 Genomes Project" sequencing runs from six centers against the four largest NCBI BLAST databases, detecting reads of diverse contaminant species in all runs and identifying the most common of these contaminant genera (Bradyrhizobium) in assembled genomes from the NCBI Genome database. Many of these microorganisms have been reported as contaminants of ultrapure water systems. Studies aiming to identify novel microbes in clinical specimens will greatly benefit from not only preventive measures such as extensive UV irradiation of water and cross-validation using independent techniques, but also a concerted effort to sequence the complete genomes of common contaminants so that they may be subtracted computationally.
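The computational subtraction mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline (they aligned reads against NCBI BLAST databases); it swaps in a simple exact k-mer lookup, and every name in it (`build_contaminant_index`, `subtract_contaminants`, the `min_fraction` threshold, the toy sequences) is a hypothetical chosen for illustration. Reverse-complement matches and sequencing errors are ignored.

```python
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, str]  # (read name, nucleotide sequence)


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_contaminant_index(genomes: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer seen in the known contaminant genomes."""
    index: Set[str] = set()
    for genome in genomes:
        index |= kmers(genome, k)
    return index


def subtract_contaminants(
    reads: Iterable[Read], index: Set[str], k: int, min_fraction: float = 0.5
) -> Tuple[List[Read], List[Read]]:
    """Split reads into (kept, removed).

    A read is removed when at least min_fraction of its k-mers occur in
    the contaminant index. Reads shorter than k are kept unchanged.
    """
    kept: List[Read] = []
    removed: List[Read] = []
    for name, seq in reads:
        read_kmers = kmers(seq, k)
        if not read_kmers:
            kept.append((name, seq))
            continue
        shared = len(read_kmers & index) / len(read_kmers)
        if shared >= min_fraction:
            removed.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, removed


# Toy data (assumption): one made-up contaminant fragment and two reads.
contaminant = "ACGTACGTTAGCTAGCTAGGATCC"
reads = [
    ("r1", "ACGTACGTTAGC"),      # exact substring of the contaminant
    ("r2", "GGGGGGCCCCCCAAAA"),  # unrelated sequence
]
index = build_contaminant_index([contaminant], k=8)
kept, removed = subtract_contaminants(reads, index, k=8)
```

The sketch also shows why the abstract calls for complete contaminant genomes: a read can only be subtracted if its sequence is already represented in the index.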

SUBMITTER: Laurence M

PROVIDER: S-EPMC4023998 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes.

Laurence M, Hatzis C, Brash DE

PLoS ONE, 2014-05-16, issue 5

PMID: 24837716

Similar Datasets

Project description: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the human genome project. As the reference genome is used in probe design for microarray technology and for mapping short reads in next-generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations. To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements.
In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity. 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates.

Project description: To gain insight into the functional antibody repertoire of rabbits, the VH and VL repertoires of bone marrow (BM) and spleen (SP) of a naïve New Zealand White rabbit (NZW; Oryctolagus cuniculus) and those of lymphocytes collected from a NZW rabbit immunized (IM) with a 16-mer peptide were deep-sequenced. Two closely related genes, IGHV1S40 (VH1a3) and IGHV1S45 (VH4), were found to dominate (~90%) the VH repertoire of BM and SP, whereas IGHV1S69 (VH1a1) contributed significantly (~40%) to IM. BM and SP antibodies recombined predominantly with IGHJ4. A significant proportion (~30%) of IM sequences recombined with IGHJ2. The VK repertoire was encoded by nine IGKV genes recombined with one IGKJ gene, IGKJ1. No significant bias in the VK repertoire of the BM, SP and IM samples was observed. The complementarity-determining region (CDR)-H3 and -L3 length distributions were similar in the three samples, following a Gaussian curve with average lengths of 12.2 ± 2.4 and 11.1 ± 1.1 amino acids, respectively. The amino acid composition of the predominant CDR-H3 and -L3 loop lengths was similar to that of humans and mice, rich in Tyr, Gly, Ser and, in some specific positions, Asp. The average number of mutations along the IGHV/KV genes was similar in BM, SP and IM: close to 12 and 15 mutations for VH and VL, respectively. A monoclonal antibody specific for the peptide used as immunogen was obtained from the IM rabbit. Its CDR-H3 sequence was found in 1,559 of 61,728 (2.5%) sequences, ranking 10th among CDR-H3 sequences by frequency. Its CDR-L3 was found in 24 of 11,215 (0.2%) sequences, ranking 102nd. No match was found in the BM and SP samples, indicating positive selection for the hybridoma sequence.
Altogether, these findings lay foundations for engineering of rabbit V regions to enhance their potential as therapeutics, i.e., design of strategies for selection of specific rabbit V regions from NGS data mining, humanization and design of libraries for affinity maturation campaigns.
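The frequency and rank figures quoted for the hybridoma CDR-H3 and CDR-L3 can be reproduced in principle with a simple tally. The sketch below is an illustration only, not the study's analysis code: the function name `frequency_and_rank` and the toy CDR3 strings are hypothetical, and tied counts are assumed to share the best rank. The last two lines recompute the percentages from the counts reported above.

```python
from collections import Counter
from typing import List, Optional, Tuple


def frequency_and_rank(sequences: List[str], target: str) -> Tuple[float, Optional[int]]:
    """Return (fraction of reads equal to target, rank of target by abundance).

    Rank 1 is the most abundant distinct sequence; ties share the best rank.
    If target never occurs, the result is (0.0, None).
    """
    counts = Counter(sequences)
    n = counts.get(target, 0)
    if n == 0:
        return 0.0, None
    rank = 1 + sum(1 for c in counts.values() if c > n)
    return n / len(sequences), rank


# Toy repertoire (assumption): made-up CDR3 strings, not real data.
toy = ["ARGYSSGDL"] * 5 + ["ARDYGGSNI"] * 3 + ["ARGGYAGDL"] * 3 + ["ARSYDDYGL"]
fraction, rank = frequency_and_rank(toy, "ARDYGGSNI")

# Percentages recomputed from the counts reported in the description.
cdr_h3_percent = round(100 * 1559 / 61728, 1)
cdr_l3_percent = round(100 * 24 / 11215, 1)
```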
