Dataset Information

Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data.

ABSTRACT:

Background

Illumina's sequencing platforms are currently the most utilised sequencing systems worldwide. The technology has evolved rapidly over recent years and provides high throughput at low cost, with increasing read lengths and true paired-end reads. However, data from any sequencing technology contain noise, and our understanding of the peculiarities and sequencing errors encountered in Illumina data has lagged behind this rapid development.

Results

We conducted a systematic investigation of errors and biases in Illumina data based on the largest collection of in vitro metagenomic data sets to date. We evaluated the Genome Analyzer II, HiSeq and MiSeq and tested state-of-the-art low input library preparation methods. Analysing in vitro metagenomic sequencing data allowed us to determine biases directly associated with the actual sequencing process. The position- and nucleotide-specific analysis revealed a substantial bias related to motifs (3mers preceding errors) ending in "GG". On average, the top three motifs were linked to 16 % of all substitution errors. Furthermore, a preferential incorporation of ddGTPs was recorded. We hypothesise that all of these biases are related to the engineered polymerase and ddNTPs, which are intrinsic to any sequencing-by-synthesis method. We show that quality-score-based error removal strategies can on average remove 69 % of the substitution errors; however, the motif bias remains.
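The motif analysis described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `motif_counts`, the choice to take the 3mer from the reference (rather than from the read), the assumption of an indel-free alignment, and the optional quality threshold are all illustrative assumptions.

```python
from collections import Counter


def motif_counts(reference, read, qualities=None, min_quality=0, k=3):
    """Count the k-mers that immediately precede substitution errors.

    Assumptions (not taken from the paper): the read is already aligned
    to the reference without indels, so index i in the read corresponds
    to index i in the reference, and the motif is read off the reference.
    """
    if len(reference) != len(read):
        raise ValueError("read and reference must have equal length")
    counts = Counter()
    # Start at k so that a full k-mer always precedes position i.
    for i in range(k, len(read)):
        # Skip bases that a quality-score filter would have discarded.
        if qualities is not None and qualities[i] < min_quality:
            continue
        if read[i] != reference[i]:
            counts[reference[i - k:i]] += 1
    return counts


# Toy example: two substitutions, both preceded by the motif "CGG".
reference = "ACGGTTCGGAAT"
read = "ACGGATCGGCAT"
print(motif_counts(reference, read))
```

Passing per-base Phred scores as `qualities` together with a `min_quality` cut-off gives a rough way to compare motif counts before and after quality filtering on data of one's own.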

Conclusion

Single-nucleotide polymorphism changes in bacterial genomes can cause significant changes in phenotype, including antibiotic resistance and virulence; detecting them within metagenomes is therefore vital. Current error removal techniques are not designed to target the peculiarities encountered in Illumina sequencing data and other sequencing-by-synthesis methods, causing biases to persist and potentially affect any conclusions drawn from the data. In order to develop effective diagnostic and therapeutic approaches, we need to be able to identify systematic sequencing errors and distinguish these errors from true genetic variation.

SUBMITTER: Schirmer M

PROVIDER: S-EPMC4787001 | biostudies-literature | 2016 Mar

REPOSITORIES: biostudies-literature

Publications

Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data.

Melanie Schirmer, Rosalinda D'Amore, Umer Z. Ijaz, Neil Hall, Christopher Quince

BMC Bioinformatics, 2016 Mar 11

PMID: 26968756

Similar Datasets

Project description: Genetic variation amongst individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single-nucleotide changes. In this manuscript we explore variation on an intermediate scale, particularly insertions, deletions, and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number among individuals. Sequencing of a subset of structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation, an important standard for genotyping platforms and a prelude to future individual genome sequencing projects.

Keywords: comparative genomic hybridization

The DNA samples are a panel of eight HapMap samples, described by E. Eichler et al. (2007, Nature 447, 161-165). This set of seven female samples and one male sample is from the Coriell Cell Repository and comprises samples from four populations: four Yoruban, two CEPH, one Chinese, and one Japanese. The reference sample, NA15510, is female and also from the Coriell Cell Repository. This sample has been extensively characterized (for example in Tuzan et al. 2005, Nature Genetics 10, p1038) and has been recommended for use in CNV detection programs to allow meaningful comparison of data between studies (discussed in Scherer et al. 2007, Nature Genetics Supplement 39: S7-S15). Each of these samples was hybridized in pairs with reversed labeling polarities. Additionally, three self-self control hybridizations were carried out for the reference sample, NA15510, one on each hybridization date.
