Dataset Information

SAG-QC: quality control of single amplified genome information by subtracting non-target sequences based on sequence compositions.

ABSTRACT:

Background

Whole genome amplification techniques have enabled the analysis of unexplored genomic information by sequencing of single-amplified genomes (SAGs). Whole genome amplification of single bacteria is currently challenging because contamination often occurs in experimental processes. Thus, to increase confidence in the analyses of sequenced SAGs, bioinformatics approaches that identify and exclude non-target sequences from SAGs are required. Since currently reported approaches rely on sequence information in public databases, they have limitations when new strains are the targets of interest. Here, we developed SAG-QC, software that identifies and excludes non-target sequences independently of any database.

Results

Our method uses "no template control" sequences acquired during whole genome amplification (WGA). We calculated the probability that a sequence was derived from contaminants by comparing its k-mer composition with that of the no template control sequences. In tests using simulated SAG datasets, our method predicted non-target sequences more accurately than currently reported techniques. Subsequently, we applied our tool to actual SAG datasets and evaluated the accuracy of the predictions.
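The idea of scoring a sequence by its k-mer composition against a no template control (NTC) can be sketched as follows. This is a simplified illustration only, not SAG-QC's published algorithm: it assumes a naive two-class multinomial model, a k-mer length of 4, a reference profile built from pooled SAG contigs, and an equal prior. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

K = 4  # assumed k-mer length, chosen for illustration


def kmer_profile(seqs, k=K, pseudocount=1.0):
    """Pool k-mer counts over sequences and return smoothed frequencies."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {km: (counts[km] + pseudocount) / total for km in all_kmers}


def log_likelihood(seq, profile, k=K):
    """Sum of log frequencies of the k-mers of seq under a profile."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(math.log(profile[km]) for km in kmers if km in profile)


def contaminant_probability(seq, ntc_profile, target_profile, prior=0.5, k=K):
    """Posterior probability that seq matches the NTC profile (assumed model)."""
    ll_ntc = log_likelihood(seq, ntc_profile, k) + math.log(prior)
    ll_tgt = log_likelihood(seq, target_profile, k) + math.log(1.0 - prior)
    m = max(ll_ntc, ll_tgt)  # subtract the maximum for numerical stability
    p_ntc = math.exp(ll_ntc - m)
    p_tgt = math.exp(ll_tgt - m)
    return p_ntc / (p_ntc + p_tgt)


# Toy data: an AT-rich "NTC" and a GC-rich "target" (both invented).
ntc = kmer_profile(["ATATATATATATATATATAT"])
target = kmer_profile(["GCGCGCGCGCGCGCGCGCGC"])
p_at = contaminant_probability("ATATATAT", ntc, target)
p_gc = contaminant_probability("GCGCGCGC", ntc, target)
```

With these toy inputs, the AT-rich query scores closer to the NTC profile than the GC-rich one. Real contigs are far longer and less distinct, so a threshold on this probability would need to be calibrated, for example on simulated datasets like those described above.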

Conclusions

Our method works independently of public sequence information for distinguishing SAGs from non-target sequences. This method will be effective when employed against SAG sequences of unexplored strains and we anticipate that it will contribute to the correct interpretation of SAGs.

SUBMITTER: Maruyama T

PROVIDER: S-EPMC5336615 | biostudies-literature | 2017 Mar

REPOSITORIES: biostudies-literature

Publications

SAG-QC: quality control of single amplified genome information by subtracting non-target sequences based on sequence compositions.

Maruyama Toru, Mori Tetsushi, Yamagishi Keisuke, Takeyama Haruko

BMC Bioinformatics, 2017-03-04, issue 1

PMID: 28259144

Similar Datasets

Project description: Background: RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to rigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.

Results: We developed a comprehensive, fast and easy-to-use QC pipeline for RNA-Seq data, RNA-QC-Chain, which involves three steps: (1) sequencing-quality assessment and trimming; (2) internal (ribosomal RNAs) and external (reads from foreign species) contamination filtering; (3) alignment statistics reporting (such as read number, alignment coverage, sequencing depth and paired-end read mapping information). This package was developed based on our previously reported tool for general QC of next-generation sequencing (NGS) data called QC-Chain, with extensions specifically designed for RNA-Seq data. It has several features that are not yet available in other QC tools for RNA-Seq data, such as RNA sequence trimming, automatic rRNA detection and automatic identification of contaminating species. The three QC steps can run either sequentially or independently, making RNA-QC-Chain a comprehensive package with high flexibility and usability. Moreover, parallel computing and optimizations are embedded in most of the QC procedures, providing superior efficiency. The performance of RNA-QC-Chain has been evaluated with different types of datasets, including an in-house sequencing dataset, a semi-simulated dataset, and two real datasets downloaded from a public database. Comparisons of RNA-QC-Chain with other QC tools have demonstrated its superiority in both functional versatility and processing speed.

Conclusions: We present here a tool, RNA-QC-Chain, which can be used to carry out the quality control of RNA-Seq data comprehensively, effectively and efficiently.

Project description: We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the United States Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for sequence discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed for these platforms and for qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings.

The study used the well-characterized reference RNA samples A (pooled cell lines) and B (human brain) from the MAQC consortium, with added spike-ins of synthetic RNA from the External RNA Control Consortium (ERCC). Samples C and D were then constructed by combining A and B in known mixing ratios, 3:1 and 1:3, respectively. All samples were distributed to several independent sites for RNA-Seq library construction and profiling by Illumina HiSeq 2000 and LifeTech SOLiD 5500 platforms. Vendors also created their own cDNA libraries, which were then distributed to each test site in order to examine the degree of a "site effect" that was independent of the library preparation process. To support an assessment of gene models, samples A and B were also sequenced at independent sites on the Roche 454 platform, providing longer reads. For comparison with other technologies, these data were also compared to the MAQC-I Affymetrix U133 Plus2 microarray and several current microarray platforms, and were assessed by 20,801 PrimePCR reactions.
