Dataset Information

SAMQA: error classification and validation of high-throughput sequenced read data.

ABSTRACT:

Background

Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data meets the minimum standard required for downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.

Results

SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Because the majority of tasks showed a linearithmic speedup on a high-performance computing (HPC) framework, poor-quality data was identified prior to secondary analysis in significantly less time than when the same data was processed using alternative parallelization strategies on a single server.
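To make per-read classification against technical standards concrete, the following is a minimal sketch of the kind of check the SAM specification allows on a single alignment line. It is not SAMQA's implementation: the function name `classify_sam_record` and the error labels are hypothetical, and only a few of the specification's rules for the mandatory fields are covered (FLAG and MAPQ ranges, CIGAR syntax, and agreement between the CIGAR, SEQ, and QUAL lengths). Header lines beginning with `@` are assumed to have been filtered out already.

```python
import re

# Illustrative rules taken from the SAM specification's mandatory fields.
# The labels and function name are hypothetical, not SAMQA's own.
CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")
SEQ_CONSUMING_OPS = set("MIS=X")  # operations that consume bases of SEQ


def _int_in_range(text, low, high):
    """True if text is a plain decimal integer within [low, high]."""
    return re.fullmatch(r"[0-9]+", text) is not None and low <= int(text) <= high


def classify_sam_record(line):
    """Return a list of error labels for one SAM alignment line (empty if none found)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["too_few_fields"]
    flag, pos, mapq, cigar = fields[1], fields[3], fields[4], fields[5]
    seq, qual = fields[9], fields[10]

    errors = []
    if not _int_in_range(flag, 0, 2**16 - 1):
        errors.append("bad_flag")
    if not _int_in_range(pos, 0, 2**31 - 1):
        errors.append("bad_pos")
    if not _int_in_range(mapq, 0, 2**8 - 1):
        errors.append("bad_mapq")

    if CIGAR_RE.fullmatch(cigar) is None:
        errors.append("bad_cigar")
    elif cigar != "*" and seq != "*":
        # The summed lengths of M/I/S/=/X operations must equal the length of SEQ.
        read_len = sum(
            int(n) for n, op in CIGAR_OP_RE.findall(cigar) if op in SEQ_CONSUMING_OPS
        )
        if read_len != len(seq):
            errors.append("cigar_seq_length_mismatch")

    if seq != "*" and qual != "*" and len(seq) != len(qual):
        errors.append("seq_qual_length_mismatch")
    return errors


good = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
bad = "r2\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGT\tIII"
print(classify_sam_record(good))
print(classify_sam_record(bad))
```

Because each line is checked independently of every other line, a check of this shape can be spread across many workers, which is the property a parallel framework needs in order to scale. Biological standards defined by researchers would need additional rules that this sketch does not attempt.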

Conclusions

The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.

SUBMITTER: Robinson T

PROVIDER: S-EPMC3170309 | biostudies-literature | 2011 Aug

REPOSITORIES: biostudies-literature

Publications

SAMQA: error classification and validation of high-throughput sequenced read data.

Robinson T, Killcoyne S, Bressler R, Boyle J

BMC Genomics, 2011-08-18

PMID: 21851633

Similar Datasets

HALC: High throughput algorithm for long read error correction.
| S-EPMC5382505 | biostudies-literature

High-Throughput Identification of Adapters in Single-Read Sequencing Data.
| S-EPMC7356586 | biostudies-literature

Statistically invalid classification of high throughput gene expression data.
| S-EPMC3551228 | biostudies-other

Identification and correction of systematic error in high-throughput sequence data.
| S-EPMC3295828 | biostudies-literature

Biofoundry-Scale DNA Assembly Validation Using Cost-Effective High-Throughput Long-Read Sequencing.
| S-EPMC10877595 | biostudies-literature

Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles.
| S-EPMC3592458 | biostudies-other

Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.
| S-EPMC2532726 | biostudies-literature

Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data.
| S-EPMC6288141 | biostudies-literature

Bazam: a rapid method for read extraction and realignment of high-throughput sequencing data.
| S-EPMC6472072 | biostudies-literature

PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets.
| S-EPMC4429651 | biostudies-literature