Dataset Information

Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive.

ABSTRACT: It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the number of results to a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.
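The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes each experiment has already been summarized into a small record; the field names (`mean_base_quality`, `total_reads`), the accessions, the values, and the cut-offs are all hypothetical and are not taken from the published data set or from FastQC's own output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentQuality:
    """Hypothetical per-experiment quality summary (not the published schema)."""
    experiment_id: str        # placeholder accession, for illustration only
    mean_base_quality: float  # assumed mean Phred score across all reads
    total_reads: int          # assumed total number of reads


def select_for_reanalysis(records, min_quality=30.0, min_reads=1_000_000):
    """Keep only experiments that meet both (assumed) thresholds."""
    return [
        r for r in records
        if r.mean_base_quality >= min_quality and r.total_reads >= min_reads
    ]


# Made-up example records; the values are illustrative only.
records = [
    ExperimentQuality("EXP0001", 35.2, 12_000_000),
    ExperimentQuality("EXP0002", 21.7, 8_500_000),
    ExperimentQuality("EXP0003", 33.9, 400_000),
]

selected = select_for_reanalysis(records)
print([r.experiment_id for r in selected])
```

Relaxing either threshold widens the subset; for example, `select_for_reanalysis(records, min_reads=0)` would also keep the third record.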

SUBMITTER: Ohta T

PROVIDER: S-EPMC5459929 | biostudies-literature | 2017 Jun

REPOSITORIES: biostudies-literature

Publications

Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive.

Tazro Ohta, Takeru Nakazato, Hidemasa Bono

GigaScience, 2017 Jun 1; 6

PMID: 28449062

Similar Datasets

Experimental design-based functional mining and characterization of high-throughput sequencing data in the sequence read archive.
| S-EPMC3805581 | biostudies-literature

Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive.
| S-EPMC6415672 | biostudies-literature

VARUS: sampling complementary RNA reads from the sequence read archive.
| S-EPMC6842140 | biostudies-literature

MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.
| S-EPMC5870770 | biostudies-literature

Remapping the SRA: Drosophila melanogaster RNA-Seq data from the Sequence Read Archive
2018-07-18 | GSE117217 | GEO

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.
| S-EPMC7445559 | biostudies-literature

STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions.
| S-EPMC8450716 | biostudies-literature

PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive.
| S-EPMC5860118 | biostudies-literature

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive.
| S-EPMC6505635 | biostudies-literature