Dataset Information

Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive.

ABSTRACT: The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP. Germline variants were detected in a median of 912 sequencing runs and somatic variants in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.

SUBMITTER: Tsui B

PROVIDER: S-EPMC6415672 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive.

Tsui B, Dow M, Skola D, Carter H

Pacific Symposium on Biocomputing. 2019-01-01

PMID: 30864322

Similar Datasets

MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.
S-EPMC5870770 | biostudies-literature

VARUS: sampling complementary RNA reads from the sequence read archive.
S-EPMC6842140 | biostudies-literature

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive.
S-EPMC6505635 | biostudies-literature

Experimental design-based functional mining and characterization of high-throughput sequencing data in the sequence read archive.
S-EPMC3805581 | biostudies-literature

Remapping the SRA: Drosophila melanogaster RNA-Seq data from the Sequence Read Archive
2018-07-18 | GSE117217 | GEO

Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive.
S-EPMC5203714 | biostudies-literature

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.
S-EPMC7445559 | biostudies-literature

Copy number variation detection using next generation sequencing read counts.
S-EPMC4021345 | biostudies-literature

Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive.
S-EPMC5459929 | biostudies-literature

Pacybara: accurate long-read sequencing for barcoded mutagenized allelic libraries.
S-EPMC11021806 | biostudies-literature