Dataset Information

Strand-seq enables reliable separation of long reads by chromosome via expectation maximization.

ABSTRACT:

Motivation

Current sequencing technologies can produce reads that are orders of magnitude longer than was previously possible. Such long reads have sparked new interest in de novo genome assembly, which removes the reference biases inherent to re-sequencing approaches and allows direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.

Results

We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. This makes it possible to assess, at the level of individual reads, the uncertainty inherent to sparse Strand-seq data. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
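The idea of a latent variable model fitted by Expectation Maximization can be illustrated with a small sketch. This is not the SaaRclust implementation (which is available at the URL below); it is a simplified toy model written under several assumptions that do not come from the abstract: each long read is summarised by counts of Watson- and Crick-oriented Strand-seq reads per single cell, each cluster stands for one chromosome and read directionality, counts follow a binomial distribution given a strand state (WW, CC or WC), and `alpha` is a hypothetical background-noise rate. The function names, the `seeds` initialisation and the pseudo-counts are likewise illustrative choices.

```python
import math

STATES = ("WW", "CC", "WC")  # assumed strand states of a chromosome in one cell


def state_likelihoods(w, c, alpha):
    """Binomial likelihood of w Watson reads out of w + c for each strand state."""
    n = w + c
    p_watson = (1 - alpha, alpha, 0.5)  # P(read is Watson) under WW, CC, WC
    return [math.comb(n, w) * p ** w * (1 - p) ** (n - w) for p in p_watson]


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


def em_cluster(counts, seeds, alpha=0.05, iters=20):
    """Soft-cluster long reads by their strand-state pattern across cells.

    counts[i][j] is a (watson, crick) pair for long read i in cell j.
    seeds lists one read index per cluster, used only for initialisation.
    Returns (post, theta): per-read cluster posteriors from the last E-step
    and per-cluster, per-cell strand-state probabilities.
    """
    n_cells, n_clusters = len(counts[0]), len(seeds)
    lik = [[state_likelihoods(w, c, alpha) for (w, c) in read] for read in counts]
    # Initialise each cluster's strand-state probabilities from its seed read.
    theta = [[normalise([x + 1e-3 for x in lik[s][j]]) for j in range(n_cells)]
             for s in seeds]
    pi = [1.0 / n_clusters] * n_clusters
    post = []
    for _ in range(iters):
        post = []
        # Small pseudo-counts keep every probability strictly positive.
        acc = [[[1e-3] * len(STATES) for _ in range(n_cells)]
               for _ in range(n_clusters)]
        for read_lik in lik:
            # E-step: posterior over clusters, computed in log space.
            log_p = []
            for g in range(n_clusters):
                lp = math.log(pi[g])
                for j in range(n_cells):
                    lp += math.log(sum(t * l for t, l in zip(theta[g][j], read_lik[j])))
                log_p.append(lp)
            top = max(log_p)
            r = normalise([math.exp(lp - top) for lp in log_p])
            post.append(r)
            # Expected strand-state counts, weighted by cluster membership.
            for g in range(n_clusters):
                for j in range(n_cells):
                    q = normalise([t * l for t, l in zip(theta[g][j], read_lik[j])])
                    for s in range(len(STATES)):
                        acc[g][j][s] += r[g] * q[s]
        # M-step: re-estimate strand-state probabilities and mixing weights.
        theta = [[normalise(acc[g][j]) for j in range(n_cells)]
                 for g in range(n_clusters)]
        pi = normalise([sum(p[g] for p in post) + 1e-3 for g in range(n_clusters)])
    return post, theta


# Toy input: four reads, three cells; reads 0-1 and reads 2-3 share a pattern.
toy_counts = [
    [(4, 0), (0, 5), (2, 2)],
    [(5, 0), (0, 3), (1, 2)],
    [(0, 4), (5, 0), (3, 0)],
    [(0, 3), (4, 0), (5, 0)],
]
toy_post, _ = em_cluster(toy_counts, seeds=[0, 2])
for i, p in enumerate(toy_post):
    print(i, [round(x, 3) for x in p])
```

Each row of the returned posterior sums to one, so a read can be treated as confidently assigned only when its largest entry exceeds a chosen threshold; the threshold itself is a design choice left open here.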

Availability and implementation

https://github.com/daewoooo/SaaRclust.

SUBMITTER: Ghareghani M

PROVIDER: S-EPMC6022540 | biostudies-literature | 2018 Jul

REPOSITORIES: biostudies-literature

Publications

Strand-seq enables reliable separation of long reads by chromosome via expectation maximization.

Ghareghani M, Porubský D, Sanders AD, Meiers S, Eichler EE, Korbel JO, Marschall T

Bioinformatics (Oxford, England), 2018-07-01, issue 13

PMID: 29949971

Similar Datasets

Project description: Genome-wide nucleosome profiles are predominantly characterized using MNase-seq, which involves extensive MNase digestion and size selection to enrich for mononucleosome-sized fragments. Most available MNase-seq analysis packages assume that nucleosomes uniformly protect 147 bp DNA fragments. However, some nucleosomes with atypical histone or chemical compositions protect shorter lengths of DNA. The rigid assumptions imposed by current nucleosome analysis packages potentially prevent investigators from understanding the regulatory roles played by atypical nucleosomes. To enable the characterization of different nucleosome types from MNase-seq data, we introduce the size-based expectation maximization (SEM) nucleosome-calling package. SEM employs a hierarchical Gaussian mixture model to estimate nucleosome positions and subtypes. Nucleosome subtypes are automatically identified based on the distribution of protected DNA fragments. Benchmark analysis indicates that SEM is on par with existing packages in terms of standard nucleosome-calling accuracy metrics, while uniquely providing the ability to characterize nucleosome subtype identities. Applying SEM to a low-dose MNase-H2B-ChIP-seq data set from mouse embryonic stem cells, we identified three nucleosome types: short-fragment nucleosomes, canonical nucleosomes, and di-nucleosomes. Short-fragment nucleosomes can be divided further into two subtypes based on their chromatin accessibility. Short-fragment nucleosomes in accessible regions exhibit high MNase sensitivity and are enriched at transcription start sites (TSSs) and CTCF peaks, similar to previously reported "fragile nucleosomes." These SEM-defined accessible short-fragment nucleosomes are found not just in promoters but also in distal regulatory regions. Additional analyses reveal their colocalization with the chromatin remodelers CHD6, CHD8, and EP400. In summary, SEM provides an effective platform for exploration of nonstandard nucleosome subtypes.
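To make the mixture-model idea in this description concrete, the following is a minimal sketch of a plain one-dimensional Gaussian mixture fitted by expectation maximization to fragment lengths. It is not the SEM package and does not reproduce its hierarchical model; the function name `gmm_em`, the starting means, the initial standard deviation and the 1 bp floor on the standard deviation are all assumptions made for illustration.

```python
import math


def gmm_em(lengths, init_means, init_sd=15.0, iters=50):
    """Fit a 1-D Gaussian mixture to fragment lengths (in bp) with EM.

    init_means gives one starting mean per component.
    Returns (weights, means, sds).
    """
    k = len(init_means)
    means = [float(m) for m in init_means]
    sds = [float(init_sd)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each fragment,
        # computed in log space to avoid underflow for distant points.
        resp = []
        for x in lengths:
            log_d = [math.log(w) - math.log(s) - 0.5 * ((x - m) / s) ** 2
                     for w, m, s in zip(weights, means, sds)]
            top = max(log_d)
            dens = [math.exp(v - top) for v in log_d]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and spread of each component.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            if n_c < 1e-9:
                continue  # leave an effectively empty component unchanged
            means[c] = sum(r[c] * x for r, x in zip(resp, lengths)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, lengths)) / n_c
            sds[c] = max(math.sqrt(var), 1.0)  # assumed floor of 1 bp
            weights[c] = n_c / len(lengths)
    return weights, means, sds


# Toy fragment lengths loosely grouped as short, canonical and di-nucleosome sized.
toy_lengths = [100, 105, 110, 115, 120, 140, 145, 147, 150, 155, 290, 295, 300, 305, 310]
w, m, s = gmm_em(toy_lengths, init_means=[100, 150, 300])
print([round(v, 1) for v in m])
```

In this toy setting each fitted component can be read as one fragment-size class; how many components to use and how to initialise them are choices the sketch leaves to the caller.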

