Dataset Information

THiCweed: fast, sensitive detection of sequence features by clustering big datasets.


ABSTRACT: We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as, or more accurate than, all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
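To make the phrase "sequence similarity within sliding windows, while exploring both strands" concrete, the following is a minimal sketch of one ingredient of such a clustering: scoring every window of a peak sequence, on both strands, against per-cluster column profiles and assigning the sequence to the best-matching cluster. It is not THiCweed's implementation; the function names, the log-probability score, and the pseudocount are assumptions made for illustration, the divisive hierarchical splitting itself is not shown, and the code assumes uppercase sequences containing only A, C, G and T.

```python
import math
from collections import Counter

# Hypothetical helper names; this is an illustration, not THiCweed's code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def windows_both_strands(seq, width):
    """Yield (strand, offset, window) for every window of `width` on both strands."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            yield strand, i, s[i:i + width]


def build_profile(windows, pseudocount=0.5):
    """Column-wise log-probabilities from aligned, equal-length windows."""
    total = len(windows) + 4 * pseudocount
    profile = []
    for col in range(len(windows[0])):
        counts = Counter(w[col] for w in windows)
        profile.append(
            {b: math.log((counts[b] + pseudocount) / total) for b in "ACGT"}
        )
    return profile


def score(window, profile):
    """Log-likelihood of a window under a column profile."""
    return sum(col[base] for base, col in zip(window, profile))


def best_window(seq, profile):
    """Best-scoring (strand, offset, window) of `seq` against one profile."""
    return max(
        windows_both_strands(seq, len(profile)),
        key=lambda hit: score(hit[2], profile),
    )


def assign(seq, profiles):
    """Return (cluster index, best hit) over all profiles, shifts and strands."""
    best = None
    for k, profile in enumerate(profiles):
        hit = best_window(seq, profile)
        s = score(hit[2], profile)
        if best is None or s > best[0]:
            best = (s, k, hit)
    return best[1], best[2]


# Two toy cluster profiles, each seeded from a single 7-bp window.
profiles = [build_profile(["TGACTCA"]), build_profile(["GGGCGGG"])]
print(assign("AAATGACTCAAA", profiles))
print(assign("TTCCCGCCCTT", profiles))
```

In this toy run the second sequence matches its cluster only on the reverse strand, which is the situation that scanning both strands is meant to catch. The abstract reports that much wider windows (≥50 bp) work best on real data; the 7-bp windows here are only to keep the example readable.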

SUBMITTER: Agrawal A 

PROVIDER: S-EPMC5861420 | biostudies-literature | 2018 Mar

REPOSITORIES: biostudies-literature

Publications

THiCweed: fast, sensitive detection of sequence features by clustering big datasets.

Agrawal A, Sambare SV, Narlikar L, Siddharthan R

Nucleic Acids Research, 2018 Mar 1, issue 5


Similar Datasets

| S-EPMC3843501 | biostudies-literature
| S-EPMC4117525 | biostudies-literature
| S-EPMC4138177 | biostudies-literature
| S-EPMC4481955 | biostudies-literature
| S-EPMC3052304 | biostudies-literature
| S-EPMC4290913 | biostudies-literature
| S-EPMC7766091 | biostudies-literature
| S-EPMC5829143 | biostudies-literature
| S-EPMC4482057 | biostudies-literature
| S-EPMC8124657 | biostudies-literature