Dataset Information

ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

ABSTRACT: Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.
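
The abstract's central idea (define a similarity measure between genomic tracks, then apply a standard clustering algorithm) can be illustrated with a small sketch. This is not ClusTrack's implementation: the Jaccard overlap of covered base pairs, the average-linkage procedure, the function names, and the toy sample names and coordinates are all assumptions chosen for illustration, and each track is simplified to a list of half-open (start, end) intervals on a single chromosome.

```python
from itertools import combinations


def merge_intervals(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bp(intervals):
    """Number of base pairs covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard_similarity(track_a, track_b):
    """Base pairs covered by both tracks divided by base pairs covered by either."""
    union = covered_bp(track_a + track_b)
    if union == 0:
        return 1.0  # assumption: two empty tracks count as identical
    intersection = covered_bp(track_a) + covered_bp(track_b) - union
    return intersection / union


def average_linkage_clusters(tracks, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    names = list(tracks)
    # Distance is 1 - similarity, computed once for every pair of tracks.
    dist = {
        frozenset((a, b)): 1.0 - jaccard_similarity(tracks[a], tracks[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(cluster) for cluster in clusters]


# Hypothetical toy tracks; real tracks would span a whole genome assembly.
tracks = {
    "sample_A": [(100, 200), (500, 600)],
    "sample_B": [(110, 210), (490, 590)],
    "sample_C": [(1000, 1200)],
    "sample_D": [(1050, 1250)],
}
clusters = average_linkage_clusters(tracks, n_clusters=2)
print(clusters)
```

In this toy input, samples A and B overlap heavily with each other, as do C and D, so the two pairs end up in separate clusters. Other choices of feature and similarity measure would plug into the same pairwise-distance step.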

SUBMITTER: Rydbeck H

PROVIDER: S-EPMC4400084 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature

Publications

ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

AUTHORS: Halfdan Rydbeck, Geir Kjetil Sandve, Egil Ferkingstad, Boris Simovski, Morten Rye, Eivind Hovig

JOURNAL: PLoS ONE | 2015-04-16 | issue 4

PMID: 25879845

Similar Datasets

Similarity searches in genome-wide numerical data sets. | S-EPMC1489924 | biostudies-literature

Whole Genome Mapping with Feature Sets from High-Throughput Sequencing Data. | S-EPMC5017645 | biostudies-literature

Annotation-based feature extraction from sets of SBML models. | S-EPMC4405863 | biostudies-literature

Rethinking Measures of Functional Connectivity via Feature Extraction. | S-EPMC6987226 | biostudies-literature

Feature identification in time series data sets. | S-EPMC6536425 | biostudies-literature

Clustering by genetic ancestry using genome-wide SNP data. | S-EPMC3018397 | biostudies-literature

Partitioning clustering algorithms for protein sequence data sets. | S-EPMC2678123 | biostudies-literature

Clustering high-dimensional data via feature selection. | S-EPMC10119907 | biostudies-literature

Data integration by fuzzy similarity-based hierarchical clustering. | S-EPMC7446192 | biostudies-literature

Improving patient clustering by incorporating structured variable label relationships in similarity measures. | S-EPMC11910865 | biostudies-literature