Dataset Information

Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers.

ABSTRACT:

Motivation

Mammalian genomes can contain thousands of enhancers but only a subset are actively driving gene expression in a given cellular context. Integrated genomic datasets can be harnessed to predict active enhancers. One challenge in integration of large genomic datasets is the increasing heterogeneity: continuous, binary and discrete features may all be relevant. Coupled with the typically small numbers of training examples, semi-supervised approaches for heterogeneous data are needed; however, current enhancer prediction methods are not designed to handle heterogeneous data in the semi-supervised paradigm.

Results

We implemented a Dirichlet Process Heterogeneous Mixture model that infers Gaussian, Bernoulli and Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to enhancer candidates in mouse heart tissues based on heterogeneous features. We constrained a small number of known active enhancers to appear in the same cluster, and 47 additional regions clustered with them. Many of these are located near heart-specific genes. The model also predicted 1176 active promoters, suggesting that it can discover new enhancers and promoters.
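The heterogeneous likelihood and the "forced to cluster together" constraint described above can be illustrated with a minimal sketch. It is not the `dphmix` API and not the authors' variational algorithm: it uses fixed point estimates for the parameters instead of variational expectations, a truncated stick-breaking prior, and hypothetical feature names (`signal`, `bound`, `count`) chosen only for illustration.

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log density of a continuous feature under a Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def bernoulli_logpmf(x, p):
    # Log probability of a binary feature (x is 0 or 1).
    return math.log(p) if x == 1 else math.log(1 - p)

def poisson_logpmf(k, rate):
    # Log probability of a count feature under a Poisson.
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def component_loglik(obs, params):
    # Features are treated as independent within a cluster, so the
    # heterogeneous log-likelihood is a sum over feature types.
    return (gaussian_logpdf(obs["signal"], params["mean"], params["var"])
            + bernoulli_logpmf(obs["bound"], params["p"])
            + poisson_logpmf(obs["count"], params["rate"]))

def stick_breaking_weights(betas):
    # Truncated stick-breaking: each beta takes a fraction of the
    # remaining stick; a final beta of 1.0 makes the weights sum to 1.
    weights, remaining = [], 1.0
    for b in betas:
        weights.append(b * remaining)
        remaining *= 1 - b
    return weights

def tied_responsibilities(group, weights, components):
    # Observations in `group` are constrained to share one cluster, so
    # their log-likelihoods are summed before normalising. A group of
    # size one reduces to the ordinary unconstrained case.
    logs = [math.log(w) + sum(component_loglik(o, c) for o in group)
            for w, c in zip(weights, components)]
    top = max(logs)
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for two clusters and two known active enhancers.
components = [
    {"mean": 5.0, "var": 1.0, "p": 0.9, "rate": 8.0},
    {"mean": 0.0, "var": 1.0, "p": 0.1, "rate": 1.0},
]
weights = stick_breaking_weights([0.5, 1.0])
known_enhancers = [
    {"signal": 4.6, "bound": 1, "count": 7},
    {"signal": 5.3, "bound": 1, "count": 9},
]
resp = tied_responsibilities(known_enhancers, weights, components)
```

Summing log-likelihoods across the constrained group is one simple way to make tied observations share a single assignment; the published method derives its own variational updates for this.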

Availability and implementation

We created the 'dphmix' Python package: https://pypi.org/project/dphmix/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Mehdi TF

PROVIDER: S-EPMC6748727 | biostudies-literature | 2019 Sep

REPOSITORIES: biostudies-literature


Publications

Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers.

Mehdi TF, Singh G, Mitchell JA, Moses AM

Bioinformatics (Oxford, England), 2019 Sep 1, issue 18


PMID: 30753279

Similar Datasets

Project description: BACKGROUND: Electrogram-guided ablation procedures have been proposed as an alternative strategy that consists of either mapping and ablating focal sources or targeting complex fractionated electrograms in atrial fibrillation (AF). However, the incomplete understanding of the AF mechanism makes it difficult to decide which target sites to detect. To date, feature extraction from electrograms has relied mostly on time-domain morphology analysis and non-linear features, and their combination has been reported to achieve better performance. Moreover, most inference approaches used to identify levels of fractionation are supervised and lack an objective description of fractionation, which complicates their application in electrogram (EGM)-guided ablation procedures. METHODS: This work proposes a semi-supervised clustering method for four levels of fractionation. In particular, we use spectral clustering to group a set of widely used features extracted from atrial electrograms. We also introduce a new atrial-deflection-based feature to quantify fractionated activity. Further, using sequential forward selection, we find the feature subset that gives the highest performance in terms of cluster validation. The method is tested by external validation on a labeled database. The generalization ability of the proposed training approach is tested to aid semi-supervised learning on an unlabeled dataset associated with anatomical information recorded from three patients. RESULTS: A joint set of four extracted features is selected, two based on time-domain morphology analysis and two on non-linear dynamics. In discriminating between the four levels of fractionation, validation on a labeled database achieves an accuracy of 77.6%. The internal validation index is congruent across the tested patients, which is enough to reconstruct the patterns over the atria and locate critical sites while avoiding prior manual classification of AF types. CONCLUSIONS: To the best of the authors' knowledge, this is the first work to report semi-supervised clustering for distinguishing patterns in fractionated electrograms. The proposed methodology provides high performance in detecting unknown patterns associated with critical EGM morphologies. In particular, the results of semi-supervised training show the advantage of requiring fewer labeled data and less training time without significantly compromising accuracy. This paper introduces a new method that provides an objective scheme enabling electrophysiologists to recognize diverse EGM morphologies reliably.
