Dataset Information

Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers.


ABSTRACT:

Motivation

Mammalian genomes can contain thousands of enhancers, but only a subset actively drive gene expression in a given cellular context. Integrated genomic datasets can be harnessed to predict active enhancers. One challenge in integrating large genomic datasets is their increasing heterogeneity: continuous, binary and discrete features may all be relevant. Because training examples are also typically scarce, semi-supervised approaches for heterogeneous data are needed; however, current enhancer prediction methods are not designed to handle heterogeneous data in the semi-supervised paradigm.

Results

We implemented a Dirichlet Process Heterogeneous Mixture model that infers Gaussian, Bernoulli and Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to enhancer candidates in mouse heart tissues based on heterogeneous features. We constrained a small number of known active enhancers to appear in the same cluster, and 47 additional regions clustered with them. Many of these are located near heart-specific genes. The model also predicted 1176 active promoters, suggesting that it can discover new enhancers and promoters.
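As a rough illustration of what a heterogeneous mixture with tied observations involves, the sketch below scores one region against two clusters by summing Gaussian, Bernoulli and Poisson log-likelihoods, then converts the scores into cluster responsibilities. It is not the dphmix API or the paper's variational algorithm: the feature names (`signal`, `motif`, `count`), the parameter values, the fixed two-cluster truncation, and the function names are all assumptions made for illustration. Features are assumed independent given the cluster, and pooling log-likelihoods across a constrained group is just one simple way to force observations to share a cluster assignment.

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log density of a univariate Gaussian (continuous feature).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def bernoulli_logpmf(x, p):
    # Log probability of a binary feature with success probability p.
    return math.log(p) if x == 1 else math.log(1 - p)

def poisson_logpmf(k, rate):
    # Log probability of a count feature with the given Poisson rate.
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def cluster_loglik(obs, params):
    # Assumption: features are independent given the cluster,
    # so the per-feature log-likelihoods simply add.
    mean, var = params["signal"]
    return (gaussian_logpdf(obs["signal"], mean, var)
            + bernoulli_logpmf(obs["motif"], params["motif"])
            + poisson_logpmf(obs["count"], params["count"]))

def _normalize(log_scores):
    # Numerically stable softmax over unnormalized log scores.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

def responsibilities(obs, clusters, weights):
    # Posterior cluster probabilities for a single observation.
    return _normalize([math.log(w) + cluster_loglik(obs, c)
                       for c, w in zip(clusters, weights)])

def tied_responsibilities(group, clusters, weights):
    # Hypothetical constraint handling: observations in `group` share one
    # assignment, so their log-likelihoods are pooled before normalizing.
    return _normalize([math.log(w) + sum(cluster_loglik(o, c) for o in group)
                       for c, w in zip(clusters, weights)])

# Illustrative (made-up) cluster parameters and observations.
active = {"signal": (2.0, 1.0), "motif": 0.9, "count": 8.0}
inactive = {"signal": (0.0, 1.0), "motif": 0.1, "count": 1.0}
region = {"signal": 1.8, "motif": 1, "count": 7}
known_enhancers = [region, {"signal": 2.3, "motif": 1, "count": 9}]

resp = responsibilities(region, [active, inactive], [0.5, 0.5])
tied = tied_responsibilities(known_enhancers, [active, inactive], [0.5, 0.5])
```

Here `resp` and `tied` each hold one probability per cluster. A Dirichlet process model would additionally infer the mixture weights and the effective number of clusters rather than fixing them as done in this sketch.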

Availability and implementation

We created the 'dphmix' Python package: https://pypi.org/project/dphmix/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Mehdi TF 

PROVIDER: S-EPMC6748727 | biostudies-literature | 2019 Sep

REPOSITORIES: biostudies-literature

Publications

Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers.

Mehdi TF, Singh G, Mitchell JA, Moses AM

Bioinformatics (Oxford, England), 2019-09-01, issue 18

Similar Datasets

2017-10-08 | GSE104714 | GEO
| S-EPMC5786324 | biostudies-literature
| S-EPMC4556708 | biostudies-literature
| S-EPMC8489729 | biostudies-literature
| S-EPMC2666814 | biostudies-other
| S-EPMC2951086 | biostudies-other
| S-EPMC4036113 | biostudies-literature
| S-EPMC4845510 | biostudies-other
| S-EPMC8294591 | biostudies-literature
| S-EPMC6548328 | biostudies-literature