
Dataset Information

Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

ABSTRACT: Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront algorithm and hyperparameter selection, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow. Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a huge range of clustering results from multiple models and hyperparameters to identify an optimal model. Conclusions: Hypercluster improves ease of use, robustness and reproducibility for unsupervised clustering application for high throughput biology. Hypercluster is available on pip and bioconda; installation, documentation and example workflows can be found at: https://github.com/ruggleslab/hypercluster.

SUBMITTER: Blumenberg L

PROVIDER: S-EPMC7525959 | biostudies-literature | 2020 Jan

REPOSITORIES: biostudies-literature
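
The idea described in the abstract, running several clustering models over a grid of hyperparameters and then picking the best-scoring result, can be sketched in a few lines of standard-library Python. This is an illustrative sketch only and does not use or reproduce the hypercluster API: the function names (`kmeans`, `threshold_linkage`, `silhouette`, `sweep`), the toy data, the choice of the silhouette coefficient as the evaluation metric, and the hyperparameter grids are all assumptions made for the example.

```python
import itertools
import math


def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def threshold_linkage(points, eps):
    """Join points that lie within `eps` of each other, directly or through a chain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return float("-inf")  # undefined for a single cluster, so rank it last
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def sweep(points, search_space):
    """Run every (algorithm, hyperparameter) combination and rank by silhouette."""
    results = []
    for name, (fit, grid) in search_space.items():
        keys = sorted(grid)
        for values in itertools.product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            labels = fit(points, **params)
            results.append((silhouette(points, labels), name, params, len(set(labels))))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Toy data: three well-separated 3x3 grids of points (hypothetical, for illustration).
offsets = (-0.5, 0.0, 0.5)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx in offsets for dy in offsets]

search_space = {
    "kmeans": (kmeans, {"k": [2, 3, 4, 5]}),
    "threshold_linkage": (threshold_linkage, {"eps": [0.4, 0.6, 12.0]}),
}
results = sweep(points, search_space)
best_score, best_name, best_params, best_n_clusters = results[0]
print(best_name, best_params, best_n_clusters)
```

Each combination in the sweep is independent of the others, which is what makes this kind of search straightforward to parallelize, for example as separate jobs in a workflow manager.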

Similar Datasets

Unsupervised clustering and epigenetic classification of single cells

Project description:96 scATAC-seq samples generated for the retinoic acid-induced mESC differentiation at day 4.

2018-04-20 | GSE107651 | GEO

An unsupervised neuromorphic clustering algorithm.

Project description:Brains perform complex tasks using a fraction of the power that would be required to do the same on a conventional computer. New neuromorphic hardware systems are now becoming widely available that are intended to emulate the more power efficient, highly parallel operation of brains. However, to use these systems in applications, we need "neuromorphic algorithms" that can run on them. Here we develop a spiking neural network model for neuromorphic hardware that uses spike timing-dependent plasticity and lateral inhibition to perform unsupervised clustering. With this model, time-invariant, rate-coded datasets can be mapped into a feature space with a specified resolution, i.e., number of clusters, using exclusively neuromorphic hardware. We developed and tested implementations on the SpiNNaker neuromorphic system and on GPUs using the GeNN framework. We show that our neuromorphic clustering algorithm achieves results comparable to those of conventional clustering algorithms such as self-organizing maps, neural gas or k-means clustering. We then combine it with a previously reported supervised neuromorphic classifier network to demonstrate its practical use as a neuromorphic preprocessing module.

| S-EPMC6658584 | biostudies-literature

Optical Pushing: A Tool for Parallelized Biomolecule Manipulation.

Project description:The ability to measure and manipulate single molecules has greatly advanced the field of biophysics. Yet, the addition of more single-molecule tools that enable one to measure in a parallel fashion is important to diversify the questions that can be addressed. Here we present optical pushing (OP), a single-molecule technique that is used to exert forces on many individual biomolecules tethered to microspheres using a single collimated laser beam. Forces ranging from a few femtoNewtons to several picoNewtons can be applied with a submillisecond response time. To determine forces exerted on the tethered particles by the laser, we analyzed their measured Brownian motion using, to our knowledge, a newly derived analytical model and numerical simulations. In the model, Brownian rotation of the microspheres is taken into account, which proved to be a critical component to correctly determine the applied forces. We used our OP technique to map the energy landscape of the protein-induced looping dynamics of DNA. OP can be used to apply loading rates in the range of 10^-4 to 10^6 pN/s to many molecules at the same time, which makes it a tool suitable for dynamic force spectroscopy.

| S-EPMC4805865 | biostudies-other

Unsupervised statistical clustering of environmental shotgun sequences.

Project description:BACKGROUND: The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed. RESULTS: We derive an unsupervised, maximum-likelihood formalism for clustering short sequences by their taxonomic origin on the basis of their k-mer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a k-mer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable for binning using the present formalism. Using a high-performance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin. CONCLUSION: An unsupervised binning method based on statistical signatures of short environmental sequences is a viable stand-alone binning method for low complexity samples. For medium and high complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.

| S-EPMC2765972 | biostudies-literature

Unsupervised ranking of clustering algorithms by INFOMAX.

Project description:Clustering and community detection provide a concise way of extracting meaningful information from large datasets. An ever-growing plethora of data clustering and community detection algorithms has been proposed. In this paper, we address the question of ranking the performance of clustering algorithms for a given dataset. We show that, for hard clustering and community detection, Linsker's Infomax principle can be used to rank clustering algorithms. In brief, the algorithm that yields the highest value of the entropy of the partition, for a given number of clusters, is the best one. Indeed, we show on a wide range of datasets of various sizes and topological structures that the ranking provided by the entropy of the partition over a variety of partitioning algorithms is strongly correlated with the overlap with a ground truth partition. The code related to the project is available at https://github.com/Sandipan99/Ranking_cluster_algorithms.

| S-EPMC7588117 | biostudies-literature
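
The ranking rule stated in the description above (for a fixed number of clusters, prefer the partition with the highest entropy) can be illustrated with a short Python sketch. It assumes that "entropy of the partition" means the Shannon entropy of the cluster-size distribution; the paper's exact definition may differ. The function names and the two toy label vectors are hypothetical and are not taken from the linked repository.

```python
import math
from collections import Counter


def partition_entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution of a hard partition."""
    n = len(labels)
    return -sum((size / n) * math.log2(size / n) for size in Counter(labels).values())


def rank_by_entropy(partitions):
    """Order named partitions (same number of clusters) from highest to lowest entropy."""
    return sorted(partitions, key=lambda name: partition_entropy(partitions[name]), reverse=True)


# Two hypothetical partitions of nine items into three clusters each.
partitions = {
    "balanced": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "skewed": [0, 0, 0, 0, 0, 0, 0, 1, 2],
}
ranking = rank_by_entropy(partitions)
print(ranking)
```

Under this assumed definition, a partition with equal-sized clusters reaches the maximum entropy of log2(k) bits for k clusters, so it ranks above a partition dominated by one large cluster.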

FastMEDUSA: a parallelized tool to infer gene regulatory networks.

Project description:In order to construct gene regulatory networks of higher organisms from gene expression and promoter sequence data efficiently, we developed FastMEDUSA. In this parallelized version of the regulatory network-modeling tool MEDUSA, expression and sequence data are shared among a user-defined number of processors on a single multi-core machine or cluster. Our results show that FastMEDUSA allows a more efficient utilization of computational resources. While the determination of a regulatory network of brain tumor in Homo sapiens takes 12 days with MEDUSA, FastMEDUSA obtained the same results in 6 h by utilizing 100 processors. Source code and documentation of FastMEDUSA are available at https://wiki.nci.nih.gov/display/NOBbioinf/FastMEDUSA

| S-EPMC2894517 | biostudies-other

Unsupervised hierarchical clustering of iNPCs induced by 6 or 5 TFs

2011-06-07 | GSE29724 | GEO

Unsupervised hierarchical clustering of iHPCs induced by 9 or 10 TFs

2011-06-07 | GSE29730 | GEO

Unsupervised hierarchical clustering of iHPCs induced by 9 or 10 TFs

Project description:To clarify the gene expression profile of iHep, microarray analysis was performed using iHeps induced by 10 TFs (Foxg1, Lcor, Hnf3b, Hnf4a, Foxo6, Cdx2, Tcf1, Foxa3, Tcf2, Onecut1) and 9 TFs (Onecut1 was omitted from the 10 TFs). Unsupervised hierarchical clustering indicated that iHep expresses a global transcriptional profile more similar to that of HPCs than to that of NPCs, and suggested that the TFs present in the pool acted as inducing TFs. HPCs (HB1 and HNG2) were established from fetal liver (E13.5) of C57BL6J and STOCK Tg(Nanog-GFP, Puro)1 Yam, respectively. iHeps were induced from NPCs (NSBAg2, established from an ES cell line BAg73C2 carrying a beta-geo knock-in allele in Afp) using retroviral vectors (pMXs without drug-selection markers) of 9 or 10 transcription factors. Three weeks after the infection, G418 was added and the cells were cultured for 1 week before harvest. NSBAg2 and NSEB5-2C were used for the data of NPC. GSM396240 and GSM336010 were used for the data of ESC.

2011-06-07 | E-GEOD-29730 | biostudies-arrayexpress

Unsupervised hierarchical clustering of iNPCs induced by 6 or 5 TFs

Project description:To clarify the gene expression profile of iNPC, microarray analysis was performed using iNPCs induced by 6 TFs (Pax6, Hmga2, Etv6, Gatad2b, Nfxl1, and Esx1) and 5 TFs (Esx1 was omitted from the 6 TFs). Unsupervised hierarchical clustering indicated that iNPC expresses a global transcriptional profile more similar to that of NPCs than to that of MEFs, and suggested that the TFs present in the pool acted as inducing TFs. iNPCs were induced from MEF (MEFSH, derived from mice carrying IRES-Hygro in Sox allele) using retroviral vectors (pMXs-IRESNeo) of 6 or 5 transcription factors. Four weeks after the infection, Hygromycin was added and the cells were cultured for 1 week before harvest. NSBAg2 and NSEB5-2C were used for the data of NPC. GSM396240 and GSM336010 were used for the data of ESC. GSM651349 and GSM336011 were used for the data of MEF.

2011-06-07 | E-GEOD-29724 | biostudies-arrayexpress
