Dataset Information

High dimensional biological data retrieval optimization with NoSQL technology.

ABSTRACT:

Background

High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise as more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data.

Results

In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase compared to the relational model implemented on MongoDB.
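
The abstract does not spell out the row-key design, so the following is only a minimal sketch of the general idea, assuming one key per patient with probe values stored beneath it. The names `relational_rows`, `build_key_value_store`, and `fetch_profiles` are hypothetical, the values are made up for illustration, and a plain in-memory dictionary stands in for HBase; it is not the authors' schema or the HBase API.

```python
from collections import defaultdict

# Relational-style layout: one row per (patient, probe) measurement.
# Illustrative values only, not taken from the paper's data set.
relational_rows = [
    ("P1", "probe_A", 7.2),
    ("P1", "probe_B", 5.1),
    ("P2", "probe_A", 6.8),
    ("P2", "probe_B", 4.9),
]


def build_key_value_store(rows):
    """Group measurements under one key per patient (assumed key design)."""
    store = defaultdict(dict)
    for patient_id, probe_id, value in rows:
        store[patient_id][probe_id] = value
    return dict(store)


def fetch_profiles(store, patient_ids):
    """Return whole expression profiles, one keyed lookup per patient."""
    return {pid: store[pid] for pid in patient_ids if pid in store}


kv_store = build_key_value_store(relational_rows)
profiles = fetch_profiles(kv_store, ["P1", "P2"])
```

In this layout a patient's full expression profile is retrieved by a single key rather than by scanning or joining many measurement rows, which is the kind of access pattern a key-value model is meant to favor.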

Conclusions

The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.

SUBMITTER: Wang S

PROVIDER: S-EPMC4248814 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature


Publications

High dimensional biological data retrieval optimization with NoSQL technology.

Shicai Wang, Ioannis Pandis, Chao Wu, Sijin He, David Johnson, Ibrahim Emam, Florian Guitton, Yike Guo

BMC Genomics, 2014-11-13


PMID: 25435347

Similar Datasets

Visualizing structure and transitions in high-dimensional biological data.
| S-EPMC7073148 | biostudies-literature

High-dimensional normalized data profiles for testing derivative-free optimization algorithms.
| S-EPMC9454945 | biostudies-literature

Feature optimization in high dimensional chemical space: statistical and data mining solutions.
| S-EPMC6044099 | biostudies-literature

Interaction-based feature selection and classification for high-dimensional biological data.
| S-EPMC3577111 | biostudies-literature

An imputation-regularized optimization algorithm for high dimensional missing data problems and beyond.
| S-EPMC6533005 | biostudies-literature

Scalable Clustering of High-Dimensional Data Technique Using SPCM with Ant Colony Optimization Intelligence.
| S-EPMC4606166 | biostudies-other

A user-friendly NoSQL framework for managing agricultural field trial data.
| S-EPMC11608345 | biostudies-literature

Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data.
| S-EPMC6845853 | biostudies-literature

Tumor purity adjusted beta values improve biological interpretability of high-dimensional DNA methylation data.
| S-EPMC9462735 | biostudies-literature

Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data.
| S-EPMC5634325 | biostudies-literature