Dataset Information

BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.

ABSTRACT: BACKGROUND:The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. RESULTS:We present BAMSI, a Software as-a Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds - an increasingly common scenario for both academic and commercial cloud providers - our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. CONCLUSIONS:BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis.

SUBMITTER: Ausmees K

PROVIDER: S-EPMC6019789 | biostudies-literature | 2018 Jun

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.

Ausmees Kristiina K John Aji A Toor Salman Z SZ Hellander Andreas A Nettelblad Carl C

BMC bioinformatics 20180626 1

<h4>Background</h4>The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a high ...[more]

PMID: 29940842

Similar Datasets

Project description:BackgroundNon-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.ResultsWe compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.ConclusionsWe found two different filter statistics that tended to prioritize features with different characteristics, each performed well for identifying clusters of cancer and non-cancer tissue, and identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.

Project description:Recently, manufacturing firms and logistics service providers have been encouraged to deploy the most recent features of Information Technology (IT) to prevail in the competitive circumstances of manufacturing industries. Industry 4.0 and Cloud manufacturing (CMfg), accompanied by a service-oriented architecture model, have been regarded as renowned approaches to enable and facilitate the transition of conventional manufacturing business models into more efficient and productive ones. Furthermore, there is an aptness among the manufacturing and logistics businesses as service providers to synergize and cut down the investment and operational costs via sharing logistics fleet and production facilities in the form of outsourcing and consequently increase their profitability. Therefore, due to the Everything as a Service (XaaS) paradigm, efficient service composition is known to be a remarkable issue in the cloud manufacturing paradigm. This issue is challenging due to the service composition problem's large size and complicated computational characteristics. This paper has focused on the considerable number of continually received service requests, which must be prioritized and handled in the minimum possible time while fulfilling the Quality of Service (QoS) parameters. Considering the NP-hard nature and dynamicity of the allocation problem in the Cloud composition problem, heuristic and metaheuristic solving approaches are strongly preferred to obtain optimal or nearly optimal solutions. This study has presented an innovative, time-efficient approach for mutual manufacturing and logistical service composition with the QoS considerations. The method presented in this paper is highly competent in solving large-scale service composition problems time-efficiently while satisfying the optimality gap. A sample dataset has been synthesized to evaluate the outcomes of the developed model compared to earlier research studies. The results show the proposed algorithm can be applied to fulfill the dynamic behavior of manufacturing and logistics service composition due to its efficiency in solving time. The paper has embedded the relation of task and logistic services for cloud service composition in solving algorithm and enhanced the efficiency of resulted matched services. Moreover, considering the possibility of arrival of new services and demands into cloud, the proposed algorithm adapts the service composition algorithm.

Dataset Information

BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.

Publications

BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets