Dataset Information

Efficient string similarity join in multi-core and distributed systems.

ABSTRACT: In the area of big data, a significant challenge for string similarity join is finding all similar pairs efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into small disjoint subsets according to the joint frequency distribution and the interval distribution of strings. Then a filter-verification strategy is adopted when computing string similarity for each subset, so that the number of candidate pairs is reduced before an effective pruning strategy is used to improve performance. Finally, the string join is executed in parallel. The Para-Join algorithm, based on multi-threading, is proposed to implement the framework on a multi-core system, while the Pada-Join algorithm, based on the Spark platform, is proposed to implement the framework on a cluster. We prove that Para-Join and Pada-Join not only avoid duplicate computation but also ensure the completeness of the result. Experimental results show that Para-Join achieves high efficiency and significantly outperforms state-of-the-art approaches, while Pada-Join can work on large datasets.
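
The split / filter-verify / parallel-join pipeline described in the abstract can be illustrated with a minimal sketch. This is not Para-Join or Pada-Join: the abstract does not give their details, so the sketch makes several assumptions. It partitions by string length instead of by frequency and interval distribution, uses edit distance with a threshold `tau` as the similarity measure, and uses a plain length filter as the filter step. The names `edit_distance`, `join_bucket`, and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (the verification step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def join_bucket(buckets, length, tau):
    """Join one length bucket with itself and with longer buckets within tau.

    Filter (assumed): strings whose lengths differ by more than tau cannot be
    within edit distance tau, so those buckets are never compared.
    """
    out = []
    own = buckets[length]
    # Inside the bucket, i < j means each unordered pair is verified once.
    for i in range(len(own)):
        for j in range(i + 1, len(own)):
            if edit_distance(own[i], own[j]) <= tau:
                out.append((own[i], own[j]))
    # Across buckets, only look at longer strings, so a (short, long) pair
    # is produced by exactly one task.
    for other in range(length + 1, length + tau + 1):
        for s in own:
            for t in buckets.get(other, []):
                if edit_distance(s, t) <= tau:
                    out.append((s, t))
    return out


def similarity_join(strings, tau, workers=4):
    """Return all pairs of distinct strings with edit distance <= tau."""
    buckets = defaultdict(list)
    for s in sorted(set(strings)):       # exact duplicates are collapsed here
        buckets[len(s)].append(s)
    buckets = dict(buckets)
    # One task per disjoint subset (length bucket), run on a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda n: join_bucket(buckets, n, tau), sorted(buckets))
        return sorted(pair for part in parts for pair in part)


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "dog"]
    for pair in similarity_join(words, tau=2):
        print(pair)
```

Because every task only compares its own bucket with itself and with longer buckets, no pair is verified twice, and the length filter never discards a true match, so the result is complete for this simplified setting. In CPython, threads will not speed up this CPU-bound loop because of the global interpreter lock; the thread pool only illustrates the task structure, and a process pool or a Spark job would be needed for real parallelism.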

SUBMITTER: Yan C

PROVIDER: S-EPMC5344375 | biostudies-other | 2017

REPOSITORIES: biostudies-other

Publications

Efficient string similarity join in multi-core and distributed systems.

Yan Cairong, Zhao Xue, Zhang Qinglong, Huang Yongfeng

PLoS ONE, 2017-03-09, issue 3

PMID: 28278177

Similar Datasets

Project description: Distributed systems architectures are becoming the standard computational model for processing and transporting information, especially in cloud computing environments. The growing demand for application processing and data management from enterprise and end-user workloads continues to move from a single-node client-server architecture to a distributed multitier design in which data processing and transmission are segregated. Software development must consider the orchestration required to provision its core components in order to deploy services efficiently across many independent, loosely coupled, physically and virtually interconnected data centers spread across the globe. This network routing challenge can be modeled as a variation of the Travelling Salesman Problem (TSP). This paper proposes a new optimization algorithm for optimum route selection using Algorithmic Information Theory. The Kelly criterion for a Shannon-Bernoulli process is used to generate a reliable quantitative algorithm that finds a near-optimal solution tour. The algorithm is then verified by comparing its results with benchmark heuristic solutions in 3 test cases. A statistical analysis is designed to measure the significance of the results between the algorithms, and the entropy function can be derived from the distribution. The test results show an improvement in solution quality, producing routes with smaller length and time requirements. The quality of the results proves the flexibility of the proposed algorithm for problems of different complexities without relying on nature-inspired models such as Genetic Algorithms, Ant Colony, Cross Entropy, Neural Networks, 2opt and Simulated Annealing. The proposed algorithm can be used by applications to deploy services across large clusters of nodes by making better decisions in route design.
The findings in this paper unify critical areas in Computer Science, Mathematics and Statistics that many researchers have not explored, and provide a new interpretation that advances the understanding of the role of entropy in decision problems encoded in Turing Machines.

Project description:A distributed biological system can be defined as a system whose components are located in different subpopulations, which communicate and coordinate their actions through interpopulation messages and interactions. We see that distributed systems are pervasive in nature, performing computation across all scales, from microbial communities to a flock of birds. We often observe that information processing within communities exhibits a complexity far greater than any single organism. Synthetic biology is an area of research which aims to design and build synthetic biological machines from biological parts to perform a defined function, in a manner similar to the engineering disciplines. However, the field has reached a bottleneck in the complexity of the genetic networks that we can implement using monocultures, facing constraints from metabolic burden and genetic interference. This makes building distributed biological systems an attractive prospect for synthetic biology that would alleviate these constraints and allow us to expand the applications of our systems into areas including complex biosensing and diagnostic tools, bioprocess control and the monitoring of industrial processes. In this review we will discuss the fundamental limitations we face when engineering functionality with a monoculture, and the key areas where distributed systems can provide an advantage. We cite evidence from natural systems that support arguments in favor of distributed systems to overcome the limitations of monocultures. Following this we conduct a comprehensive overview of the synthetic communities that have been built to date, and the components that have been used. The potential computational capabilities of communities are discussed, along with some of the applications that these will be useful for. We discuss some of the challenges with building co-cultures, including the problem of competitive exclusion and maintenance of desired community composition. 
Finally, we assess the computational frameworks currently available to aid in the design of microbial communities and identify areas where we lack the necessary tools.
