Dataset Information

Efficient string similarity join in multi-core and distributed systems.


ABSTRACT: In the big data area, a significant challenge for string similarity join is finding all similar pairs efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into disjoint small subsets according to the joint frequency distribution and the interval distribution of strings. Then a filter-verification strategy is adopted when computing string similarity for each subset, so that the number of candidate pairs is reduced before an effective pruning strategy is applied to improve performance. Finally, the string join operation is executed in parallel. The Para-Join algorithm, based on multi-threading, implements the framework on a multi-core system, while the Pada-Join algorithm, based on the Spark platform, implements it on a cluster. We prove that Para-Join and Pada-Join not only avoid redundant computation but also ensure the completeness of the result. Experimental results show that Para-Join achieves high efficiency and significantly outperforms state-of-the-art approaches, while Pada-Join can handle large datasets.
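As an illustration of the filter-verification idea mentioned in the abstract, here is a minimal single-threaded sketch. It is not the Para-Join or Pada-Join algorithm: the abstract does not state the similarity measure or the filters used, so this sketch assumes edit distance with a threshold `tau` and a simple length filter, and the function names `edit_distance` and `similarity_self_join` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_self_join(strings, tau):
    """Return all pairs of distinct strings whose edit distance is <= tau.

    Assumption: edit distance and a length filter stand in for whatever
    measure and filters the paper actually uses.
    """
    results = []
    for s, t in combinations(sorted(set(strings)), 2):
        # Filter step: strings whose lengths differ by more than tau
        # cannot be within edit distance tau, so skip them cheaply.
        if abs(len(s) - len(t)) > tau:
            continue
        # Verification step: compute the exact distance for survivors.
        if edit_distance(s, t) <= tau:
            results.append((s, t))
    return results


if __name__ == "__main__":
    print(similarity_self_join(["kitten", "sitten", "sitting", "dog"], tau=2))
```

The filter discards pairs without running the quadratic-time distance computation, which is the general reason a filter-verification scheme reduces work. A parallel version in the spirit of the abstract would partition the input into disjoint subsets and run the join on each subset concurrently; how those subsets are chosen is not recoverable from the abstract alone.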

SUBMITTER: Yan C 

PROVIDER: S-EPMC5344375 | biostudies-other | 2017

REPOSITORIES: biostudies-other

Publications

Efficient string similarity join in multi-core and distributed systems.

Yan Cairong, Zhao Xue, Zhang Qinglong, Huang Yongfeng

PLoS One, 2017-03-09, issue 3

Similar Datasets

| S-EPMC7781673 | biostudies-literature
| S-EPMC2865495 | biostudies-literature
| S-EPMC8490431 | biostudies-literature
| S-EPMC4892414 | biostudies-literature
| S-EPMC4037181 | biostudies-literature
| S-EPMC3507659 | biostudies-literature
| S-EPMC3366991 | biostudies-literature
| S-EPMC7387671 | biostudies-literature
| S-EPMC6553594 | biostudies-literature
| S-EPMC7035299 | biostudies-literature