On a parallel spark workflow for frequent itemset mining based on array prefix-tree
Document Type
Conference Proceeding
Publication Date
6-25-2022
Abstract
Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods, including FP-Growth, have been proposed. FP-Growth is a classical frequent itemset mining method that generates pattern databases without candidate generation. Because of the high time complexity and memory usage of FP-Growth, many improvements have been proposed in the literature. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix-Tree (AP-Tree), and propose a new algorithm, Array Prefix-Tree Growth (APT-Growth), which is further parallelized as a Spark workflow, referred to as PAPT-Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT-Growth to maintain its running-time performance. We conduct extensive experiments with different thresholds on multiple datasets, and the experimental results show the performance superiority of PAPT-Growth over several state-of-the-art methods such as PFP, YAFIM, and DFPS. The density analysis reveals a change point, which justifies the necessity and validity of adaptive algorithm selection.
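To make the problem statement in the abstract concrete, the following is a minimal Python sketch of frequent itemset mining by brute-force support counting, together with a density-based switch between two strategies. It is not the AP-Tree, APT-Growth, or PAPT-Growth method, whose details the abstract does not give. The density formula (average transaction length divided by the number of distinct items), the default threshold of 0.5, and the names `frequent_itemsets`, `density`, and `choose_miner` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force miner: count every itemset contained in each transaction
    and keep those whose support (number of containing transactions) is at
    least min_support. Exponential in transaction length; illustration only."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # de-duplicate and fix an order for tuple keys
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


def density(transactions):
    """ASSUMED definition of density: average transaction length divided by
    the number of distinct items. The paper may define it differently."""
    distinct = {item for t in transactions for item in t}
    if not transactions or not distinct:
        return 0.0
    avg_len = sum(len(set(t)) for t in transactions) / len(transactions)
    return avg_len / len(distinct)


def choose_miner(transactions, density_threshold=0.5):
    """Hypothetical adaptive selection: return a label for which strategy
    would be used. The threshold value is a placeholder, not from the paper."""
    if density(transactions) >= density_threshold:
        return "dense-strategy"
    return "sparse-strategy"


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = frequent_itemsets(transactions, min_support=3)
selected = choose_miner(transactions)
```

With these five transactions and a minimum support of 3, every single item and every pair is frequent, while the triple `("a", "b", "c")` appears in only two transactions and is dropped. Methods such as FP-Growth avoid this exhaustive enumeration by compressing the transactions into a prefix tree first.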
Identifier
85104832220 (Scopus)
Publication Title
Concurrency and Computation: Practice and Experience
External Full Text Location
https://doi.org/10.1002/cpe.6313
e-ISSN
1532-0634
ISSN
1532-0626
Issue
14
Volume
34
Grant
SGSCXTOOXGJS1800219
Fund Ref
Sichuan Province Science and Technology Support Program
Recommended Citation
Niu, Xinzheng; Wu, Peng; Wu, Chase Q.; Hou, Aiqin; and Qian, Mideng, "On a parallel spark workflow for frequent itemset mining based on array prefix-tree" (2022). Faculty Publications. 2880.
https://digitalcommons.njit.edu/fac_pubs/2880