On a parallel Spark workflow for frequent itemset mining based on array prefix-tree
Document Type
Conference Proceeding
Publication Date
11-1-2019
Abstract
Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among the many existing algorithms, FP-Growth is considered a milestone achievement that discovers frequent itemsets without generating candidates. However, due to the high complexity of its mining process and its high memory cost, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure and, based on it, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need to recursively construct conditional FP-Trees as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APT-Growth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.
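To make the problem statement concrete, the following is a minimal sketch of what frequent itemset mining computes: every set of items that occurs in at least a given number of transactions. It is a naive brute-force enumeration written only for illustration; it is not APT-Growth, PAPT-Growth, or FP-Growth, and the function name `frequent_itemsets`, the `min_support` parameter, and the sample transactions are all assumptions introduced here rather than anything taken from the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Hypothetical helper for illustration: it enumerates every subset of
    every transaction, so its cost is exponential in transaction length.
    Algorithms such as FP-Growth exist precisely to avoid this blow-up.
    """
    counts = Counter()
    for transaction in transactions:
        # De-duplicate and sort so each subset is generated once per transaction.
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    # Keep only the itemsets that meet the support threshold.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up sample data, not from the paper's datasets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(transactions, min_support=2)
```

With this sample data and a support threshold of 2, each single item and each pair is frequent, while the triple {bread, butter, milk} appears in only one transaction and is filtered out.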
Identifier
85078082963 (Scopus)
ISBN
[9781728159973]
Publication Title
Proceedings of WORKS 2019: 14th Workshop on Workflows in Support of Large-Scale Science, Held in Conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
External Full Text Location
https://doi.org/10.1109/WORKS49585.2019.00011
First Page
50
Last Page
59
Recommended Citation
Niu, Xinzheng; Qian, Mideng; Wu, Chase; and Hou, Aiqin, "On a parallel Spark workflow for frequent itemset mining based on array prefix-tree" (2019). Faculty Publications. 7237.
https://digitalcommons.njit.edu/fac_pubs/7237
