On a parallel Spark workflow for frequent itemset mining based on array prefix-tree

Document Type

Conference Proceeding

Publication Date

11-1-2019

Abstract

Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among many existing algorithms, FP-Growth is considered a milestone achievement that discovers frequent itemsets without generating candidates. However, due to the high complexity of its mining process and the high cost of its memory usage, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure and, based on it, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need to recursively construct conditional FP-Trees as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APT-Growth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.
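
To make concrete what a frequent itemset mining procedure computes, the following is a minimal sketch in Python. It is a naive subset-counting baseline, not FP-Growth, APT-Growth, or PAPT-Growth, and it uses no Spark. The function name `frequent_itemsets`, the `min_support` parameter (an absolute count), and the toy transactions are assumptions made for illustration only; none of them come from the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Naive FIM: count every non-empty subset of every transaction.

    Exponential in transaction length, so suitable for toy inputs only.
    Returns a dict mapping each itemset (a sorted tuple) to its support count.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    # Keep only itemsets that appear in at least min_support transactions.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy dataset (not from the paper).
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

result = frequent_itemsets(transactions, min_support=2)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With a support threshold of 2, every single item and every pair in this toy dataset is frequent, while the three-item set appears only once and is dropped. Algorithms such as FP-Growth reach the same kind of answer without enumerating every subset, which is why they matter for large datasets.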

Identifier

85078082963 (Scopus)

ISBN

[9781728159973]

Publication Title

Proceedings of WORKS 2019: 14th Workshop on Workflows in Support of Large-Scale Science, Held in Conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

External Full Text Location

https://doi.org/10.1109/WORKS49585.2019.00011

First Page

50

Last Page

59

This document is currently not available here.