Throughput optimization for Storm-based processing of stream data on clouds

Document Type

Article

Publication Date

11-1-2020

Abstract

There is a rapidly growing need to process large volumes of streaming data in real time in various big data applications. As one of the most commonly used systems for streaming data processing, Apache Storm provides a workflow-based mechanism to execute directed acyclic graph (DAG)-structured topologies. With the expansion of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, many such Storm workflows have been moved, or are being moved, to clouds. However, modeling the behavior of streaming data processing and improving its performance in clouds remain largely unexplored. We construct rigorous cost models to analyze the throughput dynamics of Storm workflows and formulate a budget-constrained topology mapping problem to maximize Storm workflow throughput in clouds. We show this problem to be NP-complete and design a heuristic solution that considers both the selection of virtual machine type and the degree of parallelism for each task (spout/bolt) in the topology. Extensive simulations illustrate the performance superiority of the proposed mapping solution, which is further verified by real-life workflow experiments deployed in public clouds, in comparison with the default Storm and other existing methods.

Identifier

85086361943 (Scopus)

Publication Title

Future Generation Computer Systems

External Full Text Location

https://doi.org/10.1016/j.future.2020.06.009

ISSN

0167-739X

First Page

567

Last Page

579

Volume

112

Grant

1828123

Fund Ref

National Science Foundation
