Document Type
Dissertation
Date of Award
5-31-2022
Degree Name
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Department
Computer Science
First Advisor
Chase Qishi Wu
Second Advisor
Baruch Schieber
Third Advisor
Jing Li
Fourth Advisor
Pan Xu
Fifth Advisor
Dantong Yu
Sixth Advisor
Dimitrios Katramatos
Abstract
The processing and analysis of big data increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of such workflows goes far beyond the capability and capacity of single computers and is now commonly supported on reliable and scalable data storage and analysis platforms in distributed environments, such as the Hadoop ecosystem. Workflow performance largely depends on how big data systems are configured and used. For example, the makespan of a big data workflow is affected by multiple layers of big data systems, including the parallel computing engine it runs on, the resource manager that orchestrates various types of computing resources, the distributed file system that partitions and stores data blocks, the network infrastructure that connects all computing nodes, and the parameter settings of each layer. Furthermore, these layers exert significant compound effects on workflow performance and often exhibit complex interactive behaviors.
In this dissertation, a codesign framework for big data workflows is proposed, where the technology stack of big data systems is divided into multiple layers: application workflow, computing platform, resource management, data storage, and network infrastructure. In the network layer, the factors influencing traditional network performance are discussed, and one potential future quantum network design and its performance-critical parameters are investigated as well. In the resource management layer, a storage-aware scheduling algorithm is designed, which considers two layers of data movement to optimize workflow makespan. In the computing engine layer, the coupling effects of parameters are investigated and a stochastic approximation-based parameter tuning algorithm is designed to optimize the performance of Spark streaming applications. Lastly, a profiling-based cross-layer coupled design framework is proposed to determine the best parameter setting and identify the most suitable technique for each technology layer to optimize overall workflow performance. To tackle the large parameter space, two approaches are designed to reduce the number of experiments required for profiling: i) identify a subset of critical parameters with the most significant influence through information theory-based feature selection; and ii) minimize the search process within the value range of each critical parameter using stochastic approximation. The codesign framework, together with the performance optimization techniques proposed in this dissertation, is generic and hence applicable to other big data systems, helping users determine the most effective system configuration.
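The stochastic approximation approach mentioned above can be illustrated with a minimal SPSA (Simultaneous Perturbation Stochastic Approximation) sketch. This is an assumed, generic variant for illustration only; the dissertation's exact formulation, gain sequences, and loss function (e.g., measured workflow makespan under a given parameter setting) may differ:

```python
import random

def spsa_minimize(loss, theta, num_iters=100, a=0.1, c=0.1, seed=0):
    """Minimize a black-box loss with SPSA.

    SPSA estimates the gradient with only two loss evaluations per
    iteration regardless of the number of parameters -- useful when
    each evaluation is an expensive profiling run of a workflow.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, num_iters + 1):
        ak = a / k ** 0.602   # decaying step-size gain (standard SPSA exponent)
        ck = c / k ** 0.101   # decaying perturbation magnitude
        # Perturb all coordinates simultaneously with random +/-1 signs
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + ck * d for t, d in zip(theta, delta)]
        minus = [t - ck * d for t, d in zip(theta, delta)]
        # Two-sided finite difference along the random direction
        g_scale = (loss(plus) - loss(minus)) / (2.0 * ck)
        # Per-coordinate gradient estimate and update
        theta = [t - ak * g_scale / d for t, d in zip(theta, delta)]
    return theta
```

For instance, `spsa_minimize(lambda x: (x[0] - 3.0) ** 2, [0.0], num_iters=300)` converges toward the minimizer at 3.0 using only two loss evaluations per iteration, which is the property that makes stochastic approximation attractive when each "evaluation" is a full profiling experiment.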
Recommended Citation
Ye, Qianwen, "Towards a cross-layer coupled design framework for big data workflows" (2022). Dissertations. 1787.
https://digitalcommons.njit.edu/dissertations/1787