Document Type
Dissertation
Date of Award
5-31-2022
Degree Name
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Department
Computer Science
First Advisor
Chase Qishi Wu
Second Advisor
Baruch Schieber
Third Advisor
Jing Li
Fourth Advisor
Pan Xu
Fifth Advisor
Dantong Yu
Sixth Advisor
Dimitrios Katramatos
Abstract
The processing and analysis of big data increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of such workflows goes far beyond the capability and capacity of single computers and is now commonly supported on reliable and scalable data storage and analysis platforms in distributed environments, such as the Hadoop ecosystem. Workflow performance largely depends on how big data systems are configured and used. For example, the makespan of a big data workflow is affected by multiple layers of big data systems, including the parallel computing engine it runs on, the resource manager that orchestrates various types of computing resources, the distributed file system that partitions and stores data blocks, the network infrastructure that connects all computing nodes, and the parameter settings of each layer. Furthermore, these layers exert significant compound effects on workflow performance and often exhibit complex interactive behaviors.
In this dissertation, a codesign framework for big data workflows is proposed, where the technology stack of big data systems is divided into multiple layers: application workflow, computing platform, resource management, data storage, and network infrastructure. In the network layer, the factors influencing traditional network performance are discussed, and one potential future quantum network design and its performance-critical parameters are investigated as well. In the resource management layer, a storage-aware scheduling algorithm is designed, which considers two layers of data movement to optimize workflow makespan. In the computing engine layer, the coupling effects of parameters are investigated and a stochastic approximation-based parameter tuning algorithm is designed to optimize the performance of Spark streaming applications. Lastly, a profiling-based cross-layer coupled design framework is proposed to determine the best parameter setting and identify the most suitable technique for each technology layer to optimize overall workflow performance. To tackle the large parameter space, two approaches are designed to reduce the number of experiments required for profiling: i) identify a subset of critical parameters with the most significant influence through information theory-based feature selection; and ii) minimize the search process within the value range of each critical parameter using stochastic approximation. The codesign framework, together with the performance optimization techniques proposed in this dissertation, is generic and hence applicable to other big data systems, helping users determine the most effective system configuration.
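The stochastic approximation approach mentioned above can be illustrated with a minimal SPSA (Simultaneous Perturbation Stochastic Approximation) sketch. This is an assumed, generic variant for illustration only; the dissertation's exact formulation, gain sequences, and loss function (e.g., measured workflow makespan under a given parameter setting) may differ:

```python
import random

def spsa_minimize(loss, theta, num_iters=100, a=0.1, c=0.1, seed=0):
    """Minimize a black-box loss with SPSA.

    SPSA estimates the gradient with only two loss evaluations per
    iteration regardless of the number of parameters -- useful when
    each evaluation is an expensive profiling run of a workflow.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, num_iters + 1):
        ak = a / k ** 0.602   # decaying step-size gain (standard SPSA exponent)
        ck = c / k ** 0.101   # decaying perturbation magnitude
        # Perturb all coordinates simultaneously with random +/-1 signs
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + ck * d for t, d in zip(theta, delta)]
        minus = [t - ck * d for t, d in zip(theta, delta)]
        # Two-sided finite difference along the random direction
        g_scale = (loss(plus) - loss(minus)) / (2.0 * ck)
        # Per-coordinate gradient estimate and update
        theta = [t - ak * g_scale / d for t, d in zip(theta, delta)]
    return theta
```

For instance, `spsa_minimize(lambda x: (x[0] - 3.0) ** 2, [0.0], num_iters=300)` converges toward the minimizer at 3.0 using only two loss evaluations per iteration, which is the property that makes stochastic approximation attractive when each "evaluation" is a full profiling experiment.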
Recommended Citation
Ye, Qianwen, "Towards a cross-layer coupled design framework for big data workflows" (2022). Dissertations. 1787.
https://digitalcommons.njit.edu/dissertations/1787