Document Type

Dissertation

Date of Award

5-31-2022

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer Science

First Advisor

Chase Qishi Wu

Second Advisor

Baruch Schieber

Third Advisor

Jing Li

Fourth Advisor

Pan Xu

Fifth Advisor

Dantong Yu

Sixth Advisor

Dimitrios Katramatos

Abstract

The processing and analysis of big data increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of such workflows goes far beyond the capability and capacity of a single computer and is now commonly supported on reliable and scalable data storage and analysis platforms in distributed environments, such as the Hadoop ecosystem. Workflow performance largely depends on how big data systems are configured and used. For example, the makespan of a big data workflow is affected by multiple layers of big data systems, including the parallel computing engine it runs on, the resource manager that orchestrates various types of computing resources, the distributed file system that partitions and stores data blocks, and the network infrastructure that connects all computing nodes, as well as the parameter settings of each layer. Furthermore, different layers exert significant compound effects on workflow performance and often exhibit complex interactive behaviors.

In this dissertation, a codesign framework for big data workflows is proposed, where the technology stack of big data systems is divided into multiple layers including application workflow, computing platform, resource management, data storage, and network infrastructure. In the network layer, the factors influencing traditional network performance are discussed, and one potential future quantum network design and its performance-critical parameters are investigated as well. In the resource management layer, a storage-aware scheduling algorithm is designed, which considers two layers of data movement to optimize workflow makespan. In the computing engine layer, the coupling effects of parameters are investigated and a stochastic approximation-based parameter tuning algorithm is designed to optimize the performance of Spark streaming applications. Lastly, a profiling-based cross-layer coupled design framework is proposed to determine the best parameter setting and identify the most suitable technique for each technology layer to optimize overall workflow performance. To tackle the large parameter space, two approaches are designed to reduce the number of experiments required for profiling: i) identifying a subset of critical parameters with the most significant influence through information theory-based feature selection; and ii) minimizing the search process within the value range of each critical parameter using stochastic approximation. The codesign framework and the performance optimization techniques proposed in this dissertation are generic and hence applicable to other big data systems, helping users determine the most effective system configuration.
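The two profiling-reduction ideas above can be illustrated with a minimal sketch. This is not the dissertation's actual algorithm: the binned mutual-information estimator is a generic stand-in for the information theory-based feature selection, and SPSA (simultaneous perturbation stochastic approximation) is used as one representative stochastic-approximation search; all function names, gain schedules, and constants are illustrative assumptions. In a real deployment, the objective `f` would run the Spark workload under a candidate configuration and return its measured makespan.

```python
import math
import random

def mutual_information(xs, ys, bins=4):
    """Estimate I(X; Y) in bits from paired samples via equal-width binning.

    Sketch of information theory-based screening: rank each candidate
    configuration parameter (xs) by how much information it carries about
    a measured objective such as makespan (ys); keep only the top-ranked.
    """
    def idx(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    n = len(xs)
    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        bx, by = idx(x, lox, hix), idx(y, loy, hiy)
        joint[(bx, by)] = joint.get((bx, by), 0) + 1
        px[bx] = px.get(bx, 0) + 1
        py[by] = py.get(by, 0) + 1
    return sum(
        (c / n) * math.log2((c / n) / ((px[bx] / n) * (py[by] / n)))
        for (bx, by), c in joint.items()
    )

def spsa_minimize(f, theta, iters=100, a=0.25, c=0.1, seed=0):
    """Simultaneous-perturbation stochastic approximation (SPSA).

    Each iteration estimates the gradient from only two evaluations of f,
    regardless of how many parameters are tuned simultaneously, which is
    what keeps the number of profiling experiments small.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, iters + 1):
        ak = a / k                # decaying step size (assumed schedule)
        ck = c / k ** 0.25        # decaying perturbation size
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + ck * d for t, d in zip(theta, delta)]
        minus = [t - ck * d for t, d in zip(theta, delta)]
        g = (f(plus) - f(minus)) / (2 * ck)
        theta = [t - ak * g * d for t, d in zip(theta, delta)]
    return theta
```

Under this sketch, screening first shrinks the parameter set, and SPSA then searches within the value ranges of the surviving critical parameters, so total profiling cost grows with the number of iterations rather than with the dimensionality of the full parameter space.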
