Document Type
Dissertation
Date of Award
12-31-2021
Degree Name
Doctor of Philosophy in Computer Engineering - (Ph.D.)
Department
Electrical and Computer Engineering
First Advisor
Qing Gary Liu
Second Advisor
Nirwan Ansari
Third Advisor
MengChu Zhou
Fourth Advisor
Roberto Rojas-Cessa
Fifth Advisor
Xiaoning Ding
Abstract
As high-performance computing (HPC) is scaled up to exascale to accommodate new modeling and simulation needs, I/O continues to be a major bottleneck in end-to-end scientific processes. To bridge the widening gap between compute and I/O, and to enable data to be stored and analyzed more efficiently, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. A major question that the community is also striving to answer is how to co-design data storage and complex physics-rich analytics so that the time to knowledge in post-processing is minimized. As HPC storage systems have become deeper and more complex, the paradigms and methods for data storage and analysis must be fundamentally rethought. To that end, this dissertation develops SIRIUS, a progressive JPEG-like data management scheme for storing and analyzing big scientific data and a coordinated cross-layer approach that reacts to storage interference from both the storage and application layers.
For data storage, the proposed approach refactors simulation data, with reasonably low overhead, into a much smaller, reduced-accuracy base dataset and a series of deltas that can be used to augment the accuracy when needed. The base dataset and deltas are compressed and written to multiple storage tiers. For data analysis, this work addresses the issue of I/O interference for data analytics on cgroups-based storage. In particular, it explores the emerging scenario of containerization on HPC systems, where the local storage is shared by multiple containers. By decomposing and distributing analysis data across the storage hierarchy, data analytics can adapt to interference by reducing or completely avoiding access to lower tiers whenever interference is high, while maintaining a prescribed error bound to limit the information loss. Meanwhile, proper actions are also taken at the storage layer to ensure that sufficient bandwidth is allocated for retrieving an augmentation.
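As a rough illustration of the base-plus-deltas idea, the following minimal Python sketch refactors a 1-D array into a decimated base and a series of deltas, and then reconstructs it with only as many deltas as a prescribed error bound requires. The decimation-by-two scheme, the linear interpolation, and all function names here are assumptions made for illustration; they are not taken from the dissertation, and compression and tier placement are omitted.

```python
# Illustrative sketch only: the decimation scheme and all names below are
# assumptions, not the dissertation's actual algorithm or API.

def decimate(values):
    """Keep every second sample to form a reduced-accuracy representation."""
    return values[::2]

def upsample(coarse, n):
    """Linearly interpolate `coarse` back to length `n`."""
    out = []
    for i in range(n):
        j = i // 2
        if i % 2 == 0:
            out.append(coarse[j])
        else:
            # Reuse the last sample when there is no right-hand neighbour.
            right = coarse[j + 1] if j + 1 < len(coarse) else coarse[j]
            out.append(0.5 * (coarse[j] + right))
    return out

def refactor(data, levels):
    """Return (base, deltas); deltas are ordered from coarsest to finest."""
    deltas = []
    current = list(data)
    for _ in range(levels):
        coarse = decimate(current)
        approx = upsample(coarse, len(current))
        deltas.append([c - a for c, a in zip(current, approx)])
        current = coarse
    deltas.reverse()
    return current, deltas

def reconstruct(base, deltas, use):
    """Rebuild full-length data, applying only the first `use` deltas.

    Levels beyond `use` are filled in by interpolation alone, which mimics
    skipping reads from the lower storage tiers.
    """
    current = list(base)
    for k, delta in enumerate(deltas):
        current = upsample(current, len(delta))
        if k < use:
            current = [c + d for c, d in zip(current, delta)]
    return current

def max_error(a, b):
    """Maximum absolute pointwise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def deltas_needed(data, base, deltas, bound):
    """Smallest number of deltas whose reconstruction meets `bound`.

    A real system would record per-level errors at write time instead of
    consulting the original data; this shortcut is for illustration.
    """
    for use in range(len(deltas) + 1):
        if max_error(data, reconstruct(base, deltas, use)) <= bound:
            return use
    return len(deltas)

if __name__ == "__main__":
    data = [float(i * i % 17) for i in range(16)]
    base, deltas = refactor(data, levels=2)
    use = deltas_needed(data, base, deltas, bound=2.0)
    print(len(base), use, max_error(data, reconstruct(base, deltas, use)))
```

In this sketch, an analysis that tolerates a larger error bound retrieves fewer deltas, which is the sense in which access to lower tiers can be reduced or avoided under interference.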
This dissertation evaluates three data analytics, XGC, GenASiS, and CFD, to understand the impact of SIRIUS and to demonstrate quantitatively that I/O performance can be vastly improved, e.g., by 57% versus no adaptivity and 41% versus single-layer adaptivity, while maintaining acceptable outcomes of data analysis.
Recommended Citation
Qiao, Zhenbo, "Scalable scientific data management on high-performance computing systems" (2021). Dissertations. 1785.
https://digitalcommons.njit.edu/dissertations/1785
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Electronics Commons