Date of Award
Doctor of Philosophy in Electrical Engineering - (Ph.D.)
Electrical and Computer Engineering
Qing Gary Liu
Scientific research generates vast amounts of data, and the scale of that data has grown significantly with advances in scientific applications. To manage this data effectively, lossy data compression techniques are necessary to reduce storage and transmission costs. Nevertheless, the use of lossy compression introduces uncertainty about its performance. This dissertation aims to answer key questions surrounding lossy data compression, such as how compression performance changes with the error bound, how much reduction can be achieved, and how to optimize these techniques for modern scientific data management workflows.
One of the major challenges in adopting lossy compression techniques is the trade-off between data accuracy and compression performance, particularly the compression ratio. This trade-off is not well understood, so users resort to trial and error when selecting appropriate setups. To address this, the dissertation analyzes and estimates the compression performance of two modern lossy compressors, SZ and ZFP, on HPC datasets at various error bounds. The estimation scheme predicts compression ratios from intrinsic metrics collected under a given base error bound, and evaluations using real HPC datasets confirm its effectiveness.
Furthermore, as scientific simulations scale up on HPC systems, the disparity between computation and input/output (I/O) becomes a significant challenge. Error-bounded lossy compression has emerged as a solution to bridge this gap. Nonetheless, the lack of understanding of compression performance hinders its wider adoption. The dissertation addresses this challenge by examining the complex interaction between data, error bounds, and compression algorithms, providing insights into compression performance and its implications for scientific production.
Lastly, the dissertation addresses the performance limitations of progressive data retrieval frameworks for post-hoc data analytics on full-resolution scientific simulation data. Existing frameworks rely on over-pessimistic error control theory, so they fetch more data than is necessary for recomposition and incur additional I/O overhead. To enhance the performance of progressive retrieval, the dissertation leverages deep neural networks to optimize the error control mechanism, reducing unnecessary data fetching and improving overall efficiency.
By tackling these challenges and providing insights, this dissertation contributes to the advancement of scientific data management, lossy data compression techniques, and HPC progressive data retrieval frameworks. The findings and methodologies presented pave the way for more efficient and effective management of large-scale scientific data, facilitating enhanced scientific research and discovery.
For future research, this dissertation highlights the importance of investigating the impact of lossy data compression on downstream analysis. On the one hand, more data reduction can be achieved in scenarios such as image visualization, where the error tolerance is very high, leading to less I/O and communication overhead. On the other hand, post-hoc calculations based on physical properties after compression may lead to misinterpretation, as the statistical information of such properties might be compromised during compression. Therefore, a comprehensive understanding of the impact of lossy data compression in each specific scenario is vital to ensure accurate analysis and interpretation of results.
Wang, Jinzhen, "Toward smart and efficient scientific data management" (2023). Dissertations. 1687.