Canopus: Enabling extreme-scale data analytics on big HPC storage via progressive refactoring
Document Type
Conference Proceeding
Publication Date
1-1-2017
Abstract
High-accuracy scientific simulations on high-performance computing (HPC) platforms generate large amounts of data. To allow this data to be analyzed efficiently, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus allows simulation results to be refactored, with fairly low overhead, into a much smaller base dataset along with a series of deltas. The refactored data are then compressed, mapped, and written onto storage tiers. For data analytics, the refactored data are selectively retrieved to restore the data at a specific level of accuracy that satisfies the analysis requirements. Canopus thus enables end users to trade off analysis speed against accuracy on the fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
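The refactor-then-restore pipeline outlined in the abstract can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the paper's implementation: the function names (refactor, restore), the power-of-two decimation, and the nearest-neighbor prediction are hypothetical stand-ins for Canopus's actual mesh refactoring, compression, and tiered storage placement.

    import numpy as np

    def refactor(full, levels=2):
        # Hypothetical sketch: keep every 2**levels-th sample as the coarse
        # base, then record per-level residuals (deltas) against a
        # nearest-neighbor prediction from the previous, coarser level.
        base = full[:: 2 ** levels]
        deltas = []
        prev = base
        for k in range(levels):
            stride = 2 ** (levels - k - 1)
            cur = full[::stride]                        # next-finer sampling
            predicted = np.repeat(prev, 2)[: len(cur)]  # nearest-neighbor guess
            deltas.append(cur - predicted)              # small, compressible residual
            prev = cur
        return base, deltas

    def restore(base, deltas, upto):
        # upto = 0 returns the base only; upto = len(deltas) restores the
        # data at full accuracy.
        cur = base
        for d in deltas[:upto]:
            cur = np.repeat(cur, 2)[: len(d)] + d
        return cur

    x = np.linspace(0.0, 1.0, 64) ** 2                  # stand-in for simulation output
    base, deltas = refactor(x, levels=2)
    assert np.allclose(restore(base, deltas, 2), x)     # full-accuracy round trip
    coarse = restore(base, deltas, 1)                   # faster, lower-accuracy view

Restoring with zero deltas yields the fastest, least accurate view; each additional delta retrieved (e.g., from a slower storage tier) raises accuracy, mirroring the speed/accuracy trade-off the abstract describes.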
Identifier
85084161356 (Scopus)
Publication Title
9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2017), co-located with USENIX ATC 2017
Grant
17-SC-20-SC
Fund Ref
U.S. Department of Energy
Recommended Citation
Lu, Tao; Suchyta, Eric; Choi, Jong; Podhorszki, Norbert; Klasky, Scott; Liu, Qing; Pugmire, Dave; Wolf, Matthew; and Ainsworth, Mark, "Canopus: Enabling extreme-scale data analytics on big HPC storage via progressive refactoring" (2017). Faculty Publications. 10000.
https://digitalcommons.njit.edu/fac_pubs/10000
