Date of Award
Master of Science (MS)
Jon C Calhoun
Improvements in high-performance computing (HPC) systems have enabled researchers to create powerful applications capable of solving previously intractable problems. While solving these problems, such applications generate vast amounts of data, which stresses the I/O subsystem. Researchers use lossy compression to remedy this issue by reducing the data's size, but, as we demonstrate in this thesis, a single soft error leaves lossy compressed data unusable. Because each bit of lossy compressed data carries a large amount of information, the data is sensitive to soft errors, which is a concern now that soft errors have become commonplace on HPC systems. Yet few works have sought to resolve this significant weakness.
This thesis addresses this gap by performing an extensive soft error assessment and providing an approach to resolving the soft error sensitivity of lossy compressed data. Evaluating the SZ and ZFP lossy compression algorithms, we find that 95.28% of all trials led to error propagation and silent data corruption (SDC). Furthermore, 100% of trials using ZFP led to the same outcome. Our findings also indicate that, on average, a single soft error propagates to approximately 10% of data values. This trend holds for both SZ and ZFP and fluctuates with the compression ratio. Lastly, we find significant drops in the resulting data integrity due to a single soft error. Our findings indicate that all error bounding modes we test are susceptible to soft errors. The only exception is the block-based compression algorithm, which prevents the error from propagating outside the block.
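The propagation effect described above can be illustrated with a minimal sketch. It does not use SZ or ZFP. Instead it assumes a toy one-dimensional, error-bounded compressor that stores the quantized difference between each value and the previously reconstructed value. The names `toy_compress`, `toy_decompress`, `flip_bit`, and `affected_fraction` are hypothetical, and the fraction of corrupted values in this toy depends entirely on where the flip lands, so it says nothing about the percentages reported in the thesis.

```python
from itertools import accumulate
import math


def toy_compress(values, error_bound):
    """Toy predictor + quantizer (NOT SZ or ZFP): each code is the quantized
    difference between a value and the previously reconstructed value."""
    step = 2.0 * error_bound
    codes, total = [], 0
    for x in values:
        code = round((x - total * step) / step)
        codes.append(code)
        total += code  # running sum of codes defines the reconstruction
    return codes


def toy_decompress(codes, error_bound):
    """Rebuild values as the running sum of codes scaled by the quantization step."""
    step = 2.0 * error_bound
    return [total * step for total in accumulate(codes)]


def flip_bit(codes, index, bit):
    """Model a single soft error: flip one bit of one stored code."""
    corrupted = list(codes)
    corrupted[index] ^= 1 << bit
    return corrupted


def affected_fraction(clean, corrupted, error_bound):
    """Fraction of decompressed values that moved by more than the error bound."""
    moved = sum(abs(a - b) > error_bound for a, b in zip(clean, corrupted))
    return moved / len(clean)


values = [math.sin(0.01 * i) for i in range(1000)]
error_bound = 1e-3
codes = toy_compress(values, error_bound)
clean = toy_decompress(codes, error_bound)
corrupted = toy_decompress(flip_bit(codes, index=500, bit=4), error_bound)
print(affected_fraction(clean, corrupted, error_bound))
```

Because every reconstructed value depends on all earlier codes, one flipped bit shifts every value from the flip position onward, and the decompressor raises no error. Resetting the running sum at block boundaries would confine the shift to a single block, which is the intuition behind the block-based exception noted above.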
Leveraging our findings, we develop ARC (Automated Resiliency for Compression). ARC automatically determines and applies the optimal error-correcting code (ECC) configuration to data while respecting user constraints on storage, throughput, and resiliency. ARC's design centers on four main goals: scalability, performance, resiliency, and ease of use. We evaluate ARC against these four goals. Assessing the scalability of ARC's underlying ECC algorithms, we find that each approach scales near linearly, with encoding throughputs ranging from 0.04 to 3730 MB/s and decoding throughputs ranging from 10.64 to 3602 MB/s on a 40-core node. When evaluating how ARC satisfies user constraints, we find that ARC adequately meets user needs whether the constraints synergize or conflict with one another. Evaluating ARC's resiliency, we find that ARC effectively resolves both single-bit and multi-bit soft errors, depending on the provided user constraints. Lastly, to illustrate ARC's ease of use, we demonstrate the four lines of code needed to implement ARC and show how users should consider the failure rate of a system when choosing constraints. Overall, this thesis demonstrates the soft error vulnerabilities of lossy compressed data along with a practical approach to resolving these vulnerabilities.
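As a rough illustration of how an error-correcting code can repair a single-bit soft error, the sketch below implements the textbook Hamming(7,4) code. It is chosen here only because it is a familiar ECC. The abstract does not state which codes ARC applies, and this is not ARC's API. The function names `hamming74_encode` and `hamming74_decode` are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout (1-indexed positions): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-indexed position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
codeword[5] ^= 1  # inject a single-bit soft error
assert hamming74_decode(codeword) == data
```

Hamming(7,4) stores 7 bits for every 4 data bits, a 75% storage overhead, and corrects only one flipped bit per codeword. Trade-offs of this kind between storage, throughput, and resiliency are what user constraints on an ECC configuration have to balance.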
Fulp, Dakota Kent, "Resolving Soft Error Susceptibilities Within Lossy Compressed HPC Data" (2021). All Theses. 3657.