Date of Award

December 2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

School of Computing

Committee Member

Feng Luo

Committee Member

Pradip K Srimani

Committee Member

Rong Ge

Committee Member

Jim Martin

Abstract

High-performance computing (HPC) clusters, which consist of a large number of compute nodes, have traditionally been widely employed in industry and academia to run diverse compute-intensive applications. In recent years, the revolution in data-driven science has produced large volumes of data, often terabytes or petabytes in size, and has driven exponential growth in data-intensive applications. Data-intensive computing presents new challenges to HPC clusters because of its different workload characteristics and optimization objectives. One of those challenges is how to efficiently integrate software frameworks developed for big data analytics, such as Hadoop and Spark, with traditional HPC systems to support both data-intensive and compute-intensive workloads.

To address this challenge, we first present a novel two-level storage system, TLS, that integrates a distributed in-memory storage system with a parallel file system. The former provides memory-speed I/O performance, and the latter provides consistent storage with large capacity. We model its I/O throughput and compare it to that of the Hadoop Distributed File System (HDFS) and OrangeFS (formerly PVFS2). We further build a prototype of TLS with Alluxio (formerly Tachyon) and OrangeFS, and evaluate its performance using MapReduce benchmarks. Both analysis and experiments on real systems show that the proposed storage architecture delivers higher aggregate I/O throughput than HDFS and OrangeFS while retaining weak scalability on both read and write.
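
The two-level idea can be illustrated with a toy sketch: a small, fast in-memory tier sits in front of a large persistent tier. This is not the TLS prototype and does not use the Alluxio or OrangeFS APIs. The class name `TwoLevelStore`, the write-through policy, and the LRU eviction are all assumptions made for illustration, and a plain dictionary stands in for the parallel file system.

```python
from collections import OrderedDict


class TwoLevelStore:
    """Toy two-level store (hypothetical): bounded memory tier over a persistent tier."""

    def __init__(self, memory_capacity, backing):
        self.memory_capacity = memory_capacity  # max number of blocks held in memory
        self.memory = OrderedDict()             # in-memory tier, kept in LRU order
        self.backing = backing                  # dict standing in for the parallel file system

    def write(self, key, data):
        # Assumed policy: write-through, so the persistent tier always has the data.
        self.backing[key] = data
        self._cache(key, data)

    def read(self, key):
        # Serve from memory when possible; otherwise fall back to the persistent tier.
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        data = self.backing[key]
        self._cache(key, data)
        return data, "backing"

    def _cache(self, key, data):
        # Insert into the memory tier and evict least recently used blocks if over capacity.
        self.memory[key] = data
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)


store = TwoLevelStore(memory_capacity=2, backing={})
for name in ("a", "b", "c"):
    store.write(name, name.upper())
first_read = store.read("a")   # "a" was evicted, so it comes from the persistent tier
second_read = store.read("a")  # now cached, so it comes from memory
```

In this sketch, data evicted from memory is never lost because every write also reaches the persistent tier; a real system would choose its own write and eviction policies.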

However, a statically configured in-memory storage system may leave inadequate space for compute-intensive jobs, or miss the opportunity to use more of the available space for data-intensive applications. We therefore develop a dynamic memory controller, DynIMS, which infers the memory demands of compute tasks in real time and employs a feedback-based control model to dynamically adjust the capacity of the in-memory storage system. DynIMS can quickly release in-memory storage capacity for compute-intensive workloads, and can maximize that capacity for data-intensive applications once other compute workloads have finished. We test DynIMS using mixed HPCC and Spark workloads on a production HPC cluster. Experimental results show that DynIMS can achieve up to 5× performance improvement compared to systems with static memory allocations.
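
As a rough illustration of feedback-based capacity adjustment, the sketch below applies one step of simple proportional control. The abstract does not give the DynIMS control model, so the function name `next_storage_capacity`, the proportional form, the `gain`, and the `target_utilization` are all assumptions chosen for illustration.

```python
def next_storage_capacity(current_capacity, memory_used, total_memory,
                          target_utilization=0.95, gain=0.5,
                          min_capacity=0.0, max_capacity=None):
    """One proportional-control step for the in-memory storage capacity (hypothetical).

    All sizes share one unit (for example GB). `memory_used` is the node's
    observed total memory usage, including the in-memory storage itself.
    """
    if max_capacity is None:
        max_capacity = total_memory
    # Positive error: headroom is available, so storage may grow.
    # Negative error: memory pressure, so storage should shrink.
    error = target_utilization * total_memory - memory_used
    proposed = current_capacity + gain * error
    # Keep the result inside the allowed capacity range.
    return max(min_capacity, min(max_capacity, proposed))


# Under pressure (98 of 100 units used), the storage capacity is reduced.
shrunk = next_storage_capacity(40, 98, 100, target_utilization=0.9)
# With plenty of free memory (50 of 100 units used), it is increased.
grown = next_storage_capacity(40, 50, 100, target_utilization=0.9)
```

Calling the function once per monitoring interval gives the shrink-under-pressure, grow-when-idle behavior described above; the gain trades reaction speed against oscillation.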

We expect the work in this dissertation to help accelerate the adoption of big data frameworks for solving data-intensive problems on traditional HPC systems, and to help prepare converged computing infrastructure for both academia and industry.
