Improving HPC system performance by predicting job resources for submitted jobs using machine learning techniques

Abstract

Overestimation of High Performance Computing (HPC) job resource allocations happens because of the wide variety of HPC applications and environment configuration options, and because users often lack knowledge of the complex structure of HPC systems. This overestimation wastes HPC resources, which leads to inefficient cluster utilization, longer wait times, and longer turnaround times for submitted jobs.

With this background, this dissertation aims to investigate the benefits, effects, and challenges of using machine learning techniques for predicting job resources on HPC systems from different perspectives.

First, we developed a machine learning approach based on several supervised discriminative models from the scikit-learn machine learning library, applied to historical Sun Grid Engine (SGE) data provided by Beocat, an HPC service provider at Kansas State University. Our methodology achieved high accuracy in predicting the amount of time and memory required by jobs on Beocat.

Second, we designed a machine learning methodology called the Mixed Account Regression Model (MARM), built on several supervised discriminative machine learning models from the scikit-learn library and LightGBM. We implemented and tested it using historical job accounting data (sacct data) from two HPC providers: RMACC Summit, an XSEDE service provider at the University of Colorado Boulder, and Beocat at Kansas State University. Our models help dramatically reduce average job waiting time and turnaround time, and help achieve higher utilization, throughput, and efficiency for HPC resources.

Third, we introduced AMPRO-HPCC, a first-of-its-kind, fully offline, fully automated, stand-alone, open-source machine learning tool. The tool aims to help HPC administrators and HPC users predict the memory and time requirements of jobs submitted to HPC clusters.

Finally, we investigate the impact of our machine learning models on running jobs in the cloud by comparing the cost of running jobs with and without our models on the most popular cloud computing platforms, including Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, DigitalOcean, and IBM Cloud, as well as on the local resources of the Holland Computing Center at the University of Nebraska-Lincoln.
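As an illustration only, the kind of cost comparison described here could be sketched as follows. This is not the dissertation's cost model: the Job fields, the flat price per GB-hour, and the assumption that a job is billed for the memory and time it reserves are all hypothetical simplifications chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical flat rate in dollars per GB-hour; NOT a real provider price.
PRICE_PER_GB_HOUR = 0.005


@dataclass
class Job:
    """One job with the user's request and a model's prediction (hypothetical fields)."""
    requested_mem_gb: float
    requested_hours: float
    predicted_mem_gb: float
    predicted_hours: float


def reservation_cost(mem_gb: float, hours: float,
                     price: float = PRICE_PER_GB_HOUR) -> float:
    """Cost of reserving mem_gb of memory for the given number of hours."""
    return mem_gb * hours * price


def compare_costs(jobs):
    """Return (cost using user requests, cost using model predictions)."""
    without_ml = sum(
        reservation_cost(j.requested_mem_gb, j.requested_hours) for j in jobs
    )
    with_ml = sum(
        reservation_cost(j.predicted_mem_gb, j.predicted_hours) for j in jobs
    )
    return without_ml, with_ml


if __name__ == "__main__":
    # Made-up example jobs, not data from any cluster.
    example_jobs = [Job(64, 10, 16, 4), Job(32, 24, 8, 12)]
    without_ml, with_ml = compare_costs(example_jobs)
    print(f"without ML: ${without_ml:.2f}, with ML: ${with_ml:.2f}")
```

Real cloud pricing depends on instance type, region, and billing granularity, so a faithful comparison would replace the flat rate with each provider's published prices.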

In summary, this work develops and presents novel methodologies for predicting job resources (memory and time) for jobs submitted to HPC systems, based on historical job data provided by the HPC system's scheduler. The outcomes are expected to dramatically reduce average waiting time and turnaround time for submitted jobs, and to increase utilization and throughput, improve efficiency, and decrease power consumption for HPC systems.

Keywords

High performance computing, Scheduling, Machine learning, Supervised learning

Graduation Month

December

Degree

Doctor of Philosophy

Department

Department of Computer Science

Major Professor

Daniel A. Andresen

Date

2021

Type

Dissertation
