Improving HPC system performance by predicting job resources for submitted jobs using machine learning techniques

dc.contributor.author: Tanash, Mohammed
dc.date.accessioned: 2021-11-12T22:19:06Z
dc.date.available: 2021-11-12T22:19:06Z
dc.date.graduationmonth: December
dc.date.issued: 2021
dc.description.abstract: Overestimation of High Performance Computing (HPC) job resource allocations occurs because of the wide variety of HPC applications, the many environment configuration options, and a lack of knowledge of the complex structure of HPC systems. This overestimation wastes HPC resources, which leads to inefficient cluster utilization, longer wait times, and longer turnaround times for submitted jobs. Against this background, this dissertation investigates the benefits, effects, and challenges of using machine learning techniques to predict job resources on HPC systems from several perspectives. First, we developed a machine learning model based on several supervised discriminative models from the scikit-learn machine learning library, applied to historical Sun Grid Engine (SGE) data provided by Beocat, an HPC service provider at Kansas State University. Our methodology achieved high accuracy in predicting the required time and the required memory for Beocat HPC resources. Second, we designed a machine learning methodology called the Mixed Account Regression Model (MARM), built on several supervised machine learning discriminative models from the scikit-learn library and LightGBM. This work was implemented and tested using historical data (sacct data) from two HPC providers: RMACC Summit, an XSEDE service provider at the University of Colorado-Boulder, and Beocat at Kansas State University. Our models help dramatically reduce average job waiting time and turnaround time, and they help achieve higher utilization, throughput, and efficiency for HPC resources. Third, we introduce AMPRO-HPCC, the first-ever implemented fully offline, fully automated, stand-alone, open-source machine learning tool of its kind. The tool aims to help HPC administrators and HPC users predict memory and time requirements for jobs submitted to HPC clusters. Finally, we investigate the impact of our machine learning models on running jobs in the cloud by comparing the cost of running jobs with and without our models on popular cloud computing platforms, including Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, DigitalOcean, and IBM Cloud, as well as on the local resources of the Holland Computing Center at the University of Nebraska-Lincoln. In summary, this work presents novel methodologies for predicting job resources (memory and time) for submitted jobs on HPC systems, based on historical job data provided by the HPC system's scheduler. The outcomes are expected to dramatically reduce average waiting time and turnaround time for submitted jobs, and to increase utilization and throughput, improve efficiency, and decrease power consumption for HPC systems.
dc.description.advisor: Daniel A. Andresen
dc.description.degree: Doctor of Philosophy
dc.description.department: Department of Computer Science
dc.description.level: Doctoral
dc.identifier.uri: https://hdl.handle.net/2097/41783
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.rights: © the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: High performance computing
dc.subject: Scheduling
dc.subject: Machine learning
dc.subject: Supervised learning
dc.title: Improving HPC system performance by predicting job resources for submitted jobs using machine learning techniques
dc.type: Dissertation

Files

Original bundle

Name: MohammedTanash2021.pdf
Size: 10.83 MB
Format: Adobe Portable Document Format
