Optimizing high performance computing systems' resource utilization and throughput by leveraging machine learning

Abstract

High Performance Computing (HPC) supports a significant portion of research and analytics across many different fields, in both industry and education. HPC is implemented on supercomputers, which can comprise anywhere from a few servers to thousands. HPC systems typically use a scheduler, such as Slurm, to manage the execution of tasks on the system, and schedulers typically have hundreds of configuration parameters. With such diverse workflows and hardware, the question becomes: how do we adapt these HPC schedulers to maintain high utilization and throughput on the systems? Our research focuses on optimizing the Slurm scheduler by adapting its configuration options to the type of hardware in the HPC system and the types of workflows it runs, using semi-supervised machine learning.

Keywords

SLURM, HPC, Machine learning

Graduation Month

May

Degree

Master of Science

Department

Department of Computer Science

Major Professor

Daniel A. Andresen

Date

2021

Type

Thesis