Optimizing high performance computing systems' resource utilization and throughput by leveraging machine learning
Abstract
High Performance Computing (HPC) supports a significant portion of research and analytics across many fields, industries, and educational settings. HPC is implemented using supercomputers, which can comprise anywhere from a few servers to tens or even thousands. HPC systems typically use a scheduler, such as Slurm, to manage the execution of tasks on the system. Schedulers typically have hundreds of configuration parameters. With such diverse workflows and hardware, the question becomes: how do we adapt these HPC schedulers to maintain high utilization and throughput on the systems? Our research focuses on optimizing the Slurm scheduler by adapting its configuration options to the type of hardware in the HPC system and the types of workflows it runs, using semi-supervised machine learning.