Exploring Efficient ML-based Scheduler for Microservices in Heterogenous Clusters

Published: 30 May 2022 · Last Modified: 05 May 2023 · MLArchSys 2022 · Readers: Everyone
Keywords: Data Center Scheduling, ML for Systems
TL;DR: Exploring ML-based Scheduling for Data Centers
Abstract: In recent years, cloud computing has been undergoing a major transformation across its entire system stack, from applications to hardware. Cloud services are increasingly shifting from large monolithic applications to complex graphs of many single-purpose microservices, which offer advantages in both deployment and development. At the same time, cloud datacenters are becoming increasingly heterogeneous as they host more GPUs, FPGAs, and ASICs. While this heterogeneous hardware can both accelerate microservices and expand their capabilities, it further complicates the already complex action space of microservice scheduling. Importantly, the convergence of these changes in applications and hardware raises unique challenges in datacenter scheduling for microservices. Recent work has shown that data-driven Machine Learning (ML) approaches leveraging neural networks can improve applications' end-to-end latency and reduce the probability of QoS violations. However, these works have focused on rather homogeneous clusters, and their approaches may become prohibitive as the scheduling problem grows more complex in heterogeneous datacenters. This paper first analyzes the potential limitations of previous approaches and then explores a new dimension of efficiency in the development of microservice schedulers by incorporating a lightweight ML-based model. To this end, the paper develops a prototype lightweight ML-based scheduler, dubbed Octopus, that harbors a decision tree to efficiently schedule microservices on heterogeneous clusters. Comparisons against conventional scheduling techniques, including First-Fit, Random, and Kubernetes-like schedulers, show that Octopus provides up to 6.35x faster end-to-end latency.
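The abstract does not include code, but the core idea of a decision-tree scheduler can be sketched minimally. Everything below is hypothetical and illustrative only: the feature names, thresholds, and node types are invented for the example and are not taken from the Octopus paper.

```python
# Illustrative sketch only: a tiny hand-rolled decision tree that routes a
# microservice to a node type in a heterogeneous cluster. All features,
# thresholds, and node names here are hypothetical, not from the paper.

def schedule(ms: dict) -> str:
    """Pick a node type for a microservice described by a feature dict."""
    # Compute-heavy services with large batches benefit from accelerators.
    if ms["compute_intensity"] > 0.7:
        if ms["batch_size"] >= 32:
            return "gpu-node"
        return "fpga-node"
    # Latency-critical but light services stay on nearby CPUs to avoid
    # accelerator launch/transfer overheads.
    if ms["qos_slack_ms"] < 5:
        return "cpu-node-local"
    return "cpu-node-shared"

if __name__ == "__main__":
    svc = {"compute_intensity": 0.9, "batch_size": 64, "qos_slack_ms": 20}
    print(schedule(svc))  # gpu-node
```

The appeal of such a model, as the abstract argues, is that inference is a handful of comparisons rather than a neural-network forward pass, which keeps the scheduler itself lightweight even as cluster heterogeneity grows.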