Learning Efficient Parameter Server Synchronization Policies for Distributed SGD

Published: 20 Dec 2019, Last Modified: 05 May 2023. ICLR 2020 Conference Blind Submission.
Keywords: Distributed SGD, Parameter Server, Synchronization Policy, Reinforcement Learning
TL;DR: We apply a reinforcement learning based approach to learn synchronization policies for Parameter Server-based distributed training with SGD.
Abstract: We apply a reinforcement learning (RL) based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS setting, we derive a suitable and compact description of states and actions, allowing us to efficiently use the standard off-the-shelf deep Q-learning algorithm. As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets, and small model variations, and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims we present extensive numerical results obtained from experiments performed in simulated cluster environments. In our experiments, training time is reduced by 44% on average and learned policies generalize to multiple unseen circumstances.
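
For concreteness, the sketch below illustrates the general shape of the approach the abstract describes: Q-learning over a toy parameter-server environment whose state encodes per-worker staleness and training progress, and whose two actions are "proceed without waiting" versus "synchronize". The environment dynamics, reward, state encoding, and all names here are illustrative assumptions and deliberate simplifications (the paper uses a formal synchronization-policy description and a deep Q-network), not the authors' implementation.

```python
# Hypothetical sketch: Q-learning for parameter-server synchronization decisions.
# All dynamics, rewards, and constants are made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N_WORKERS = 4
ACTIONS = 2                  # 0 = proceed (release updates now), 1 = synchronize (wait)
STATE_DIM = N_WORKERS + 1    # per-worker staleness + fraction of training completed

def env_reset():
    return np.zeros(STATE_DIM)

def env_step(state, action):
    """Toy dynamics: synchronizing clears staleness but costs wall-clock time;
    proceeding is cheap but accumulated staleness slows convergence."""
    staleness, progress = state[:N_WORKERS].copy(), state[-1]
    if action == 1:                       # synchronize: pay for the slowest straggler
        time_cost = 1.0 + staleness.max()
        staleness[:] = 0.0
    else:                                 # proceed: fast, but staleness grows
        time_cost = 1.0
        staleness += rng.random(N_WORKERS)
    progress += 0.02 / (1.0 + staleness.mean())   # stale gradients slow progress
    next_state = np.concatenate([staleness, [progress]])
    reward = -time_cost                   # objective: minimize total wall-clock time
    done = progress >= 1.0
    return next_state, reward, done

# Linear Q-function as a stand-in for the paper's deep Q-network.
W = np.zeros((ACTIONS, STATE_DIM))
alpha, gamma, eps = 0.001, 0.99, 0.1

for episode in range(300):
    s = env_reset()
    for t in range(400):                  # cap episode length for the toy environment
        q = W @ s
        a = rng.integers(ACTIONS) if rng.random() < eps else int(q.argmax())
        s2, r, done = env_step(s, a)
        target = r + (0.0 if done else gamma * (W @ s2).max())
        W[a] += alpha * (target - q[a]) * s   # temporal-difference update
        s = s2
        if done:
            break
```

The learned policy in this toy setting trades off straggler waiting time against staleness-induced slowdown, which is the same trade-off the fixed BSP/ASP/SSP policies resolve statically.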