Meta-Q-Learning

Published: 20 Dec 2019, Last Modified: 03 Apr 2024
ICLR 2020 Conference Blind Submission
Readers: Everyone
Keywords: meta reinforcement learning, propensity estimation, off-policy
TL;DR: MQL is a simple off-policy meta-RL algorithm that recycles data from the meta-training replay buffer to adapt to new tasks.
Abstract: This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with the state of the art in meta-RL.
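The third idea, recycling meta-training replay data via propensity estimation, can be illustrated with a small sketch. The snippet below is not the authors' implementation: it assumes a logistic classifier trained to tell new-task transitions (label 1) apart from meta-training buffer transitions (label 0), and uses the implied likelihood ratio as an importance weight on the recycled buffer data during off-policy adaptation. Function and variable names (`propensity_weights`, `new_task_feats`, `buffer_feats`) are illustrative placeholders.

```python
# Minimal sketch of propensity-based reweighting of replay-buffer data
# (assumption: a logistic-regression classifier as the propensity model).
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_weights(new_task_feats, buffer_feats):
    """Return one importance weight per meta-training buffer transition."""
    X = np.vstack([new_task_feats, buffer_feats])
    y = np.concatenate([np.ones(len(new_task_feats)),
                        np.zeros(len(buffer_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # For a logistic model, p(new | x) / p(buffer | x) = exp(logit).
    logits = clf.decision_function(buffer_feats)
    beta = np.exp(logits)
    # Normalize so the weights average to 1 over the buffer batch.
    return beta / beta.mean()

# Usage: weight 512 buffer transitions against 64 new-task transitions (8-dim features).
rng = np.random.default_rng(0)
w = propensity_weights(rng.normal(size=(64, 8)),
                       rng.normal(0.2, 1.0, size=(512, 8)))
print(w.shape, w.mean())  # (512,) with mean ~1.0
```

These weights would multiply the per-transition off-policy losses computed on the recycled buffer data; the official code linked below is the authoritative reference.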
Code: [amazon-research/meta-q-learning](https://github.com/amazon-research/meta-q-learning) + [1 community implementation](https://paperswithcode.com/paper/?openreview=SJeD3CEFPH)
Data: [OpenAI Gym](https://paperswithcode.com/dataset/openai-gym)
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:1910.00125/code)