AMAGO follows a simple and scalable recipe for building RL agents that can generalize:
- Turn meta-learning into a memory problem ("in-context RL").
- Put all of our effort into learning effective memory with end-to-end RL.
- Treat zero-shot generalization and multi-task RL as special cases of meta-learning.
- Then, we can use one method to solve a wide range of problems!
AMAGO is a high-powered off-policy version of RL^2 for training large policies on long sequences. Please refer to our paper for a detailed explanation. Some highlights:
- Broadly Applicable. Long-term memory, meta-learning, multi-task RL, and zero-shot generalization are all special cases of its POMDP format. Supports discrete, continuous, and multi-binary actions. See examples below!
- Scalable. Train large policies on long context sequences across multiple GPUs with parallel actors, asynchronous learning/rollouts, and large replay buffers stored on disk.
- Easy to Use. Quickstart experiments on a broad range of environments. Technical details are easy to customize but designed to require little hyperparameter tuning.
Standard RL agents can only generalize to aspects of their environment that 1) they can observe and 2) that changed during training. In other words, they cannot adapt to changes that are not explicitly revealed, no matter how much experience we collect or how much variety our environment provides.
Meta-RL agents adapt to changes they cannot directly observe; these might range from subtle adjustments of their controls to entirely new reward functions. They do this by exploring their surroundings, inferring what they do not know, and adjusting their decisions to succeed in their current environment.
In-Context RL (ICRL), a.k.a. Black-Box Meta-RL, is a simple approach that lets meta-learning emerge inside a sequence model. The idea is this: RL's goal is to maximize returns, and we could increase returns if we knew more about the environment, so meta-learning will happen naturally. ICRL effectively reduces meta-RL to the problem of training RL with memory. Its main advantage is its flexibility: ICRL blurs formal boundaries between generalization, meta-learning, multi-task RL, and long-term memory by letting us use the same method for every problem!
However, In-Context RL has two key disadvantages:
- Memory in RL is hard, so reducing adaptation to memory may not actually get us very far.
- Sample inefficiency. ICRL is deep RL at its most extreme. We make no assumptions about the problem and let a fancy sequence model figure it out from data... and it'll take a lot of data.
ICRL is not a new idea, but these challenges have limited adoption and prompted research on many alternative approaches. AMAGO is an effort to improve them and push meta-RL beyond toy research problems.
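To make the "meta-learning as memory" idea concrete, here is a minimal sketch in plain Python. It is not AMAGO code: HiddenArmBandit, memoryless_policy, in_context_policy, and run_trial are hypothetical names invented for this illustration, and the hand-written in-context policy only stands in for behavior that a trained sequence model would have to discover from data. The rewarding arm is never observed, so the only way to adapt is to remember earlier actions and rewards.

```python
from typing import Callable, List, Tuple

# The "context": every (action, reward) pair seen so far in the current trial.
History = List[Tuple[int, float]]


class HiddenArmBandit:
    """Two-armed bandit whose rewarding arm is never revealed to the agent."""

    def __init__(self, good_arm: int) -> None:
        self.good_arm = good_arm

    def step(self, action: int) -> float:
        return 1.0 if action == self.good_arm else 0.0


def memoryless_policy(history: History) -> int:
    # Ignores its history, so it cannot react to the hidden arm.
    return 0


def in_context_policy(history: History) -> int:
    # Start with arm 0, repeat an action that was rewarded, otherwise switch.
    if not history:
        return 0
    last_action, last_reward = history[-1]
    return last_action if last_reward > 0 else 1 - last_action


def run_trial(policy: Callable[[History], int], good_arm: int, num_steps: int = 10) -> float:
    """Run one trial and return the total reward collected."""
    env = HiddenArmBandit(good_arm)
    history: History = []
    total = 0.0
    for _ in range(num_steps):
        action = policy(history)
        reward = env.step(action)
        history.append((action, reward))
        total += reward
    return total


if __name__ == "__main__":
    for arm in (0, 1):
        print(arm, run_trial(memoryless_policy, arm), run_trial(in_context_policy, arm))
```

The policy that reads its history loses at most one step to exploration regardless of which arm is rewarding, while the memoryless policy only succeeds when its fixed guess happens to be right.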
# download source
git clone git@github.com:UT-Austin-RPL/amago.git
# make a fresh conda environment with python 3.10
conda create -n amago python==3.10
conda activate amago
# install core agent
pip install -e amago
There are some optional installs for additional features:
- pip install -e amago[flash]: The default Transformer policy has an option for FlashAttention 2.0. FlashAttention leads to significant speedups on long sequences if your GPU is compatible. Please refer to the official installation instructions if you run into issues.
- pip install -e amago[mamba]: Enables Mamba sequence model policies.
- pip install -e amago[envs]: AMAGO comes with built-in support for a wide range of existing and custom meta-RL/generalization/memory domains (amago/envs/builtin) used in our experiments. This command installs (most of) the dependencies you'd need to run the scripts in examples/.
NOTE: AMAGO requires gymnasium <= 0.29. It is not compatible with the recent gymnasium 1.0 release. Please check your gymnasium version if you see environment-related error messages on startup.
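One way to check the installed gymnasium version from Python is sketched below. The helper names (is_supported_gymnasium, installed_gymnasium_version) are hypothetical and not part of AMAGO. The sketch also assumes that any 0.29.x patch release counts as "<= 0.29"; if the package metadata pins something stricter, trust that instead.

```python
import re
from importlib import metadata
from typing import Optional

# Assumption: the newest supported (major, minor) pair is 0.29, patch releases included.
MAX_SUPPORTED = (0, 29)


def is_supported_gymnasium(version: str) -> bool:
    """Return True if a version string such as '0.29.1' is at or below 0.29.x."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) <= MAX_SUPPORTED


def installed_gymnasium_version() -> Optional[str]:
    """Look up the installed gymnasium version, or None if it is not installed."""
    try:
        return metadata.version("gymnasium")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_gymnasium_version()
    if found is None:
        print("gymnasium is not installed")
    elif is_supported_gymnasium(found):
        print(f"gymnasium {found} looks compatible")
    else:
        print(f"gymnasium {found} is newer than 0.29 and may cause startup errors")
```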
This is an active long-term research project. Please be warned that the codebase is not stable and we make breaking changes frequently.
You can read a detailed tutorial in tutorial.md. Full documentation coming soon.
The examples/ folder includes helpful starting points for common cases.
To follow most of the examples you'll need to install the benchmark environments with pip install -e amago[envs]. If you want to log to wandb or check out some of the example results, it's worth reading this section of the tutorial. The public wandb links include example commands (click the "Overview" tab). Building this set of public examples with the new version of AMAGO is an active work in progress.
Use the CUDA_VISIBLE_DEVICES environment variable to assign basic single-GPU examples to a specific GPU index. Most of the examples share a command line interface. Use --help for more information.
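If you launch experiments from a Python script rather than a shell, the same GPU assignment can be sketched as follows. This is only an illustration: env_for_gpu and visible_devices_in_child are hypothetical helpers, and the child command here just prints the variable. Swap in whichever example script you actually want to run.

```python
import os
import subprocess
import sys
from typing import Dict


def env_for_gpu(gpu_index: int) -> Dict[str, str]:
    """Copy the current environment and expose a single GPU index to a child process."""
    child_env = dict(os.environ)  # copy, so the parent process is left untouched
    child_env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return child_env


def visible_devices_in_child(gpu_index: int) -> str:
    """Launch a tiny child process and report the CUDA_VISIBLE_DEVICES value it sees."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env_for_gpu(gpu_index),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(visible_devices_in_child(1))
```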
Learn more about in-context RL with help from an intuitive meta-RL problem. Train an agent to adapt over multiple episodes by learning to avoid its previous mistakes.
Typical RL benchmarks are MDPs and can be treated as a simple special case of the full agent. Memory is often redundant but these tasks can be helpful for testing.
Example wandb for LunarLander-v2 with a Transformer
Example wandb for DM Control Suite Cheetah Run
gymnax is like gymnasium, but 1000x faster! Use jax to add more --parallel_actors and speed up experiments. gymnax includes several interesting memory problems.
Example wandb for MemoryChain-bsuite
Experimental. Support for gymnax is a new feature.
POPGym is a collection of memory unit-tests for RL agents. AMAGO is really good at POPGym and turns most of these tasks into quick experiments for fast prototyping. Our MultiDomainPOPGym env concatenates POPGym domains into a harder one-shot multi-task problem discussed in the follow-up paper.
Example wandb. These settings can be copied across every task in the ICLR paper.
T-Maze is a modified version of the problem featured in Ni et al., 2023. T-Maze answers the question: RL issues (mostly) aside, what is the most distant memory our sequence model can recall? When using Transformers, the answer is usually whatever we can fit on the GPU...
A common meta-RL problem where the environment resets for a fixed number of timesteps (rather than attempts) so that the agent is rewarded for finding a solution quickly in order to finish the task as many times as possible. Loosely based on experiments in Algorithm Distillation.
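The difference between a fixed timestep budget and a fixed number of attempts can be sketched with a toy task. Everything below is hypothetical and written for illustration (Corridor, run_fixed_budget, and the two policies are not AMAGO environments or APIs): the task resets whenever the goal is reached, and the trial ends only when the step budget runs out, so a faster solution earns more completions.

```python
from typing import Callable, Tuple


class Corridor:
    """Toy 1-D task: start at position 0 and reach the goal position."""

    def __init__(self, length: int, goal: int) -> None:
        self.length = length
        self.goal = goal
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action is -1 (left), 0 (stay), or +1 (right); positions are clamped to the corridor
        self.pos = max(0, min(self.length - 1, self.pos + action))
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done


def run_fixed_budget(env: Corridor, policy: Callable[[int, int], int], budget: int) -> int:
    """Reset the task after every success until the timestep budget is exhausted."""
    obs = env.reset()
    completions = 0
    for t in range(budget):
        obs, _reward, done = env.step(policy(obs, t))
        if done:
            completions += 1
            obs = env.reset()
    return completions


def direct_policy(obs: int, t: int) -> int:
    return 1  # always move toward the goal


def slow_policy(obs: int, t: int) -> int:
    return 1 if t % 2 == 0 else 0  # wastes every other step


if __name__ == "__main__":
    print(run_fixed_budget(Corridor(length=5, goal=3), direct_policy, budget=12))
    print(run_fixed_budget(Corridor(length=5, goal=3), slow_policy, budget=12))
```

With the same 12-step budget, the direct policy finishes the task four times and the slow policy only twice, which is the pressure toward fast solutions described above.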
Symbolic version of the DeepMind Alchemy meta-RL domain.
Challenging. Alchemy has a hard local-maximum strategy that can take many samples to break. We've found this domain to be very expensive and hard to tune, though we can usually match the pure-RL (VMPO) baseline from the original paper. We've never used Alchemy in our published results but maintain this script as a starting point.
Example wandb from a recent large-scale attempt with the Multi-Task agent: Actor Process or Learner Process.
Meta-World creates a meta-RL benchmark out of robotic manipulation tasks. Meta-World ML45 is a great example of why we'd want to use the MultiTaskAgent learning update. For much more information please refer to our NeurIPS 2024 paper.
Example wandb (MultiTaskAgent on ML45!).
Multi-Task RL is a special case of meta-RL where the identity of each task is directly provided or can be inferred without memory. We focus on the uncommon setting of learning from unclipped rewards because it isolates the challenge of optimizing distinct reward functions. See the NeurIPS 2024 paper for more.
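A small numeric illustration of why unclipped rewards matter, assuming the common convention of clipping each reward to [-1, 1]. The task names and reward values are made up for this sketch, and clip_reward and episode_return are hypothetical helpers rather than AMAGO functions.

```python
from typing import Dict, List


def clip_reward(reward: float) -> float:
    """Clip a single reward to the range [-1, 1] (assumed clipping convention)."""
    return max(-1.0, min(1.0, reward))


def episode_return(rewards: List[float], clip: bool) -> float:
    """Sum an episode's rewards, optionally clipping each one first."""
    return sum(clip_reward(r) if clip else r for r in rewards)


# Made-up reward sequences for two tasks with very different reward scales.
task_rewards: Dict[str, List[float]] = {
    "task_a": [1.0, 1.0, 1.0],
    "task_b": [100.0, 400.0],
}

unclipped = {name: episode_return(r, clip=False) for name, r in task_rewards.items()}
clipped = {name: episode_return(r, clip=True) for name, r in task_rewards.items()}

if __name__ == "__main__":
    print("unclipped:", unclipped)
    print("clipped:  ", clipped)
```

Clipping squashes both tasks onto a similar scale, while the unclipped returns differ by two orders of magnitude. In this toy case, that gap is what a single agent has to cope with when it optimizes distinct reward functions at once.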
Example wandb for an easy 4-game variant
Multi-Game Procgen has a similar feel to Atari. However, Procgen's procedural generation and partial observability (especially in "memory" mode) are better suited to multi-episodic adaptation. This example highlights the TwoShotMTProcgen setup used by experiments in the second paper.
BabyAI is a collection of procedurally generated gridworld tasks with simple language instructions. We create a fun multi-task variant for adaptive agents.
Example multi-seed report (which uses an outdated version of AMAGO).
XLand-MiniGrid is a jax-accelerated environment that brings the task diversity of AdA to Minigrid/BabyAI-style gridworlds.
Experimental. Support for XLand MiniGrid is a new feature.
A more modern remaster of the famous HalfCheetahVel mujoco meta-RL benchmark, where the cheetah from the HalfCheetah-v4 gymnasium task needs to run at a randomly sampled (hidden) target velocity based on reward signals.
Off-policy learning makes it easy to relabel old sequence data with new rewards. MazeRunner is a goal-conditioned POMDP navigation problem used to discuss & test the hindsight instruction relabeling technique in our paper. This example includes a template for using hindsight relabeling in the new version of AMAGO.
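The general idea of relabeling stored data with new rewards can be sketched in a few lines. This is a generic hindsight goal relabeling example in plain Python, not the instruction relabeling scheme from the paper and not AMAGO's API: Step, sparse_reward, and relabel_with_achieved_goal are hypothetical names, and the sparse reward is an assumed form.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Position = Tuple[int, int]


@dataclass(frozen=True)
class Step:
    position: Position  # where the agent ended up after acting
    goal: Position      # the goal it was asked to reach
    reward: float


def sparse_reward(position: Position, goal: Position) -> float:
    """Assumed reward: 1 when the goal is reached, 0 otherwise."""
    return 1.0 if position == goal else 0.0


def relabel_with_achieved_goal(trajectory: List[Step]) -> List[Step]:
    """Pretend the final position was the goal all along and recompute rewards."""
    if not trajectory:
        return []
    new_goal = trajectory[-1].position
    return [
        replace(step, goal=new_goal, reward=sparse_reward(step.position, new_goal))
        for step in trajectory
    ]


if __name__ == "__main__":
    original_goal = (5, 5)
    visited = [(0, 1), (1, 1), (2, 1)]  # the agent never reaches (5, 5)
    failed = [Step(p, original_goal, sparse_reward(p, original_goal)) for p in visited]
    for step in relabel_with_achieved_goal(failed):
        print(step)
```

The original trajectory carries no reward signal, but the relabeled copy ends in a success. Because the data is reused off-policy, the stored actions and observations stay the same and only the goal and rewards change.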
@inproceedings{
grigsby2024amago,
title={{AMAGO}: Scalable In-Context Reinforcement Learning for Adaptive Agents},
author={Jake Grigsby and Linxi Fan and Yuke Zhu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=M6XWoEdmwf}
}
@inproceedings{
grigsby2024amago2,
title={{AMAGO}-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers},
author={Jake Grigsby and Justin Sasek and Samyak Parajuli and Daniel Adebi and Amy Zhang and Yuke Zhu},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=OSHaRf4TVU}
}