Reproducing experiments
This wiki provides instructions on how to reproduce most of the experiments presented in the NeurIPS 2022 Offline RL workshop paper Towards Data-Driven Offline Simulations for Online Reinforcement Learning by Shengpu Tang, Felipe Vieira Frujeri, Dipendra Misra, Alex Lamb, John Langford, Paul Mineiro, Sebastian Kochman.
To reproduce the experiments, please use the neurips_workshop_2022 branch, which includes notebooks with the experiments. The notebooks will be removed from the main branch so that the offsim4rl library can continue to be developed without breaking the workshop experiments.
Figure 2 in the paper illustrates the fidelity vs. efficiency trade-off between different simulations. See Appendix B.1 in the paper for details.
To see how this figure was produced, refer to the notebook notebooks/metrics.ipynb. Note that the notebook collects experience in memory and hence requires a lot of RAM to run: we used a VM with 56 GB of RAM.
Figure 4 from the paper shows the visitation of both the observations in the continuous grid and the corresponding latent states (after encoding the observations with HOMER).
To train the HOMER encoder, we use a random agent as the behavior policy to collect data. To reproduce this data collection, run:
python examples/continuous_grid/random_agent_rollout.py
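For orientation, the script performs standard random rollouts. Below is a minimal sketch of that kind of data collection, assuming a Gym-style environment; the environment id, episode count, and output format are illustrative assumptions, not the script's exact interface.

```python
# Minimal sketch of random-agent data collection. Assumptions: Gym-style env,
# environment id, episode count, and the pickle output are illustrative only.
import pickle
import gym

env = gym.make("ContinuousGrid-v0")  # hypothetical id for the continuous grid task
transitions = []

for episode in range(1000):
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()              # random behavior policy
        next_obs, reward, done, info = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs

with open("outputs/random_rollouts.pkl", "wb") as f:
    pickle.dump(transitions, f)
```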
To train the encoder from scratch, run examples/continuous_grid/train_homer_encoder.py with the following parameters:
python examples/continuous_grid/train_homer_encoder.py --num_epochs=1000 --seed=0 --batch_size=64 --latent_size=50 --hidden_size=64 --lr=1e-3 --weight_decay=0.0 --temperature_decay=False --output_dir='outputs/models' --num_samples=100000
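For background, HOMER learns its encoder with a temporal contrastive objective: a classifier is trained to distinguish real transitions (x, a, x') from ones where the next observation is swapped in from elsewhere in the dataset, and the encoder bottleneck of that classifier yields the latent state. The sketch below illustrates that idea only; the network shapes, data layout, and loss wiring are assumptions and do not mirror train_homer_encoder.py.

```python
# Sketch of a HOMER-style contrastive encoder (illustrative only; shapes and
# data handling are assumptions, not the offsim4rl implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveEncoder(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_size=64, latent_size=50):
        super().__init__()
        # Encoder maps an observation to a distribution over discrete latent states.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, latent_size),
        )
        # Classifier scores whether (x, a, x') is a real transition.
        self.classifier = nn.Sequential(
            nn.Linear(latent_size + action_dim + latent_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, obs, action, next_obs):
        z = F.softmax(self.encoder(obs), dim=-1)
        z_next = F.softmax(self.encoder(next_obs), dim=-1)
        return self.classifier(torch.cat([z, action, z_next], dim=-1)).squeeze(-1)

def contrastive_loss(model, obs, action, next_obs):
    # Positives: real transitions. Negatives: next observations shuffled
    # within the batch, which breaks the temporal link.
    pos_logits = model(obs, action, next_obs)
    neg_logits = model(obs, action, next_obs[torch.randperm(len(next_obs))])
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```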
Alternatively, you can use the encoder model checkpoint to encode the observations in the dataset collected above. To reproduce Figure 4, which visualizes both the original observation visitation and the latent state representation captured by the encoder, use the notebook notebooks/grid-ppo-plot.ipynb.
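If you only need the encoded dataset, loading the checkpoint and running the encoder over the collected observations is enough. A rough sketch, assuming a PyTorch checkpoint and the rollout format from the sketch above (file paths and the argmax-to-latent-state convention are assumptions):

```python
# Sketch: encode collected observations with a trained encoder checkpoint
# (file paths, saved-model format, and the argmax convention are assumptions).
import pickle
import torch

encoder = torch.load("outputs/models/homer_encoder.pt", map_location="cpu")
encoder.eval()

with open("outputs/random_rollouts.pkl", "rb") as f:
    transitions = pickle.load(f)

observations = torch.tensor([t[0] for t in transitions], dtype=torch.float32)
with torch.no_grad():
    logits = encoder.encoder(observations)   # encoder head as in the sketch above
    latent_states = logits.argmax(dim=-1)    # discrete latent state per observation
```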
For these experiments, we trained a PPO agent with interactions provided by the PSRS-based environment (within the latent space decoded by HOMER on the continuous grid task) and measured both the average episode return within each training epoch and the average episode return in a real validation environment (after each training epoch).
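A rough sketch of how the Spinning Up PPO agent can be pointed at a PSRS-based environment is shown below; the PSRS environment constructor is hypothetical, and the actual wiring used in the paper lives in the notebook referenced next.

```python
# Sketch: train Spinning Up's PPO against a PSRS-based environment.
# The PSRS environment helper below is hypothetical; see the notebook
# referenced in the text for the actual wiring used in the paper.
from spinup import ppo_pytorch as ppo

def make_psrs_env():
    # Hypothetical helper: builds the PSRS simulator over the offline dataset,
    # exposing a Gym-style interface in HOMER's latent space.
    from offsim4rl import make_psrs_grid_env  # assumed name, for illustration only
    return make_psrs_grid_env()

for seed in range(11):  # seeds 0 to 10
    ppo(env_fn=make_psrs_env,
        seed=seed,
        epochs=100,
        logger_kwargs=dict(output_dir=f"outputs/ppo_psrs/seed_{seed}"))
```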
To reproduce those curves, train the PPO agent (we used the Spinning Up implementation), passing the PSRS environment, as in this notebook. Once you have the progress.txt file logged by the Spinning Up framework for each of the seeds (0 to 10), refer to the notebook to plot both the learning and validation curves.
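For the plotting step, each Spinning Up run writes a tab-separated progress.txt; Epoch and AverageEpRet are standard Spinning Up column names, while the directory layout below is an assumption carried over from the training sketch above. A minimal aggregation sketch:

```python
# Sketch: aggregate Spinning Up progress.txt logs across seeds and plot the
# mean learning curve with a std band. Directory layout is assumed to match
# the training sketch above; Epoch and AverageEpRet are standard Spinning Up columns.
import pandas as pd
import matplotlib.pyplot as plt

runs = [pd.read_csv(f"outputs/ppo_psrs/seed_{seed}/progress.txt", sep="\t")
        for seed in range(11)]
returns = pd.concat([r.set_index("Epoch")["AverageEpRet"] for r in runs], axis=1)

mean, std = returns.mean(axis=1), returns.std(axis=1)
plt.plot(returns.index, mean, label="PSRS training return")
plt.fill_between(returns.index, mean - std, mean + std, alpha=0.2)
plt.xlabel("Epoch")
plt.ylabel("Average episode return")
plt.legend()
plt.show()
```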