A PPO Agent Tutorial

This page describes, step by step, how to create a ChainerRL PPO agent. Note that ChainerRL must be set up before you can create agents (check here).
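
If you are not sure whether the setup succeeded, a quick import check (this assumes both packages are already installed) is enough to confirm it:

# Verify that ChainerRL and MarLo are importable before going any further.
import chainerrl
import marlo

print('ChainerRL version:', chainerrl.__version__)
print('MarLo imported from:', marlo.__file__)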

1. Imports and variables

Let us first import PPO and the A3C model interface (whose feedforward softmax policy we will reuse), as well as other Chainer- and ChainerRL-related modules. We will also import numpy and marlo, as these will be used later.

from chainerrl.agents import a3c
from chainerrl.agents import PPO
from chainerrl import links
from chainerrl import misc
from chainerrl.optimizers.nonbias_weight_decay import NonbiasWeightDecay
from chainerrl import policies
import chainer

import logging
import sys
import argparse

import gym
from gym.envs.registration import register

import numpy as np
import marlo
from marlo import experiments
import time

# Tweakable parameters, can be turned into args if needed
gpu = 0  # GPU device id; set to a negative value to run on the CPU
steps = 10 ** 6
eval_n_runs = 10
eval_interval = 10000
update_interval = 2048
outdir = 'results'
lr = 3e-4
bound_mean = False
normalize_obs = False
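
As the comment above notes, these values can be turned into command-line arguments. One way to do so (a sketch only; the flag names are illustrative) is to add them to the argparse parser declared in section 3 and read the values back from the parsed args:

# Illustrative sketch: add these to the parser declared in section 3,
# then replace the hard-coded values above with the parsed ones.
parser.add_argument('--gpu', type=int, default=0)
parser.add_argument('--steps', type=int, default=10 ** 6)
parser.add_argument('--lr', type=float, default=3e-4)
parser.add_argument('--outdir', type=str, default='results')
# after args = parser.parse_args():
# gpu, steps, lr, outdir = args.gpu, args.steps, args.lr, args.outdir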

2. Declaring a policy

We will use the A3C feedforward softmax policy, implemented in the standard fashion shown below:

class A3CFFSoftmax(chainer.ChainList, a3c.A3CModel):
    """An example of A3C feedforward softmax policy."""
    def __init__(self, ndim_obs, n_actions, hidden_sizes=(200, 200)):
        self.pi = policies.SoftmaxPolicy(
            model=links.MLP(ndim_obs, n_actions, hidden_sizes))
        self.v = links.MLP(ndim_obs, 1, hidden_sizes=hidden_sizes)
        super().__init__(self.pi, self.v)

    def pi_and_v(self, state):
        return self.pi(state), self.v(state)
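
Before wiring this into an agent, you can sanity-check the policy on a dummy input; the sizes below are arbitrary and only serve to show the two outputs of pi_and_v():

# Illustrative shape check: 4 observation dimensions, 3 discrete actions,
# and a batch of one.
dummy_model = A3CFFSoftmax(ndim_obs=4, n_actions=3)
dummy_obs = np.zeros((1, 4), dtype=np.float32)
pi_out, v_out = dummy_model.pi_and_v(dummy_obs)
print(pi_out.sample().data)  # a sampled action index, shape (1,)
print(v_out.shape)           # the state value, shape (1, 1)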

3. Creating the environment

First, let us create a phi function that casts observations to float32 (since ChainerRL uses float32, but Gym returns float64!):

def phi(obs):
    return obs.astype(np.float32)
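
A quick demonstration of what phi does (the array below is just an example):

# Gym observations default to float64; phi casts them down to float32.
sample = np.array([0.0, 0.5, 1.0])
print(sample.dtype)       # float64
print(phi(sample).dtype)  # float32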

We originally used command-line arguments as parameters for our agents for testing purposes, and that stuck around. It is good practice (and could be extended to the parameters above), so let's do the same here:

parser = argparse.ArgumentParser(description='ChainerRL PPO example')
# Example missions: 'pig_chase.xml' or 'bb_mission_1.xml' or 'th_mission_1.xml'
parser.add_argument('--rollouts', type=int, default=1, help='number of rollouts')
parser.add_argument('--mission_file', type=str, default="basic.xml", help='the mission xml')
parser.add_argument('--turn_based', action='store_true')
args = parser.parse_args()

turn_based = args.turn_based
number_of_rollouts = args.rollouts	

You can now declare any environment name you like, as long as it does not clash with any of our pre-registered environments, which you can find here. This step is not mandatory, but it is a perfectly valid option:

# Ensure that you have a minecraft-client running with : marlo-server --port 10000
env_name = 'debug-v0'

register(
    id=env_name,
    entry_point='marlo.envs:MinecraftEnv',
    # Make sure mission xml is in the marlo/assets directory.
    kwargs={'mission_file': args.mission_file}
)

With that out of the way, let us create the environment in a typical Gym fashion:

# Ensure that you have a minecraft-client running with : marlo-server --port 10000
env = gym.make('MinecraftCliffWalking1-v0')

env.init(
    allowContinuousMovement=["move", "turn"],
    videoResolution=[800, 600]
    )

or, if you've declared and registered your own environment name:

env = gym.make(env_name)

resolution = [84, 84]  # [800, 600]
config = {'allowDiscreteMovement': ["move", "turn"], 'videoResolution': resolution, "turn_based": turn_based}

env.init(**config)

Marlo environments support a wide range of initialization parameters, as seen here. You can use any of these in the env.init() call.

Currently, the number of available environments is limited, and their string titles can all be found here. Feel free to swap any of these into the gym.make() call above in order to select a different mission to train on.

Finally, let us reset and render the environment, take a sample action, and print out some helpful information.

obs = env.reset()
env.render()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)

print('actions:', str(env.action_space))

The print statements are there solely for debugging; they tend to be rather helpful when something goes wrong while trying to kick an environment off.

4. Initialize the agent

ChainerRL's PPO agent class requires a model parameter, which here is our chosen softmax policy. Therefore, we first need to instantiate that policy for use in the agent:

timestep_limit = env.spec.tags.get(
    'wrapper_config.TimeLimit.max_episode_steps')
obs_space = env.observation_space
action_space = env.action_space

model = A3CFFSoftmax(obs_space.low.size, action_space.n)
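
If you want to confirm what the network is being built with, printing the two sizes first is a harmless sanity check:

# Optional sanity check: the flattened observation size and the number of
# discrete actions that the policy network is constructed with.
print('observation size:', obs_space.low.size)
print('number of actions:', action_space.n)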

We also need an optimizer for the model; in this case we use the Adam algorithm:

opt = chainer.optimizers.Adam(alpha=lr, eps=1e-5)
opt.setup(model)
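
If you want weight decay on everything except the bias parameters, the NonbiasWeightDecay hook imported in section 1 can be attached here; the rate below is only an example value, not a tuned one:

# Optional: apply weight decay to all non-bias parameters.
opt.add_hook(NonbiasWeightDecay(1e-4))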

Finally, we initialize PPO with the policy, optimizer and pre-set variables as declared at the top of the file.

# Initialize the agent
agent = PPO(
            model, opt,
            gpu=gpu,
            phi=phi,
            update_interval=update_interval,
            minibatch_size=64, epochs=10,
            clip_eps_vf=None, entropy_coef=0.0,
        )

5. Decay the learning rate and clipping parameter linearly

This step is part of the standard PPO setup, which calls for a linear decay of the learning rate towards zero:

# Linearly decay the learning rate to zero
def lr_setter(env, agent, value):
   agent.optimizer.alpha = value

lr_decay_hook = experiments.LinearInterpolationHook(
   steps, 3e-4, 0, lr_setter)

and a linear decay of the clipping parameter towards zero:

# Linearly decay the clipping parameter to zero
def clip_eps_setter(env, agent, value):
   agent.clip_eps = value

clip_eps_decay_hook = experiments.LinearInterpolationHook(
   steps, 0.2, 0, clip_eps_setter)
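
Roughly speaking, LinearInterpolationHook computes a value between its start and stop arguments in proportion to the current step and hands it to the setter. A back-of-the-envelope sketch of that behaviour (not ChainerRL's actual internals):

# Rough illustration of the interpolation performed by the hooks above.
def linear_interp(step, total_steps, start, stop):
    return start + (stop - start) * min(step / total_steps, 1.0)

print(linear_interp(5 * 10 ** 5, steps, 3e-4, 0))  # ~1.5e-4 at the halfway point
print(linear_interp(5 * 10 ** 5, steps, 0.2, 0))   # ~0.1 at the halfway point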

6. Start training!

We could loop over the number of episodes and timesteps initialized at the beginning of this file ourselves, calling the agent's act() method as we go, but that is rather cumbersome. Fortunately, ChainerRL provides an easy way to do this via its experiments module (imported here through marlo). Let us call the train_agent_with_evaluation() function on our PPO agent:

# Start training/evaluation
experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    eval_env=env,
    outdir=outdir,
    steps=steps,
    eval_n_runs=eval_n_runs,
    eval_interval=eval_interval,
    max_episode_len=timestep_limit,
    step_hooks=[
        lr_decay_hook,
        clip_eps_decay_hook,
    ],
)
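
Once training finishes (or at any point), the agent can be persisted and reloaded with ChainerRL's built-in save and load methods; the directory name below is just an example:

# Persist the trained agent's model and optimizer state...
agent.save('ppo_trained_agent')
# ...and restore it later, e.g. for evaluation in a fresh process.
agent.load('ppo_trained_agent')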

If you would like to draw out the computational graph of the agent, you may use ChainerRL's draw_computational_graph function (available here as misc.draw_computational_graph, since we imported misc in section 1). Their official example comes from the DQN quickstart and passes a q_func; adapted to this tutorial, you could feed a zero observation through our model's value head instead (if you train on a GPU, do this before creating the agent, while the model is still on the CPU):

import os

# Draw the computational graph and save it in the output directory.
# Here a zero observation is pushed through the model's value head.
sample_obs = np.zeros_like(obs_space.low, dtype=np.float32)[None]
_, v_out = model.pi_and_v(sample_obs)
misc.draw_computational_graph([v_out], os.path.join(outdir, 'model'))

Et voilà! Your agent is now ready to start aggressively walking towards walls for weeks on end as it finds its way through the complex jungle that is Minecraft gameplay!