AmineDiro/Proximal-Policy-Gradient

PyTorch implementation of the Proximal Policy Optimization (PPO) algorithm, inspired by OpenAI Spinning Up.

Implementation of the Proximal Policy Optimization (PPO) paper

  • PPO is a policy gradient method for reinforcement learning.

  • PPO is motivated by two challenges:

    • reducing the variance of the sample estimates by implementing a modified version of the GAE (Generalized Advantage Estimation) algorithm
    • taking the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause a performance collapse
  • PPO lets us do multiple gradient updates per batch of sampled data by keeping the updated policy close to the policy that was used to collect the data. It does so by clipping the objective, and therefore the gradient, whenever the updated policy strays too far from the sampling policy (see the sketch after this list).

  • References:

    • Schulman et al., "Proximal Policy Optimization Algorithms", arXiv:1707.06347
    • OpenAI Spinning Up, PPO documentation: https://spinningup.openai.com/en/latest/algorithms/ppo.html
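
The clipping mechanism can be summarized in a few lines of PyTorch. The snippet below is a minimal sketch of the clipped surrogate loss from the PPO paper, not the exact code used in PPO/train.py; the names logp_new, logp_old, adv and clip_ratio are illustrative.

    import torch

    def ppo_clip_loss(logp_new, logp_old, adv, clip_ratio=0.2):
        # Probability ratio pi_new(a|s) / pi_old(a|s), recovered from log-probabilities.
        ratio = torch.exp(logp_new - logp_old)
        # Clipping removes any incentive to move the policy further than
        # (1 ± clip_ratio) away from the policy that sampled the data.
        clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
        # Negated because we minimize the loss, i.e. maximize the surrogate objective.
        return -torch.min(ratio * adv, clipped).mean()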

Training / Testing

Project Organization

The PPO_Project directory is structured as follows (a sketch of the advantage computation performed by PPOBuffer.py follows the tree):

├── models                  <- Saved PyTorch models, loaded when testing
├── SimplePG                <- Simple Policy Gradient
│   ├── Actor.py            <- Policy architecture, 2-layer NN
│   ├── run.py              <- Used when testing a trained model
│   └── train.py            <- Code for updating the policy
├── PPO
│   ├── ActorCritic.py      <- Policy and value function architecture, 2-layer NN
│   ├── PPOBuffer.py        <- Buffer class storing obs, act, logp and computing advantages
│   ├── run.py              <- Used when testing a trained model
│   └── train.py            <- Code for updating the policy
├── ppo.yaml                <- Miniconda env dependencies
├── results                 <- Directory for results
└── setup.py                <- Run to set up the environment
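
As a reference for what PPOBuffer.py has to compute, the sketch below shows a standard GAE(λ) advantage calculation for a single trajectory. It is an illustration only: the function name, signature, and the gamma/lam defaults are assumptions, not a copy of the repository's buffer code.

    import numpy as np

    def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
        # rewards: shape (T,); values: shape (T,) with V(s_t);
        # last_value: bootstrap value V(s_T), 0.0 if the episode terminated.
        values = np.append(values, last_value)
        adv = np.zeros(len(rewards))
        gae = 0.0
        # Walk backwards through the trajectory, accumulating discounted TD residuals.
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            adv[t] = gae
        return adv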

To run the training, follow these steps:

  1. Clone the repository and cd to the directory

    git clone https://github.com/AmineDiro/Proximal-Policy-Gradient.git 
    cd ./Proximal-Policy-Gradient
  2. Create conda env

    conda env create -f environment.yml
    conda activate test
  3. Training accepts several arguments. To run PPO training, use the command below; you can choose from the list of arguments in the table that follows, some of which are only available for the Simple Policy Gradient (SPG).

    python -m PPO --train

  4. You can also test a pretrained SPG or PPO model by running python -m PPO --env followed by the name of the environment, either CartPole-v0 or LunarLander-v2, for example (a sketch of what such an evaluation does follows these steps):

    python -m PPO --env CartPole-v0
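
Testing essentially loads a saved policy and rolls it out in the chosen environment. Below is a minimal sketch of such an evaluation loop, assuming the classic gym API and a model exposing an act(obs) method; the names and structure are illustrative, not the repository's run.py.

    import gym
    import torch

    def evaluate(model, env_name="CartPole-v0", episodes=5, render=True):
        # Roll out a trained policy for a few episodes and print the returns.
        env = gym.make(env_name)
        for ep in range(episodes):
            obs, done, ep_ret = env.reset(), False, 0.0
            while not done:
                if render:
                    env.render()
                with torch.no_grad():
                    action = model.act(torch.as_tensor(obs, dtype=torch.float32))
                obs, reward, done, _ = env.step(action)
                ep_ret += reward
            print(f"Episode {ep}: return {ep_ret}")
        env.close()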

Arguments

Short  Long           Description                                            Default         PPO  SPG
       --env          Discrete-action environment to use                     "CartPole-v0"   ✔️    ✔️
-e     --epochs       Number of epochs to run training                       5000            ✔️    ✔️
-b     --batch_size   Batch size for training (N*T)                          2               ✔️    ✔️
-se    --save_epoch   Save the model every N epochs                          10              ✔️    ✔️
       --train        Set this flag to train the model                       False           ✔️    ✔️
-r     --render       Flag to avoid visualizing the first epoch of training  False                 ✔️
       --max_len      Max episode length                                     1000                  ✔️
       --lr           Learning rate                                          1e-2                  ✔️
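
For illustration, these flags could be wired together with argparse roughly as shown below. This is a hedged sketch: the parser structure and help strings are assumptions, not necessarily how the repository's entry point defines them.

    import argparse

    def build_parser():
        # Illustrative CLI parser matching the arguments table above.
        parser = argparse.ArgumentParser(description="Train or test PPO / Simple Policy Gradient")
        parser.add_argument("--env", type=str, default="CartPole-v0",
                            help="Discrete-action environment to use")
        parser.add_argument("-e", "--epochs", type=int, default=5000,
                            help="Number of epochs to run training")
        parser.add_argument("-b", "--batch_size", type=int, default=2,
                            help="Batch size for training (N*T)")
        parser.add_argument("-se", "--save_epoch", type=int, default=10,
                            help="Save the model every N epochs")
        parser.add_argument("--train", action="store_true",
                            help="Train the model")
        parser.add_argument("-r", "--render", action="store_true",
                            help="Avoid visualizing the first epoch of training")
        parser.add_argument("--max_len", type=int, default=1000,
                            help="Max episode length")
        parser.add_argument("--lr", type=float, default=1e-2,
                            help="Learning rate")
        return parser

    if __name__ == "__main__":
        print(build_parser().parse_args())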
