PPO (Proximal Policy Optimization) is a policy gradient method for reinforcement learning.
PPO is motivated by two challenges:
- reducing the variance of the sample-based gradient estimates by implementing a modified version of the GAE algorithm (a sketch of standard GAE is shown after this list)
- taking the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse
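Both points rest on advantage estimates computed in `PPOBuffer.py`. As a rough, illustrative sketch (standard GAE, not the repo's modified version), the advantage recursion looks like this; the function and array names are assumptions, not the repo's code:

```python
import numpy as np

# Minimal sketch of standard GAE for one trajectory (illustrative only).
# `rewards` and `values` are 1-D arrays; `last_value` bootstraps the final state.
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    values = np.append(values, last_value)
    adv = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```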
PPO lets us do multiple gradient updates per batch of samples by keeping the updated policy close to the policy that was used to collect the data. It does so by clipping the probability ratio between the two policies, which cuts off the gradient once the updated policy drifts too far from the sampling policy.
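As a rough illustration of the clipping idea (a sketch under assumed names, not the exact code in `PPO/train.py`), the clipped surrogate loss can be written in PyTorch like this:

```python
import torch

# Minimal sketch of the PPO clipped surrogate loss (illustrative only).
# `logp_new`, `logp_old`, `adv` are 1-D tensors gathered from the buffer.
def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    # Once the ratio leaves [1 - eps, 1 + eps], the clipped term stops the
    # gradient from pushing the policy further away from the sampling policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```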
References:
- Proximal Policy Optimization Algorithms, Schulman et al., 2017.
- Trust Region Policy Optimization, Schulman et al., 2015.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al., 2016.
The `PPO_Project` directory is structured as follows:

```
.
├── models               <- Saved PyTorch models, loaded when testing
├── SimplePG             <- Simple Policy Gradient
│   ├── Actor.py         <- Policy architecture, 2-layer NN
│   ├── run.py           <- Used when testing a trained model
│   └── train.py         <- Code for updating the policy
├── PPO
│   ├── ActorCritic.py   <- Policy and value function architecture, 2-layer NN
│   ├── PPOBuffer.py     <- Buffer class needed to store obs, act, pi and compute advantages
│   ├── run.py           <- Used when testing a trained model
│   └── train.py         <- Code for updating the policy
├── ppo.yaml             <- Miniconda env dependencies
├── results              <- Directory for results
└── setup.py             <- Run to set up the environment
```
To run the training, follow these steps:

- Clone the repository and `cd` into it:

  ```bash
  git clone https://github.com/AmineDiro/Proximal-Policy-Gradient.git
  cd ./Proximal-Policy-Gradient
  ```

- Create and activate the conda env:

  ```bash
  conda env create -f environment.yml
  conda activate test
  ```

- Training takes several arguments. To train PPO, run `python -m PPO --train`. You can choose from the list of arguments below; some are only available for Simple Policy Gradient.
You can also test pretrained SPG or PPO agents by running `python -m PPO --env` followed by the name of the environment: `CartPole-v0` or `LunarLander-v2`.
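For example, to test the pretrained PPO agent (loaded from `models`) on `LunarLander-v2`:

```bash
python -m PPO --env LunarLander-v2
```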
| Short | Long | Description | Default | PPO | SPG |
|---|---|---|---|---|---|
|  | --env | Discrete-action environment name | "CartPole-v0" | ✔️ | ✔️ |
| -e | --epochs | Epochs to run training | 5000 | ✔️ | ✔️ |
| -b | --batch_size | Batch size for training (N*T) | 2 | ✔️ | ✔️ |
| -se | --save_epoch | Save the model every N epochs | 10 | ✔️ | ✔️ |
|  | --train | Set this flag to train the model | False | ✔️ | ✔️ |
| -r | --render | Set this flag to avoid visualizing the first epoch of training | False | ✅ | ✔️ |
|  | --max_len | Max episode length | 1000 | ✅ | ✔️ |
|  | --lr | Learning rate | 1e-2 | ✅ | ✔️ |
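For example, combining the flags above, a PPO training run could look like this (the specific values are only illustrative):

```bash
python -m PPO --train --env CartPole-v0 -e 2000 -b 32 -se 50
```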