PPO (Proximal Policy Optimization) is a policy gradient method for reinforcement learning.
PPO is motivated by two challenges:
- reducing the variance of the sample-based gradient estimates by implementing a modified version of the GAE algorithm (a sketch of standard GAE is shown after this list)
- taking the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse
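Both points rest on advantage estimates computed in `PPOBuffer.py`. As a rough, illustrative sketch (standard GAE, not the repo's modified version), the advantage recursion looks like this; the function and array names are assumptions, not the repo's code:

```python
import numpy as np

# Minimal sketch of standard GAE for one trajectory (illustrative only).
# `rewards` and `values` are 1-D arrays; `last_value` bootstraps the final state.
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    values = np.append(values, last_value)
    adv = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```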
PPO lets us do multiple gradient updates per batch of samples by keeping the updated policy close to the policy that was used to collect the data. It does so by clipping the probability ratio between the two policies, which cuts off the gradient once the updated policy drifts too far from the sampling policy.
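As a rough illustration of the clipping idea (a sketch under assumed names, not the exact code in `PPO/train.py`), the clipped surrogate loss can be written in PyTorch like this:

```python
import torch

# Minimal sketch of the PPO clipped surrogate loss (illustrative only).
# `logp_new`, `logp_old`, `adv` are 1-D tensors gathered from the buffer.
def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    # Once the ratio leaves [1 - eps, 1 + eps], the clipped term stops the
    # gradient from pushing the policy further away from the sampling policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```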
References:
- Proximal Policy Optimization Algorithms, Schulman et al., 2017.
- Trust Region Policy Optimization, Schulman et al., 2015.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al., 2016.
The `PPO_Project` directory is structured as follows:

```
.
├── models               <- Saved PyTorch models, loaded when testing
├── SimplePG             <- Simple Policy Gradient
│   ├── Actor.py         <- Policy architecture, 2-layer NN
│   ├── run.py           <- Used when testing a trained model
│   └── train.py         <- Code for updating the policy
├── PPO
│   ├── ActorCritic.py   <- Policy and value function architecture, 2-layer NN
│   ├── PPOBuffer.py     <- Buffer class needed to store obs, act, pi and compute advantages
│   ├── run.py           <- Used when testing a trained model
│   └── train.py         <- Code for updating the policy
├── ppo.yaml             <- Miniconda env dependencies
├── results              <- Directory for results
└── setup.py             <- Run to set up the environment
```
To run the training, follow these steps:

- Clone the repository and `cd` into it:

  ```bash
  git clone https://github.com/AmineDiro/Proximal-Policy-Gradient.git
  cd ./Proximal-Policy-Gradient
  ```

- Create and activate the conda env:

  ```bash
  conda env create -f environment.yml
  conda activate test
  ```

- Training takes several arguments. To train PPO, run `python -m PPO --train`. You can choose from the list of arguments below; some are only available for Simple Policy Gradient.
You can also test pretrained SPG or PPO agents by running `python -m PPO --env` followed by the name of the environment: `CartPole-v0` or `LunarLander-v2`.
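For example, to test the pretrained PPO agent (loaded from `models`) on `LunarLander-v2`:

```bash
python -m PPO --env LunarLander-v2
```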
| Short | Long | Description | Default | PPO | SPG |
|---|---|---|---|---|---|
|  | --env | Discrete-action environment name | "CartPole-v0" | ✔️ | ✔️ |
| -e | --epochs | Epochs to run training | 5000 | ✔️ | ✔️ |
| -b | --batch_size | Batch size for training (N*T) | 2 | ✔️ | ✔️ |
| -se | --save_epoch | Save the model every N epochs | 10 | ✔️ | ✔️ |
|  | --train | Set this flag to train the model | False | ✔️ | ✔️ |
| -r | --render | Set this flag to avoid visualizing the first epoch of training | False | ✅ | ✔️ |
|  | --max_len | Max episode length | 1000 | ✅ | ✔️ |
|  | --lr | Learning rate | 1e-2 | ✅ | ✔️ |
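For example, combining the flags above, a PPO training run could look like this (the specific values are only illustrative):

```bash
python -m PPO --train --env CartPole-v0 -e 2000 -b 32 -se 50
```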