Uses a distributed version of the deep reinforcement learning algorithm PPO to control a grid of traffic lights for optimized traffic flow through the system. The traffic environment is implemented in the realistic traffic simulator SUMO. Multi-agent RL (MARL) is used, with each traffic light acting as a single agent.
SUMO (Simulation of Urban MObility) is a continuous road traffic simulation. TraCI (Traffic Control Interface) connects to a running SUMO simulation from a programming language (in this case Python), allowing inputs to be fed in and outputs to be received.
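As a rough illustration of how this works, a minimal TraCI control loop in Python might look like the following (the config file, lane ID, and traffic light ID below are placeholders, not taken from this repo):

```python
import traci

# Start SUMO headless with a (hypothetical) config file and attach TraCI to it.
traci.start(["sumo", "-c", "grid.sumocfg"])

for step in range(1000):
    traci.simulationStep()  # advance the simulation by one step

    # Input to the agent: e.g. the number of halted cars on an incoming lane.
    halted = traci.lane.getLastStepHaltingNumber("north_in_0")

    # Output from the agent: e.g. switch a traffic light to another phase.
    if halted > 5:
        traci.trafficlight.setPhase("intersection_0", 0)

traci.close()
```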
The environments implemented for this problem are grids where each intersection is controlled by a traffic light. At any given time, either NS or EW traffic can flow, so each intersection has 2 possible configurations. Cars spawn at the edges of the grid and drive to a predefined destination edge, where they despawn.
Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm created by OpenAI. It is efficient, fairly simple, and tends to be the go-to RL algorithm nowadays. There are a lot of great tutorials and code on PPO (this, this and many more).
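For reference, the core of PPO is its clipped surrogate objective; a minimal PyTorch sketch of that loss (not this repo's implementation) looks roughly like this:

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate objectives.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # PPO maximizes the minimum of the two, so the loss is its negative mean.
    return -torch.min(surr1, surr2).mean()
```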
DISCLAIMER: The DPPO implementation here is incorrect. It does not properly aggregate the gradients during training.
Distributed algorithms use multiple processes to speed up existing algorithms such as PPO. There aren't as many simple resources on DPPO, but I used a few different sources noted in my code, such as this repo. I first implemented single-agent RL, which means that in a single environment there is only one agent. In this app's case, that means all traffic lights are controlled by one agent. However, as the grid size increases, the action space grows exponentially.
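To give an idea of what proper gradient aggregation could look like in a distributed setup (a generic sketch using `torch.distributed`, not what this repo does), each worker computes gradients on its own rollouts and the gradients are averaged before every worker applies the same update to the shared parameters:

```python
import torch.distributed as dist

def synced_update(model, optimizer, loss, world_size):
    optimizer.zero_grad()
    loss.backward()

    # Sum each parameter's gradient across all worker processes,
    # then divide so every worker steps with the same averaged gradient.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
```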
For example, the action space for a single intersection is 2, as either the NS light can be green or the EW light can be green.
The number of actions for a 2x2 grid is 2^4 = 16. For example, if 1 means NS is green and 0 means EW is green, then 1011 in binary (11 in decimal) would mean that 3 of the 4 intersections are NS green. This becomes a problem as the grid gets even larger.
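A small sketch of this encoding (names are illustrative): a single-agent joint action for an n-intersection grid is an integer in [0, 2^n), and its bits give each intersection's NS/EW choice.

```python
def decode_joint_action(action, num_intersections):
    # Bit i of the joint action is intersection i's choice:
    # 1 = NS green, 0 = EW green.
    return [(action >> i) & 1 for i in range(num_intersections)]

# 2x2 grid: 2**4 = 16 possible joint actions.
print(decode_joint_action(11, 4))  # 11 = 0b1011 -> [1, 1, 0, 1], 3 of 4 are NS green
```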
Cooperative MARL is a way to fix this "curse of dimensionality" problem. With MARL there are multiple agents in the environment, and in this case each agent controls a single intersection. So now an agent only has 2 possible actions no matter how big the grid gets! MARL also helps with inputs: instead of a single agent needing to handle the states of all intersections (say 4 for a 2x2 grid), each agent only deals with its own. MARL is a great tool in cases where your problem can run into scaling issues.
In the case of this repo, I use independent MARL, which means the agents do not directly communicate. However, the actor and critic parameters are shared across all agents. One trick for better cooperation is to share certain info across agents (beyond just weights); rewards and states are two popular choices. This post by Berkeley goes into this more.
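A rough sketch of what parameter sharing across independent agents can look like (hypothetical network shape and names, not the repo's code): a single actor network is applied separately to each agent's local observation, so every intersection uses the same weights but acts only on its own state.

```python
import torch
import torch.nn as nn

OBS_DIM = 8  # placeholder size of one intersection's local observation

# One shared actor for every intersection: local observation in, 2 logits out (NS vs EW green).
shared_actor = nn.Sequential(
    nn.Linear(OBS_DIM, 64),
    nn.Tanh(),
    nn.Linear(64, 2),
)

def act(local_observations):
    """Sample one action per agent; all agents use the same shared weights."""
    actions = []
    for obs in local_observations:  # one (OBS_DIM,) tensor per intersection
        dist = torch.distributions.Categorical(logits=shared_actor(obs))
        actions.append(dist.sample().item())
    return actions
```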
- numpy
- traci
- sumolib
- scipy
- pytorch
- pandas
You can alter `constants.json` or `constants-grid.json` in `/constants` to change different hyperparameters. In `main.py` you can run experiments with `run_normal` (runs multiple experiments using `constants.json`), `run_random_search` (runs a random search on `constants-grid.json`) or `run_grid_search` (runs a grid search on `constants-grid.json`). You can save and load models. You can also visualize models by running `vis_agent.py` and changing `run(load_model_file=<MODEL FILE NAME>)` to the model file. The 4 envs implemented are 1x1, 2x2, 3x3 and 4x4.
`shape` is the grid, `rush_hour` can be set to true for the 2x2 grid (which adds a kind of rush-hour spawning probability distribution), and `uniform_generation_probability` is the spawn rate for cars when `rush_hour` is false.
"environment": {
"shape": [4, 4],
"rush_hour": false,
"uniform_generation_probability": 0.06
},
Change `num_workers` based on how many processes you want active for the distributed part of DPPO.
"parallel":{
"num_workers": 8
}
Finally, you can change `agent_type` to `rule` if you want a simple rule-based agent to run (it just changes each light after a set amount of time; a sketch of such an agent is shown after the config snippet below). You can also change `single_agent` to true to not use MARL.
"agent": {
"agent_type": "ppo",
"single_agent": false
},
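For a sense of what the rule-based baseline does, here is a minimal sketch under my own naming (not the repo's code): it simply flips each light after a fixed number of steps, ignoring the traffic state.

```python
class RuleAgent:
    """Toggle between NS green (1) and EW green (0) every `period` steps."""

    def __init__(self, period=10):
        self.period = period
        self.step_count = 0
        self.phase = 0

    def act(self, _observation):
        # The observation is ignored; the light changes purely on a timer.
        self.step_count += 1
        if self.step_count % self.period == 0:
            self.phase = 1 - self.phase
        return self.phase
```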