diff --git a/projects/ReinforcementLearning/lunar_lander.ipynb b/projects/ReinforcementLearning/lunar_lander.ipynb
index 30d72f76f..80b6eda54 100644
--- a/projects/ReinforcementLearning/lunar_lander.ipynb
+++ b/projects/ReinforcementLearning/lunar_lander.ipynb
@@ -1,1225 +1,1300 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "execution": {},
- "id": "view-in-github"
- },
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "# Performance Analysis of DQN Algorithm on the Lunar Lander task\n",
- "\n",
- "**By Neuromatch Academy**\n",
- "\n",
- "__Content creators:__ Raghuram Bharadwaj Diddigi, Geraud Nangue Tasse, Yamil Vidal, Sanjukta Krishnagopal, Sara Rajaee\n",
- "\n",
- "__Content editors:__ Shaonan Wang, Spiros Chavlis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "---\n",
- "# Objective\n",
- "\n",
- "In this project, the objective is to analyze the performance of the Deep Q-Learning algorithm on an exciting task- Lunar Lander. Before we describe the task, let us focus on two keywords here - analysis and performance. What exactly do we mean by these keywords in the context of Reinforcement Learning (RL)?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "---\n",
- "# Setup"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "execution": {}
- },
- "outputs": [],
- "source": [
- "# @title Update/Upgrade the system and install libs\n",
- "!apt-get update > /dev/null 2>&1\n",
- "!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1\n",
- "!apt-get install -y swig build-essential python-dev python3-dev > /dev/null 2>&1\n",
- "!apt-get install x11-utils > /dev/null 2>&1\n",
- "!apt-get install xvfb > /dev/null 2>&1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "execution": {}
- },
- "outputs": [
+ "cells": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.0/14.0 MB\u001b[0m \u001b[31m28.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
- "\u001b[?25h"
- ]
- }
- ],
- "source": [
- "# @title Install dependencies\n",
- "!pip install rarfile --quiet\n",
- "!pip install stable-baselines3[extra] --quiet\n",
- "!pip install ale-py --quiet\n",
- "!pip install gym[box2d] --quiet\n",
- "!pip install pyvirtualdisplay --quiet\n",
- "!pip install pyglet --quiet\n",
- "!pip install pygame --quiet\n",
- "!pip install minigrid --quiet\n",
- "!pip install -q swig --quiet\n",
- "!pip install -q gymnasium[box2d] --quiet\n",
- "!pip install 'minigrid<=2.1.1' --quiet\n",
- "!pip3 install box2d-py --quiet"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "sgIXqXYCDBuR"
+ },
+ "source": [
+ "# Performance Analysis of DQN Algorithm on the Lunar Lander task\n",
+ "\n",
+ "**By Neuromatch Academy**\n",
+ "\n",
+ "__Content creators:__ Raghuram Bharadwaj Diddigi, Geraud Nangue Tasse, Yamil Vidal, Sanjukta Krishnagopal, Sara Rajaee\n",
+ "\n",
+ "__Content editors:__ Shaonan Wang, Spiros Chavlis"
+ ]
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/dtypes.py:35: DeprecationWarning: ml_dtypes.float8_e4m3b11 is deprecated. Use ml_dtypes.float8_e4m3b11fnuz\n",
- " from tensorflow.tsl.python.lib.core import pywrap_ml_dtypes\n"
- ]
- }
- ],
- "source": [
- "# Imports\n",
- "import io\n",
- "import os\n",
- "import glob\n",
- "import torch\n",
- "import base64\n",
- "\n",
- "import numpy as np\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "import sys\n",
- "import gymnasium\n",
- "sys.modules[\"gym\"] = gymnasium\n",
- "\n",
- "import stable_baselines3\n",
- "from stable_baselines3 import DQN\n",
- "from stable_baselines3.common.results_plotter import ts2xy, load_results\n",
- "from stable_baselines3.common.callbacks import EvalCallback\n",
- "from stable_baselines3.common.env_util import make_atari_env\n",
- "\n",
- "import gymnasium as gym\n",
- "from gym import spaces\n",
- "from gym.envs.box2d.lunar_lander import *\n",
- "from gym.wrappers.monitoring.video_recorder import VideoRecorder"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "execution": {}
- },
- "outputs": [
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "AKlrTCmFDBuS"
+ },
+ "source": [
+ "---\n",
+ "# Objective\n",
+ "\n",
+ "In this project, the objective is to analyze the performance of the Deep Q-Learning algorithm on an exciting task: Lunar Lander. Before we describe the task, let us focus on two keywords: analysis and performance. What exactly do we mean by these keywords in the context of Reinforcement Learning (RL)?"
+ ]
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
- " and should_run_async(code)\n"
- ]
- }
- ],
- "source": [
- "# @title Play Video function\n",
- "from IPython.display import HTML\n",
- "from base64 import b64encode\n",
- "from pyvirtualdisplay import Display\n",
- "\n",
- "# create the directory to store the video(s)\n",
- "os.makedirs(\"./video\", exist_ok=True)\n",
- "\n",
- "display = Display(visible=False, size=(1400, 900))\n",
- "_ = display.start()\n",
- "\n",
- "\"\"\"\n",
- "Utility functions to enable video recording of gym environment\n",
- "and displaying it.\n",
- "To enable video, just do \"env = wrap_env(env)\"\"\n",
- "\"\"\"\n",
- "def render_mp4(videopath: str) -> str:\n",
- " \"\"\"\n",
- " Gets a string containing a b4-encoded version of the MP4 video\n",
- " at the specified path.\n",
- " \"\"\"\n",
- " mp4 = open(videopath, 'rb').read()\n",
- " base64_encoded_mp4 = b64encode(mp4).decode()\n",
- " return f''"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "---\n",
- "# Introduction\n",
- "\n",
- "In a standard RL setting, an agent learns optimal behavior from an environment through a feedback mechanism to maximize a given objective. Many algorithms have been proposed in the RL literature that an agent can apply to learn the optimal behavior. One such popular algorithm is the Deep Q-Network (DQN). This algorithm makes use of deep neural networks to compute optimal actions. In this project, your goal is to understand the effect of the number of neural network layers on the algorithm's performance. The performance of the algorithm can be evaluated through two metrics - Speed and Stability.\n",
- "\n",
- "**Speed:** How fast the algorithm reaches the maximum possible reward.\n",
- "\n",
- "**Stability** In some applications (especially when online learning is involved), along with speed, stability of the algorithm, i.e., minimal fluctuations in performance, is equally important.\n",
- "\n",
- "In this project, you should investigate the following question:\n",
- "\n",
- "**What is the impact of number of neural network layers on speed and stability of the algorithm?**\n",
- "\n",
- "You do not have to write the DQN code from scratch. We have provided a basic implementation of the DQN algorithm. You only have to tune the hyperparameters (neural network size, learning rate, etc), observe the performance, and analyze. More details on this are provided below.\n",
- "\n",
- "Now, let us discuss the RL task we have chosen, i.e., Lunar Lander. This task consists of the lander and a landing pad marked by two flags. The episode starts with the lander moving downwards due to gravity. The objective is to land safely using different engines available on the lander with zero speed on the landing pad as quickly and fuel efficient as possible. Reward for moving from the top of the screen and landing on landing pad with zero speed is between 100 to 140 points. Each leg ground contact yields a reward of 10 points. Firing main engine leads to a reward of -0.3 points in each frame. Firing the side engine leads to a reward of -0.03 points in each frame. An additional reward of -100 or +100 points is received if the lander crashes or comes to rest respectively which also leads to end of the episode.\n",
- "\n",
- "The input state of the Lunar Lander consists of following components:\n",
- "\n",
- " 1. Horizontal Position\n",
- " 2. Vertical Position\n",
- " 3. Horizontal Velocity\n",
- " 4. Vertical Velocity\n",
- " 5. Angle\n",
- " 6. Angular Velocity\n",
- " 7. Left Leg Contact\n",
- " 8. Right Leg Contact\n",
- "\n",
- "The actions of the agents are:\n",
- " 1. Do Nothing\n",
- " 2. Fire Main Engine\n",
- " 3. Fire Left Engine\n",
- " 4. Fire Right Engine\n",
- "\n",
- "\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "---\n",
- "# Basic DQN Implementation\n",
- "\n",
- "We will now implement the DQN algorithm using the existing code base. We encourage you to understand this example and re-use it in an application/project of your choice!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "Now, let us set some hyperparameters for our algorithm. This is the only part you would play around with, to solve the first part of the project."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "x5XKBDyYDBuS"
+ },
+ "source": [
+ "---\n",
+ "# Setup"
+ ]
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
- " and should_run_async(code)\n"
- ]
- }
- ],
- "source": [
- "nn_layers = [64, 64] # This is the configuration of your neural network. Currently, we have two layers, each consisting of 64 neurons.\n",
- " # If you want three layers with 64 neurons each, set the value to [64,64,64] and so on.\n",
- "\n",
- "learning_rate = 0.001 # This is the step-size with which the gradient descent is carried out.\n",
- " # Tip: Use smaller step-sizes for larger networks."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "Now, let us setup our model and the DQN algorithm."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [],
- "source": [
- "log_dir = \"/tmp/gym/\"\n",
- "os.makedirs(log_dir, exist_ok=True)\n",
- "\n",
- "# Create environment\n",
- "env_name = 'LunarLander-v2'\n",
- "env = gym.make(env_name)\n",
- "# You can also load other environments like cartpole, MountainCar, Acrobot.\n",
- "# Refer to https://gym.openai.com/docs/ for descriptions.\n",
- "\n",
- "# For example, if you would like to load Cartpole,\n",
- "# just replace the above statement with \"env = gym.make('CartPole-v1')\".\n",
- "\n",
- "env = stable_baselines3.common.monitor.Monitor(env, log_dir )\n",
- "\n",
- "callback = EvalCallback(env, log_path=log_dir, deterministic=True) # For evaluating the performance of the agent periodically and logging the results.\n",
- "policy_kwargs = dict(activation_fn=torch.nn.ReLU,\n",
- " net_arch=nn_layers)\n",
- "model = DQN(\"MlpPolicy\", env,policy_kwargs = policy_kwargs,\n",
- " learning_rate=learning_rate,\n",
- " batch_size=1, # for simplicity, we are not doing batch update.\n",
- " buffer_size=1, # size of experience of replay buffer. Set to 1 as batch update is not done\n",
- " learning_starts=1, # learning starts immediately!\n",
- " gamma=0.99, # discount facto. range is between 0 and 1.\n",
- " tau = 1, # the soft update coefficient for updating the target network\n",
- " target_update_interval=1, # update the target network immediately.\n",
- " train_freq=(1,\"step\"), # train the network at every step.\n",
- " max_grad_norm = 10, # the maximum value for the gradient clipping\n",
- " exploration_initial_eps = 1, # initial value of random action probability\n",
- " exploration_fraction = 0.5, # fraction of entire training period over which the exploration rate is reduced\n",
- " gradient_steps = 1, # number of gradient steps\n",
- " seed = 1, # seed for the pseudo random generators\n",
- " verbose=0) # Set verbose to 1 to observe training logs. We encourage you to set the verbose to 1.\n",
- "\n",
- "# You can also experiment with other RL algorithms like A2C, PPO, DDPG etc.\n",
- "# Refer to https://stable-baselines3.readthedocs.io/en/master/guide/examples.html\n",
- "# for documentation. For example, if you would like to run DDPG, just replace \"DQN\" above with \"DDPG\"."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "Before we train the model, let us look at an instance of Lunar Lander **before training**. \n",
- "\n",
- "**Note:** The following code for rendering the video is taken from [here](https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb#scrollTo=T9RpF49oOsZj)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "cellView": "form",
+ "execution": {},
+ "id": "WsHayfTHDBuS"
+ },
+ "outputs": [],
+ "source": [
+ "# @title Update/Upgrade the system and install libs\n",
+ "!apt-get update > /dev/null 2>&1\n",
+ "!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1\n",
+ "!apt-get install -y swig build-essential python-dev python3-dev > /dev/null 2>&1\n",
+ "!apt-get install x11-utils > /dev/null 2>&1\n",
+ "!apt-get install xvfb > /dev/null 2>&1"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "State shape: (8,)\n",
- "Number of actions: 4\n"
- ]
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "execution": {},
+ "id": "6fooEJQSDBuT",
+ "outputId": "73371ac6-9d7e-42e4-acee-5a0636eec589",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Requirement already satisfied: swig in /usr/local/lib/python3.10/dist-packages (4.2.1)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.0/14.0 MB\u001b[0m \u001b[31m61.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25h"
+ ]
+ }
+ ],
+ "source": [
+ "# @title Install dependencies\n",
+ "!pip install rarfile --quiet\n",
+ "!pip install stable-baselines3[extra] --quiet\n",
+ "!pip install ale-py --quiet\n",
+ "!pip install swig\n",
+ "!pip install gym[box2d] --quiet\n",
+ "!pip install pyvirtualdisplay --quiet\n",
+ "!pip install pyglet --quiet\n",
+ "!pip install pygame --quiet\n",
+ "!pip install minigrid --quiet\n",
+ "!pip install -q swig --quiet\n",
+ "!pip install -q gymnasium[box2d] --quiet\n",
+ "!pip install 'minigrid<=2.1.1' --quiet\n",
+ "!pip3 install box2d-py --quiet"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
- " and should_run_async(code)\n"
- ]
- }
- ],
- "source": [
- "env_name = 'LunarLander-v2'\n",
- "env = gym.make(env_name)\n",
- "print('State shape: ', env.observation_space.shape)\n",
- "print('Number of actions: ', env.action_space.n)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "execution": {},
+ "id": "nA2Y9HGUDBuT"
+ },
+ "outputs": [],
+ "source": [
+ "# Imports\n",
+ "import io\n",
+ "import os\n",
+ "import glob\n",
+ "import torch\n",
+ "import base64\n",
+ "\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "import sys\n",
+ "import gymnasium\n",
+ "sys.modules[\"gym\"] = gymnasium\n",
+ "\n",
+ "import stable_baselines3\n",
+ "from stable_baselines3 import DQN\n",
+ "from stable_baselines3.common.results_plotter import ts2xy, load_results\n",
+ "from stable_baselines3.common.callbacks import EvalCallback\n",
+ "from stable_baselines3.common.env_util import make_atari_env\n",
+ "\n",
+ "import gymnasium as gym\n",
+ "from gym import spaces\n",
+ "from gym.envs.box2d.lunar_lander import *\n",
+ "from gym.wrappers.monitoring.video_recorder import VideoRecorder"
+ ]
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/gym/wrappers/monitoring/video_recorder.py:101: DeprecationWarning: \u001b[33mWARN: is marked as deprecated and will be removed in the future.\u001b[0m\n",
- " logger.deprecation(\n"
- ]
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "cellView": "form",
+ "execution": {},
+ "id": "_M-76WwDDBuT",
+ "outputId": "74dde974-1a97-4be7-e7ce-6a2964b602e2",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
+ " and should_run_async(code)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# @title Play Video function\n",
+ "from IPython.display import HTML\n",
+ "from base64 import b64encode\n",
+ "from pyvirtualdisplay import Display\n",
+ "\n",
+ "# create the directory to store the video(s)\n",
+ "os.makedirs(\"./video\", exist_ok=True)\n",
+ "\n",
+ "display = Display(visible=False, size=(1400, 900))\n",
+ "_ = display.start()\n",
+ "\n",
+ "\"\"\"\n",
+ "Utility functions to enable video recording of gym environment\n",
+ "and displaying it.\n",
+ "To enable video, just do \"env = wrap_env(env)\"\n",
+ "\"\"\"\n",
+ "def render_mp4(videopath: str) -> str:\n",
+ " \"\"\"\n",
+ " Gets a string containing a base64-encoded version of the MP4 video\n",
+ " at the specified path.\n",
+ " \"\"\"\n",
+ " mp4 = open(videopath, 'rb').read()\n",
+ " base64_encoded_mp4 = b64encode(mp4).decode()\n",
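+ " return f'<video controls><source src=\"data:video/mp4;base64,{base64_encoded_mp4}\" type=\"video/mp4\"></video>'"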
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Total reward: -597.0358279244006\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "UJnBSc5KDBuU"
+ },
+ "source": [
+ "---\n",
+ "# Introduction\n",
+ "\n",
+ "In a standard RL setting, an agent learns optimal behavior from an environment through a feedback mechanism to maximize a given objective. Many algorithms have been proposed in the RL literature that an agent can apply to learn the optimal behavior. One popular algorithm is the Deep Q-Network (DQN), which uses deep neural networks to compute optimal actions. In this project, your goal is to understand the effect of the number of neural network layers on the algorithm's performance. The performance of the algorithm can be evaluated through two metrics: speed and stability.\n",
+ "\n",
+ "**Speed:** How fast the algorithm reaches the maximum possible reward.\n",
+ "\n",
+ "**Stability:** In some applications (especially when online learning is involved), the stability of the algorithm, i.e., minimal fluctuations in performance, is as important as its speed.\n",
+ "\n",
+ "In this project, you should investigate the following question:\n",
+ "\n",
+ "**What is the impact of the number of neural network layers on the speed and stability of the algorithm?**\n",
+ "\n",
+ "You do not have to write the DQN code from scratch. We have provided a basic implementation of the DQN algorithm. You only have to tune the hyperparameters (neural network size, learning rate, etc.), observe the performance, and analyze the results. More details are provided below.\n",
+ "\n",
+ "Now, let us discuss the RL task we have chosen, i.e., Lunar Lander. This task consists of the lander and a landing pad marked by two flags. The episode starts with the lander moving downwards due to gravity. The objective is to use the lander's engines to land safely on the landing pad with zero speed, as quickly and fuel-efficiently as possible. The reward for moving from the top of the screen to the landing pad and arriving with zero speed is between 100 and 140 points. Each leg's ground contact yields a reward of 10 points. Firing the main engine yields a reward of -0.3 points per frame, and firing a side engine yields -0.03 points per frame. The episode ends when the lander crashes or comes to rest, with an additional reward of -100 or +100 points, respectively.\n",
+ "\n",
+ "The input state of the Lunar Lander consists of the following components:\n",
+ "\n",
+ " 1. Horizontal Position\n",
+ " 2. Vertical Position\n",
+ " 3. Horizontal Velocity\n",
+ " 4. Vertical Velocity\n",
+ " 5. Angle\n",
+ " 6. Angular Velocity\n",
+ " 7. Left Leg Contact\n",
+ " 8. Right Leg Contact\n",
+ "\n",
+ "The actions available to the agent are:\n",
+ "\n",
+ " 1. Do Nothing\n",
+ " 2. Fire Main Engine\n",
+ " 3. Fire Left Engine\n",
+ " 4. Fire Right Engine\n",
+ "\n",
+ "\n",
+ ""
+ ]
},
{
- "data": {
- "text/html": [
- ""
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "XuSpVWuGDBuU"
+ },
+ "source": [
+ "---\n",
+ "# Basic DQN Implementation\n",
+ "\n",
+ "We will now set up the DQN algorithm using the existing implementation from Stable-Baselines3. We encourage you to understand this example and re-use it in an application/project of your choice!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "dYwJRvx-DBuV"
+ },
+ "source": [
+ "Now, let us set some hyperparameters for our algorithm. This is the only part you need to modify to solve the first part of the project."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "execution": {},
+ "id": "MvWRAJiSDBuV",
+ "outputId": "23422e4b-fa32-4edd-d283-31b62668d30e",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
+ " and should_run_async(code)\n"
+ ]
+ }
],
- "text/plain": [
- ""
+ "source": [
+ "nn_layers = [64, 64] # This is the configuration of your neural network. Currently, we have two hidden layers, each consisting of 64 neurons.\n",
+ " # If you want three layers with 64 neurons each, set the value to [64,64,64] and so on.\n",
+ "\n",
+ "learning_rate = 0.001 # This is the step-size with which the gradient descent is carried out.\n",
+ " # Tip: Use smaller step-sizes for larger networks."
]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "env = gym.make(env_name, render_mode=\"rgb_array\")\n",
- "vid = VideoRecorder(env, path=f\"video/{env_name}_pretraining.mp4\")\n",
- "observation = env.reset()[0]\n",
- "\n",
- "total_reward = 0\n",
- "done = False\n",
- "while not done:\n",
- " frame = env.render()\n",
- " vid.capture_frame()\n",
- " action, states = model.predict(observation, deterministic=True)\n",
- " observation, reward, done, info, _ = env.step(action)\n",
- " total_reward += reward\n",
- "vid.close()\n",
- "env.close()\n",
- "print(f\"\\nTotal reward: {total_reward}\")\n",
- "\n",
- "# show video\n",
- "html = render_mp4(f\"video/{env_name}_pretraining.mp4\")\n",
- "HTML(html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "From the video above, we see that the lander has crashed!\n",
- "It is now the time for training!\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "EHLx2d5xDBuV"
+ },
+ "source": [
+ "Now, let us set up our model and the DQN algorithm."
+ ]
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
- " and should_run_async(code)\n"
- ]
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "execution": {},
+ "id": "4PzeDS2dDBuV"
+ },
+ "outputs": [],
+ "source": [
+ "log_dir = \"/tmp/gym/\"\n",
+ "os.makedirs(log_dir, exist_ok=True)\n",
+ "\n",
+ "# Create environment\n",
+ "env_name = 'LunarLander-v2'\n",
+ "env = gym.make(env_name)\n",
+ "# You can also load other environments like cartpole, MountainCar, Acrobot.\n",
+ "# Refer to https://gym.openai.com/docs/ for descriptions.\n",
+ "\n",
+ "# For example, if you would like to load Cartpole,\n",
+ "# just replace the above statement with \"env = gym.make('CartPole-v1')\".\n",
+ "\n",
+ "env = stable_baselines3.common.monitor.Monitor(env, log_dir)\n",
+ "\n",
+ "callback = EvalCallback(env, log_path=log_dir, deterministic=True) # For evaluating the performance of the agent periodically and logging the results.\n",
+ "policy_kwargs = dict(activation_fn=torch.nn.ReLU,\n",
+ " net_arch=nn_layers)\n",
+ "model = DQN(\"MlpPolicy\", env, policy_kwargs=policy_kwargs,\n",
+ " learning_rate=learning_rate,\n",
+ " batch_size=1, # for simplicity, we are not doing batch update.\n",
+ " buffer_size=1, # size of the experience replay buffer. Set to 1 as batch update is not done\n",
+ " learning_starts=1, # learning starts immediately!\n",
+ " gamma=0.99, # discount factor. range is between 0 and 1.\n",
+ " tau = 1, # the soft update coefficient for updating the target network (1 means a hard update)\n",
+ " target_update_interval=1, # update the target network immediately.\n",
+ " train_freq=(1,\"step\"), # train the network at every step.\n",
+ " max_grad_norm = 10, # the maximum value for the gradient clipping\n",
+ " exploration_initial_eps = 1, # initial value of random action probability\n",
+ " exploration_fraction = 0.5, # fraction of entire training period over which the exploration rate is reduced\n",
+ " gradient_steps = 1, # number of gradient steps\n",
+ " seed = 1, # seed for the pseudo random generators\n",
+ " verbose=0) # Set verbose to 1 to observe training logs. We encourage you to set the verbose to 1.\n",
+ "\n",
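+ "# You can also experiment with other RL algorithms like A2C, PPO, DDPG etc.\n",
+ "# Refer to https://stable-baselines3.readthedocs.io/en/master/guide/examples.html\n",
+ "# for documentation. For example, to run A2C, import it, replace \"DQN\" above with \"A2C\",\n",
+ "# and drop the DQN-specific arguments (buffer_size, exploration_fraction, etc.).\n",
+ "# Note that DDPG only supports continuous action spaces, so it cannot be used on LunarLander-v2 as is."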
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Eval num_timesteps=10000, episode_reward=-420.98 +/- 27.22\n",
- "Episode length: 151.80 +/- 30.46\n",
- "New best mean reward!\n",
- "Eval num_timesteps=20000, episode_reward=-561.62 +/- 27.61\n",
- "Episode length: 878.80 +/- 72.71\n",
- "Eval num_timesteps=30000, episode_reward=-249.88 +/- 48.31\n",
- "Episode length: 240.00 +/- 51.61\n",
- "New best mean reward!\n",
- "Eval num_timesteps=40000, episode_reward=-161.24 +/- 24.32\n",
- "Episode length: 338.20 +/- 107.08\n",
- "New best mean reward!\n",
- "Eval num_timesteps=50000, episode_reward=160.32 +/- 108.81\n",
- "Episode length: 241.20 +/- 55.82\n",
- "New best mean reward!\n",
- "Eval num_timesteps=60000, episode_reward=190.88 +/- 14.49\n",
- "Episode length: 646.80 +/- 65.03\n",
- "New best mean reward!\n",
- "Eval num_timesteps=70000, episode_reward=67.05 +/- 92.04\n",
- "Episode length: 139.80 +/- 35.46\n",
- "Eval num_timesteps=80000, episode_reward=267.52 +/- 20.00\n",
- "Episode length: 321.60 +/- 31.12\n",
- "New best mean reward!\n",
- "Eval num_timesteps=90000, episode_reward=67.08 +/- 126.76\n",
- "Episode length: 536.00 +/- 257.21\n",
- "Eval num_timesteps=100000, episode_reward=259.59 +/- 13.39\n",
- "Episode length: 339.80 +/- 19.18\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "1FIshtazDBuW"
+ },
+ "source": [
+ "First, let us watch how the lander behaves **before training**.\n",
+ "\n",
+ "**Note:** The following code for rendering the video is taken from [here](https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb#scrollTo=T9RpF49oOsZj)."
+ ]
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "execution": {},
+ "id": "SyD6VwDhDBuW",
+ "outputId": "1689b33b-720e-4d7f-d8e8-56cabfb398f1",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "State shape: (8,)\n",
+ "Number of actions: 4\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
+ " and should_run_async(code)\n"
+ ]
+ }
+ ],
+ "source": [
+ "env_name = 'LunarLander-v2'\n",
+ "env = gym.make(env_name)\n",
+ "print('State shape: ', env.observation_space.shape)\n",
+ "print('Number of actions: ', env.action_space.n)"
]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "model.learn(total_timesteps=100000, log_interval=10, callback=callback)\n",
- "# The performance of the training will be printed every 10 episodes. Change it to 1, if you wish to\n",
- "# view the performance at every training episode."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "The training takes time. We encourage you to analyze the output logs (set verbose to 1 to print the output logs). The main component of the logs that you should track is \"ep_rew_mean\" (mean of episode rewards). As the training proceeds, the value of \"ep_rew_mean\" should increase. The improvement need not be monotonic, but the trend should be upwards!\n",
- "\n",
- "Along with training, we are also periodically evaluating the performance of the current model during the training. This was reported in logs as follows:\n",
- "\n",
- "```\n",
- "Eval num_timesteps=100000, episode_reward=63.41 +/- 130.02\n",
- "Episode length: 259.80 +/- 47.47\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "Now, let us look at the visual performance of the lander.\n",
- "\n",
- "**Note:** The performance varies across different seeds and runs. This code is not optimized to be stable across all runs and seeds. We hope you will be able to find an optimal configuration!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/gym/wrappers/monitoring/video_recorder.py:101: DeprecationWarning: \u001b[33mWARN: is marked as deprecated and will be removed in the future.\u001b[0m\n",
- " logger.deprecation(\n"
- ]
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "execution": {},
+ "id": "68D-3iePDBuW",
+ "outputId": "3a259ce6-a11c-4027-86b8-be30f9b0d622",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 412
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/gym/wrappers/monitoring/video_recorder.py:101: DeprecationWarning: \u001b[33mWARN: is marked as deprecated and will be removed in the future.\u001b[0m\n",
+ " logger.deprecation(\n",
+ "/usr/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
+ " self.pid = _posixsubprocess.fork_exec(\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\n",
+ "Total reward: -449.2162305654916\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 8
+ }
+ ],
+ "source": [
+ "env = gym.make(env_name, render_mode=\"rgb_array\")\n",
+ "vid = VideoRecorder(env, path=f\"video/{env_name}_pretraining.mp4\")\n",
+ "observation = env.reset()[0]\n",
+ "\n",
+ "total_reward = 0\n",
+ "done = False\n",
+ "while not done:\n",
+ " frame = env.render()\n",
+ " vid.capture_frame()\n",
+ " action, states = model.predict(observation, deterministic=True)\n",
+ " observation, reward, terminated, truncated, info = env.step(action)\n",
+ " done = terminated or truncated  # end on landing/crash or on the time limit\n",
+ " total_reward += reward\n",
+ "vid.close()\n",
+ "env.close()\n",
+ "print(f\"\\nTotal reward: {total_reward}\")\n",
+ "\n",
+ "# show video\n",
+ "html = render_mp4(f\"video/{env_name}_pretraining.mp4\")\n",
+ "HTML(html)"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Total reward: 252.88935234615718\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "fhtq8GDLDBuW"
+ },
+ "source": [
+ "From the video above, we see that the lander has crashed!\n",
+ "It is now time for training!\n"
+ ]
},
{
- "data": {
- "text/html": [
- ""
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "execution": {},
+ "id": "Xhl3ojMwDBuW",
+ "outputId": "c22a910b-0983-438b-dfb6-3cc20d07992e",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
+ " and should_run_async(code)\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Eval num_timesteps=10000, episode_reward=-420.98 +/- 27.22\n",
+ "Episode length: 151.80 +/- 30.46\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=20000, episode_reward=-561.62 +/- 27.61\n",
+ "Episode length: 878.80 +/- 72.71\n",
+ "Eval num_timesteps=30000, episode_reward=-249.88 +/- 48.31\n",
+ "Episode length: 240.00 +/- 51.61\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=40000, episode_reward=-161.24 +/- 24.32\n",
+ "Episode length: 338.20 +/- 107.08\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=50000, episode_reward=160.32 +/- 108.81\n",
+ "Episode length: 241.20 +/- 55.82\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=60000, episode_reward=190.88 +/- 14.49\n",
+ "Episode length: 646.80 +/- 65.03\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=70000, episode_reward=67.05 +/- 92.04\n",
+ "Episode length: 139.80 +/- 35.46\n",
+ "Eval num_timesteps=80000, episode_reward=267.52 +/- 20.00\n",
+ "Episode length: 321.60 +/- 31.12\n",
+ "New best mean reward!\n",
+ "Eval num_timesteps=90000, episode_reward=67.08 +/- 126.76\n",
+ "Episode length: 536.00 +/- 257.21\n",
+ "Eval num_timesteps=100000, episode_reward=259.59 +/- 13.39\n",
+ "Episode length: 339.80 +/- 19.18\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 9
+ }
],
- "text/plain": [
- ""
+ "source": [
+ "model.learn(total_timesteps=100000, log_interval=10, callback=callback)\n",
+ "# Training performance is printed every 10 episodes (log_interval=10). Set log_interval to 1\n",
+ "# to print the performance after every training episode."
]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "env = gym.make(env_name, render_mode=\"rgb_array\")\n",
- "vid = VideoRecorder(env, path=f\"video/{env_name}_learned.mp4\")\n",
- "observation = env.reset()[0]\n",
- "\n",
- "total_reward = 0\n",
- "done = False\n",
- "while not done:\n",
- " frame = env.render()\n",
- " vid.capture_frame()\n",
- " action, states = model.predict(observation, deterministic=True)\n",
- " observation, reward, done, info, _ = env.step(action)\n",
- " total_reward += reward\n",
- "vid.close()\n",
- "env.close()\n",
- "print(f\"\\nTotal reward: {total_reward}\")\n",
- "\n",
- "# show video\n",
- "html = render_mp4(f\"video/{env_name}_learned.mp4\")\n",
- "HTML(html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "execution": {}
- },
- "source": [
- "The lander has landed safely!!\n",
- "\n",
- "Let us analyze its performance (speed and stability). For this purpose, we plot the number of time steps on the x-axis and the episodic reward given by the trained model on the y-axis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "execution": {}
- },
- "outputs": [
+ },
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
- " and should_run_async(code)\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {
+ "execution": {},
+ "id": "IYynM83tDBuX"
+ },
+ "source": [
+ "Training takes time. We encourage you to analyze the output logs (set verbose to 1 to print them). The main quantity to track is \"ep_rew_mean\" (the mean of the episode rewards). As training proceeds, the value of \"ep_rew_mean\" should increase. The improvement need not be monotonic, but the trend should be upward!\n",
+ "\n",
+ "Alongside training, we also periodically evaluate the current model. Each evaluation is reported in the logs as follows:\n",
+ "\n",
+ "```\n",
+ "Eval num_timesteps=100000, episode_reward=259.59 +/- 13.39\n",
+ "Episode length: 339.80 +/- 19.18\n",
+ "```"
+ ]
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- "