Adds some extra notes on RL training #1

Open · wants to merge 1 commit into base: master
32 changes: 30 additions & 2 deletions 08 LectureReinforcementLearning.ipynb
@@ -1176,8 +1176,36 @@
"## Experience Replay\n",
"Each state, action, reward and new state (known as transitions) are saved in a \"replay memory\", and at each step, a random sample of transitions are taken to train the network with. This is known as experience replay, and has a few benefits, including greater data efficiency (each state transition is used more than once) and more efficient learning (randomly sampled states are less correlated than sequential states). This also avoids oscillation and divergence, because the current state is not entirely dependent on the model's parameters at that time. The replay memory can holds the last 1,000,000 frames.\n",
"\n",
"Perhaps the most impactful outcome of experience replay is breaking undesirable temporal correlations amongst frames exposed to the network in training. [Liu, Zou, 2017](https://arxiv.org/abs/1710.06574#:~:text=Experience%20replay%20is%20a%20key,the%20properties%20of%20experience%20replay)\n",
"\n",
"## Target Network\n",
"The authors of DQN followed up with another technique in which 2 separate Q networks are used, one to train, and one to calculate the target value during training. Every 10,000 steps, the parameters from the trained network are copied over to the target network. This also avoids oscillation and divergence.\n"
"The authors of DQN followed up with another technique in which 2 separate Q networks are used, one to train, and one to calculate the target value during training. Every 10,000 steps, the parameters from the trained network are copied over to the target network. This also avoids oscillation and divergence. \n",
"\n",
"A useful analogy for the problem here is to think of the network as a dog chasing it's own tail. The Q network is used in the calculation of the Q loss in the Bellman equation. Due to this as the network evolves so does the Q matrix that the network is attempting to model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Advanced Training Techniques\n",
"\n",
"## Hindsight Experience Replay\n",
"Reinforcement learning tasks often have extremely sparse rewards. This often causes the network to be stagnant for extremely long periods of exploration - particularly when performing difficult tasks. [Hindsight Experirence Replay \\[Andrychowicz 2017, (OpenAI)\\]](https://papers.nips.cc/paper/7090-hindsight-experience-replay.pdf) was proposed as a method to mitigate issue.\n",
"\n",
"In Hindsight Experience Replay (HER) the training algorithm starts out like normal experience replay. Tuples of frames, actions, state, and next state are taken from training and used to populate the replay memory.\n",
"\n",
"HER differs from Experience Replay in it's usage of the stored frames. Instead of plainly using the stored frames in training the frames are used with an artificially inflated reward. The inflated reward decreases over the duration of training. \n",
"\n",
"This technique encourages the model to build a correlation between actions and states. It is thought that this improves the networks internal representation of the task state. Even if the action taken is incorrect this is movement in the right direction for the model and ultimately the model will correct these issues later in training.\n",
"\n",
"## Curiousity Driven Agents\n",
"[Curiousity driven agents](https://pathak22.github.io/noreward-rl/) were first proposed to overcame sparse reward functions in environments. In their first iteration **completely ignoring there reward function**. These agents performed more effectively than traditional RL agents on a wide variety of models including Mario [Pathak 2017 (UC Berkeley)](https://pathak22.github.io/noreward-rl/). \n",
"\n",
"These agents were improved upon by combining this technique with existing techniques as well as refining the \"curiosity\" function to get state of the art results [Savinov 2018 (Google Brain)](https://ai.googleblog.com/2018/10/curiosity-and-procrastination-in.html).\n",
"\n",
"Curiosity driven agents are rewarded for finding states for which they form an ineffective internal representation. There are various methods out there to implement this - but the core idea is to motivate the agent to find states it has never seen before."
]
},
{
@@ -1889,7 +1917,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.7.6"
}
},
"nbformat": 4,