Skip to content

Latest commit

 

History

History
21 lines (15 loc) · 1.11 KB

QLearning.md

File metadata and controls

21 lines (15 loc) · 1.11 KB

Q-Learning

Overview

Q-learning is an Off-Policy Temporal Difference Value-Based algorithm that learns an action-value function to find the optimal policy.
It has an Experience Replay to sample experiences from different policies to calculate the TD Erros.
The action-value function is updated using MSE of the TD Errors.
The most common implicity policy is the e-greedy but this implementation has an Bayesian approach using Dropout.

References

WATKINS, Christopher. Learning from Delayed Rewards. Cambridge, UK: King’s College, 1989.
MNIH, Volodymyr; et al.Playing Atari with Deep Reinforcement Learning. 2013
MNIH, Volodymyr; et al. Human-level control through deep reinforcement learning. Nature, Vol 518, fev. 2015.