Commit

More work on RL.
mhahsler committed Jan 5, 2024
1 parent 11c8e3b commit 1bc4d9e
Showing 13 changed files with 6,343 additions and 1,426 deletions.
815 changes: 436 additions & 379 deletions RL/MDP.html
414 changes: 37 additions & 377 deletions RL/MDP.qmd
1,300 changes: 908 additions & 392 deletions RL/QLearning.html
409 changes: 131 additions & 278 deletions RL/QLearning.qmd
4,170 changes: 4,170 additions & 0 deletions RL/TD-Control.html
261 changes: 261 additions & 0 deletions RL/TD-Control.qmd

---
title: "Reinforcement Learning: TD Control with Sarsa"
author: "Michael Hahsler"
format:
html:
theme: default
toc: true
number-sections: true
code-line-numbers: true
embed-resources: true
---

This code is provided under the [Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License](https://creativecommons.org/licenses/by-sa/4.0/).

![CC BY-SA 4.0](https://licensebuttons.net/l/by-sa/3.0/88x31.png)

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(tidy = TRUE)
options(digits = 2)
```

# Introduction

[Reinforcement Learning: An Introduction (RL)](http://incompleteideas.net/book/the-book-2nd.html) by Sutton and Barto (2020) introduces several temporal-difference learning control algorithms in
Chapter 6. Here we implement the on-policy TD control algorithm Sarsa.

We will implement the
key concepts using R for the AIMA 4x3 grid world example. The environment is an MDP, but instead of solving the MDP to estimate the value function $U(s)$,
we will learn the Q-function $Q(s,a)$ directly.
The Q-function is the expected sum of future rewards when starting in state $s$ and taking action $a$.
The transition model is only used to simulate the response of the environment
to the actions of the agent.
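
Formally, with discount factor $\gamma$ and reward $R_t$ received at time $t$ (following the standard definition), this is

$$
Q(s, a) = E\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \;\middle|\; S_0 = s, A_0 = a\right],
$$

where the actions after the first one are chosen by the agent's policy.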

The code in this notebook defines explicit functions
matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications
instead.


{{< include _AIMA-4x3-gridworld.qmd >}}

# Implementing the Temporal-Difference Learning Algorithm

Here is the pseudo code for Sarsa from the RL book,
Chapter 6.4:

![Reinforcement Learning Chapter 6.4: Sarsa](figures/RL_Sarsa.png)

The algorithm uses __temporal-difference (TD) learning__ since it updates
the Q-value estimate using the TD error given by $R + \gamma\,Q(S',A') - Q(S, A)$.
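
Written out, the update performed at every step (the update rule shown in the pseudo code above) is

$$
Q(S, A) \leftarrow Q(S, A) + \alpha\,\big[R + \gamma\, Q(S', A') - Q(S, A)\big],
$$

where $\alpha$ is the learning rate (step size).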

The algorithm performs __on-policy learning__ since it uses
a single policy (e.g., $\epsilon$-greedy) as both the behavior and
the target policy.


## Behavior and Target Policies

Next, we implement the action choice for the greedy and the $\epsilon$-greedy policy
given $Q$. The greedy policy is deterministic and always chooses an action with the
highest $Q$-value in the current state.
The $\epsilon$-greedy policy is a stochastic policy that chooses the greedy
action with probability $1 - \epsilon$ and a random available action otherwise.
Setting $\epsilon = 0$ reduces the $\epsilon$-greedy policy to the
deterministic greedy policy.
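
In terms of action probabilities (this is what `greedy_prob()` below computes), the $\epsilon$-greedy policy is

$$
\pi(a \mid s) = \begin{cases}
1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} & \text{if } a = \mathrm{argmax}_{a'}\, Q(s, a'),\\
\frac{\epsilon}{|\mathcal{A}(s)|} & \text{otherwise},
\end{cases}
$$

where $\mathcal{A}(s)$ is the set of actions available in state $s$.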

```{r}
greedy_action <- function(s, Q, epsilon = 0) {
  available_A <- actions(s)
  
  # exploit: choose the available action with the highest Q-value
  if (epsilon == 0 ||
      length(available_A) == 1L || runif(1) > epsilon) {
    a <- available_A[which.max(Q[s, available_A])]
  } else {
    # explore: choose a random available action
    a <- sample(available_A, size = 1L)
  }
  
  a
}

greedy_prob <- function(s, Q, epsilon = 0) {
  # action probabilities of the epsilon-greedy policy in state s
  p <- structure(rep(0, length(A)), names = A)
  available_A <- actions(s)
  a <- available_A[which.max(Q[s, available_A])]
  p[a] <- 1 - epsilon
  p[available_A] <- p[available_A] + epsilon / length(available_A)
  
  p
}
```

## The TD Learning Algorithm

The temporal-difference learning function below follows the Sarsa pseudo code above,
but changing only how the TD error is calculated lets the same function
also perform Q-learning and expected Sarsa.
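
The three methods differ only in the TD target used to form the TD error (compare the three branches in the function below):

| `method`            | TD target                                      |
|---------------------|------------------------------------------------|
| `"sarsa"`           | $R + \gamma\, Q(S', A')$                       |
| `"q"` (Q-learning)  | $R + \gamma\, \max_a Q(S', a)$                 |
| `"expected_sarsa"`  | $R + \gamma \sum_a \pi(a \mid S')\, Q(S', a)$  |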

```{r}
DT_learning <- function(method = "sarsa",
                        alpha = 0.1,
                        epsilon = 0.1,
                        gamma = 1,
                        N = 100,
                        verbose = FALSE) {
  method <- match.arg(method, c("sarsa", "q", "expected_sarsa"))
  
  # Initialize Q: unavailable actions stay NA, available actions start at 0
  Q <- matrix(
    NA_real_,
    nrow = length(S),
    ncol = length(A),
    dimnames = list(S, A)
  )
  for (s in S)
    Q[s, actions(s)] <- 0
  
  # loop over episodes
  for (e in seq(N)) {
    s <- start
    a <- greedy_action(s, Q, epsilon)
    
    # loop over the steps in the episode
    i <- 1L
    while (TRUE) {
      s_prime <- sample_transition(s, a)
      r <- R(s, a, s_prime)
      a_prime <- greedy_action(s_prime, Q, epsilon)
      
      if (verbose) {
        if (i == 1L)
          cat("\n*** Episode", e, "***\n")
        cat("Step", i, "- s a r s' a':", s, a, r, s_prime, a_prime, "\n")
      }
      
      if (method == "sarsa") {
        # is called Sarsa because it uses the sequence s, a, r, s', a'
        Q[s, a] <-
          Q[s, a] + alpha * (r + gamma * Q[s_prime, a_prime] - Q[s, a])
      } else if (method == "q") {
        # a' in the target is greedy instead of using the behavior policy
        a_max <- greedy_action(s_prime, Q, epsilon = 0)
        Q[s, a] <-
          Q[s, a] + alpha * (r + gamma * Q[s_prime, a_max] - Q[s, a])
      } else if (method == "expected_sarsa") {
        # the target uses the expected Q-value under the epsilon-greedy policy
        p <- greedy_prob(s_prime, Q, epsilon)
        exp_Q_prime <- sum(p * Q[s_prime, ], na.rm = TRUE)
        Q[s, a] <-
          Q[s, a] + alpha * (r + gamma * exp_Q_prime - Q[s, a])
      }
      
      s <- s_prime
      a <- a_prime
      
      if (is_terminal(s))
        break
      
      i <- i + 1L
    }
  }
  
  Q
}
```

## Sarsa

Sarsa is on-policy and calculates the TD error for the update as:

$$
R + \gamma\,Q(S', A') - Q(S, A),
$$
where $S'$ and $A'$ are determined by the same policy that is used for the agent's
behavior, in this case $\epsilon$-greedy.


```{r}
Q <- DT_learning(method = "sarsa", N = 10000, verbose = FALSE)
Q
```


Calculate the value function $U$ from the learned Q-function as the largest
$Q$-value of any available action in each state.
```{r}
U <- apply(Q, MARGIN = 1, max, na.rm = TRUE)
show_layout(U)
```

Extract the greedy policy for the learned $Q$-value function.

```{r}
pi <- A[apply(Q, MARGIN = 1, which.max)]
show_layout(pi)
```

## Q-Learning

Q-Learning is off-policy and calculates the TD error for the update as:

$$
R + \gamma\,\max_a Q(S', a) - Q(S, A),
$$
where the target policy is the greedy policy, reflected by the maximum, which
always chooses the action with the largest $Q$-value.

```{r}
Q <- DT_learning(method = "q", N = 10000, verbose = FALSE)
Q
```


Calculate the value function $U$ from the learned Q-function as the largest
$Q$-value of any available action in each state.
```{r}
U <- apply(Q, MARGIN = 1, max, na.rm = TRUE)
show_layout(U)
```

Extract the greedy policy for the learned $Q$-value function.

```{r}
pi <- A[apply(Q, MARGIN = 1, which.max)]
show_layout(pi)
```

## Expected Sarsa

Expected Sarsa calculates the TD error for the update as:

$$
R + \gamma\, E_\pi[Q(S', A')] - Q(S, A) = R + \gamma \sum_a\pi(a \mid S')\,Q(S', a) - Q(S, A),
$$
using the expected value of $Q$ under the current policy. Expected Sarsa moves
deterministically in the same direction that Sarsa moves in expectation. Because it
uses the expectation instead of a sampled action,
we can set $\alpha$ to large values, even $\alpha = 1$.

```{r}
Q <- DT_learning(method = "expected_sarsa", N = 10000, alpha = 1, verbose = FALSE)
Q
```

Calculate the value function $U$ from the learned Q-function as the largest
$Q$-value of any available action in each state.
```{r}
U <- apply(Q, MARGIN = 1, max, na.rm = TRUE)
show_layout(U)
```

Extract the greedy policy for the learned $Q$-value function.

```{r}
pi <- A[apply(Q, MARGIN = 1, which.max)]
show_layout(pi)
```

## Reducing $\epsilon$ and $\alpha$ Over Time

To improve convergence, $\epsilon$ and $\alpha$ are typically reduced
slowly over time. This is not implemented here.
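
A minimal sketch of what such schedules could look like (the exponential decay and all constants here are assumptions for illustration, not part of the implementation above):

```{r}
# Hypothetical decay schedule: start at value0 and decay exponentially
# toward min_value as the episode number e grows.
decay_schedule <- function(value0, min_value, decay, e)
  pmax(min_value, value0 * decay^e)

# Example: epsilon over the course of 10,000 episodes
decay_schedule(value0 = 0.5, min_value = 0.01, decay = 0.999,
               e = c(1, 10, 100, 1000, 10000))
```

Inside `DT_learning()`, $\epsilon$ and $\alpha$ would then be recomputed from the episode counter `e` at the beginning of each episode instead of being held constant.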