
Solution to problem berkeleydeeprlcourse#5
Abdelrahman Ogail committed Apr 19, 2019
1 parent 2919a91 commit 42a1fa1
Showing 3 changed files with 42 additions and 20 deletions.
39 changes: 30 additions & 9 deletions hw2/README.md
@@ -1,13 +1,14 @@
# CS294-112 HW 2: Policy Gradient

Dependencies:

- Python **3.5**
- Numpy version **1.14.5**
- TensorFlow version **1.10.5**
- MuJoCo version **1.50** and mujoco-py **1.50.1.56**
- OpenAI Gym version **0.10.5**
- seaborn
- Box2D==**2.3.2**

Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.

@@ -16,11 +17,15 @@ The only file that you need to look at is `train_pg_f18.py`, which you will impl
See the [HW2 PDF](http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf) for further instructions.

# Answers to Homework Experiments

## Problem 4 (CartPole)

### Summary

The benchmark ran multiple experiments sweeping three settings: reward-to-go vs. full-trajectory (Monte Carlo) returns, advantage normalization vs. no normalization, and large vs. small batch size. Each experiment ran for 100 iterations, and each configuration was repeated 3 times to gauge variance. General observations (a sketch of the two return estimators follows this list):

- Convergence: using reward-to-go resulted in faster convergence than full-trajectory (Monte Carlo) returns
- Variance: increasing the batch size and normalizing advantages both helped reduce the variance
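
For reference, here is a minimal NumPy sketch of the two return estimators being compared; the function names are illustrative and not taken from `train_pg_f18.py`:

```python
import numpy as np

def full_trajectory_returns(rewards, gamma=1.0):
    """Monte Carlo variant: every timestep is credited with the full discounted return."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go(rewards, gamma=1.0):
    """Reward-to-go: each timestep is credited only with the rewards that follow it."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rews = [1.0, 1.0, 1.0]
print(full_trajectory_returns(rews))  # [3. 3. 3.]
print(reward_to_go(rews))             # [3. 2. 1.]
```

Reward-to-go drops rewards earned before an action from that action's weight, which removes terms that only add noise to the gradient estimate; that is why it converges faster in the runs above.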

### Plots

@@ -29,11 +34,27 @@ The benchmark included running multiple experiments like
![](fig/sb_CartPole-v0.png)

### Answers

Q1- Which gradient estimator has better performance without advantage-centering: the trajectory-centric one, or the one using reward-to-go?

> The reward-to-go estimator is better because it has lower variance.

Q2- Did advantage centering help?

> Yes, it helped reduce the variance and sped up convergence a bit (see the normalization sketch below).

Q3- Did the batch size make an impact?

> Yes, it did: larger batch sizes result in lower variance (the gradient estimate stays unbiased either way).

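As context for the advantage-centering answers above, here is a minimal standalone sketch of the normalization step the experiments toggle; it is my illustration, not the exact code in `train_pg_f18.py`:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Rescale a batch of advantages to zero mean and unit standard deviation.

    Centering acts like a baseline and rescaling keeps the update magnitude
    stable across iterations, which is where the variance reduction comes from
    in practice.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

print(normalize_advantages([1.0, 2.0, 3.0, 10.0]))
```
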
## Problem 5

### Summary

The figure below was produced with the following command:

```bash
python3 train_pg_f18.py InvertedPendulum-v2 -n 100 -b 5000 -e 5 -rtg --exp_name hc_b5000_r0.0111 --learning_rate 1e-2 --n_layers 2 --size 16
```

![](fig/InvertedPendulum-v2.png)
Binary file added hw2/fig/InvertedPendulum-v2.png
23 changes: 12 additions & 11 deletions hw2/train_pg_f18.py
@@ -207,9 +207,9 @@ def sample_action(self, policy_parameters):
        else:
            sy_mean, sy_logstd = policy_parameters
            # YOUR_CODE_HERE
-            sy_sampled_ac = sy_mean + tf.multipy(tf.math.exp(sy_logstd),
-                                                 tf.random_normal(shape=sy_mean.shape))
-            assert sy_sampled_ac.shape.as_list() == [sy_mean.shape.as_list()]
+            sy_sampled_ac = sy_mean + \
+                tf.math.multiply(tf.math.exp(sy_logstd), tf.random_normal(shape=sy_logstd.shape))
+            assert sy_sampled_ac.shape.as_list() == sy_mean.shape.as_list()
        return sy_sampled_ac
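
For intuition, the fixed lines implement reparameterized sampling from a diagonal Gaussian policy, a = mean + exp(logstd) * z with z ~ N(0, I). A standalone NumPy sketch of the same idea (not part of the diff):

```python
import numpy as np

def sample_gaussian_action(mean, logstd, rng=None):
    """Reparameterized sample from a diagonal Gaussian: a = mean + exp(logstd) * z."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(np.shape(mean))  # z ~ N(0, I), same shape as mean/logstd
    return mean + np.exp(logstd) * z

# Example: one 3-dimensional continuous action.
print(sample_gaussian_action(mean=np.zeros(3), logstd=np.log(0.5) * np.ones(3)))
```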

#========================================================================================#
Expand Down Expand Up @@ -250,9 +250,11 @@ def get_log_prob(self, policy_parameters, sy_ac_na):
            # initialize a single self.ac_dim-variate Gaussian.
            mvn = tf.contrib.distributions.MultivariateNormalDiag(loc=sy_mean,
                                                                  scale_diag=tf.math.exp(sy_logstd))
-            sy_logprob_n = mvn.log_prob(sy_ac_na)
-
-            assert sy_logprob_n.shape.as_list() == sy_mean.shape.as_list()
+            # CORRECTION: negate the log probability so that minimizing the loss
+            # with the Adam optimizer maximizes the policy-gradient objective.
+            sy_logprob_n = -mvn.log_prob(sy_ac_na)
+            assert sy_logprob_n.shape.as_list() == [sy_mean.shape.as_list()[0]]
+            self.sy_logprob_n = sy_logprob_n
        return sy_logprob_n
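
For reference, the diagonal-Gaussian log-density that `MultivariateNormalDiag` evaluates can be written out by hand; a small NumPy check (my sketch, not part of the assignment code):

```python
import numpy as np

def diag_gaussian_logprob(ac, mean, logstd):
    """log N(ac; mean, diag(exp(logstd))^2), summed over action dimensions."""
    std = np.exp(logstd)
    return -0.5 * np.sum(((ac - mean) / std) ** 2 + 2.0 * logstd + np.log(2.0 * np.pi), axis=-1)

# One 2-D action with mean 0 and unit std: log prob should be -log(2*pi) ~ -1.8379.
print(diag_gaussian_logprob(np.zeros(2), np.zeros(2), np.zeros(2)))
```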

def build_computation_graph(self):
@@ -294,7 +296,6 @@ def build_computation_graph(self):
        # Loss Function and Training Operation
        #========================================================================================#
        # YOUR CODE HERE
-        # EXPERIMENT use * instead of tf.multiply operator
        self.loss = tf.reduce_mean(self.sy_logprob_n * self.sy_adv_n)
        self.update_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
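
A quick numeric illustration (standalone NumPy, not the homework code) of why the negated log-probability feeds this loss: minimizing mean(-log pi(a|s) * A) is the same as maximizing the policy-gradient objective mean(log pi(a|s) * A), which is what the Adam minimizer above requires.

```python
import numpy as np

# Toy batch of per-timestep log-probabilities and advantages.
logprob = np.array([-1.2, -0.3, -2.0])
adv = np.array([0.5, -1.0, 2.0])

objective = np.mean(logprob * adv)   # quantity the policy gradient wants to increase
loss = np.mean(-logprob * adv)       # what gets handed to a minimizer

assert np.isclose(loss, -objective)  # minimizing the loss maximizes the objective
print(objective, loss)
```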

@@ -350,11 +351,11 @@ def sample_trajectory(self, env, animate_this_episode):
            #====================================================================================#
            # ----------PROBLEM 3----------
            #====================================================================================#
-            ac = self.sess.run(self.sy_sampled_ac, feed_dict={
-                self.sy_ob_no: ob[None]})  # YOUR CODE HERE
+            # YOUR CODE HERE
+            ac = self.sess.run(self.sy_sampled_ac, feed_dict={self.sy_ob_no: ob[None]})
            ac = ac[0]
            acs.append(ac)
-            ob, rew, done, _ = env.step(ac.squeeze())
+            ob, rew, done, _ = env.step(ac)
            rewards.append(rew)
            steps += 1
            if done or steps > self.max_path_length:
@@ -564,7 +565,7 @@ def update_parameters(self, ob_no, ac_na, q_n, adv_n):
        # YOUR_CODE_HERE
        _, loss, summary = self.sess.run([self.update_op, self.loss, self.merged],
                                         feed_dict={self.sy_ob_no: ob_no,
-                                                    self.sy_ac_na: ac_na.squeeze(),
+                                                    self.sy_ac_na: ac_na,
                                                     self.sy_adv_n: adv_n})

# write logs at every iteration
