# Policy_gradient_breakout.py
import numpy as np
import tensorflow as tf
import gym
import pylab as pl
""""
Basic explanation of Policy Gradient:
In policy gradient, neural net outputs a probability for each available action.
The idea of this method is to train neural net so that probabilities of actions
after which you've gotten reward in your past experiences are increased.
For example, the player is in state 'A' (pixel values in this example)
and he randomly picked action 3 from [1, 2, 3] set of actions.
He got lucky and this action yielded him a reward of 1! Wohoo!
We've collected this data and now we'll train the network to increase the
probability of action 3 whatever we are in a state similar or equal to 'A'.
That's the basic idea. Here's a short overview of the main loop.
First you play the game:
1. Pre-process the state (observation) so it is ready to feed to the neural net.
2. Input the observation into the neural net and sample an action from its output.
By sample, I mean draw an action from the action set [1, 2, 3] when the
probabilities of the actions are e.g. [0.2, 0.3, 0.5]. A simple stochastic process.
3. Take that action and observe the results: state, reward, ...
4. Create the 'correct' output value (ylabel) of the neural net for this state.
This step is a little tricky to understand at first. The 'correct' output is
always the one you've sampled from the neural net. E.g. if action 3 was sampled,
then the correct output is [0, 0, 1], where the third column corresponds to action 3
(see the small numeric sketch right after this docstring).
5. Save the data: the state, the 'correct' output and the reward you've got.
This is what you do after each frame. Now repeat this for a while (one game or a few,
however you like) to collect enough data for one training iteration.
After you've collected enough data, do one iteration of training:
1. Construct the gradient loss vector. This step is the core of the policy gradient method.
It is simply the vector of discounted rewards collected during the game.
You can apply additional transformations, e.g. normalize and/or shift this vector to improve results.
2. Adjust the errors of the net's final layer. Before backprop, you multiply the errors of
your network's final layer by this vector of discounted rewards. In turn, this
increases the errors for the actions that led up to a reward.
3. Do optimization. After adjusting the errors and performing optimization using backprop, the
probability of a sequence of actions which led to a reward will increase.
4. Clear all the data that you've just trained with and play the game again to collect new data.
That's it.
This code is heavily based on the work of these guys (it helped me to understand Policy Gradient big time):
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
https://gist.github.com/greydanus/5036f784eec2036252e1990da21eda18
"""
# functions
def prep_observation(observation, zeros_and_ones=False):
    obs_2d = observation[:, :, 0] # from RGB to R
    obs_2d_cut = obs_2d[93:193, 8:152] # Specific to Breakout: whole space 33:193, 8:152 ; not including bricks 93:193, 8:152
    obs_2d_cut_ds = obs_2d_cut[::2, ::2] # downsample by 2: a b c d e f -> a c e
    if zeros_and_ones: obs_2d_cut_ds[obs_2d_cut_ds != 0] = 1 # convert to 0's and 1's
    obs_1d = np.ndarray.flatten(obs_2d_cut_ds, order='C') # flatten the 2d array into a 1d array
    return obs_1d
def forward_propagation(input, O):
    # specify NN
    # in this case: 1 hidden layer with relu; softmax output
    l1 = tf.nn.relu(tf.matmul(input, O['w1']))
    l2 = tf.nn.softmax(tf.matmul(l1, O['w2']))
    return l2
def discount_and_normalize(input, discount_const, normalized, frame_shift):
    # Discounting
    data_disc = np.zeros_like(input).astype(float)
    data_disc[len(input) - 1] = input[len(input) - 1] # no discount for last reward
    for i in reversed(range(0, len(data_disc) - 1)):
        if input[i] != 0: # keep original value if non-zero
            data_disc[i] = input[i]
        else: # this value = the next value * discount
            data_disc[i] = data_disc[i + 1] * discount_const
    # Normalizing
    if normalized:
        data_disc = (data_disc - np.mean(data_disc)) / (np.std(data_disc) + 1e-10)
    # shifting to account for the reward arriving a few frames after the good actions (a crude way to tackle the credit assignment problem)
    if frame_shift:
        length = len(data_disc)
        data_disc_shifted = np.zeros_like(data_disc)
        data_disc_shifted[0:length-frame_shift, ] = data_disc[-(length-frame_shift):, ] # shift by frame_shift frames
        data_disc = data_disc_shifted # rename for easy return
    return data_disc
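# Illustration only (made-up rewards): with discount_const=0.99, normalized=False, frame_shift=0,
# discount_and_normalize([0, 0, 1, 0, 1], 0.99, False, 0) -> [0.9801, 0.99, 1.0, 0.99, 1.0]
# i.e. frames right before a reward get credit close to 1, earlier frames exponentially less.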
def plot_pixels(observation): # used to plot what the input to the nn looks like
    pl.imshow(observation.reshape(50, 72)) # whole space 80, 120 ; not including bricks 50, 72
# hyperparameters
learning_rate = 0.002
reward_shift = 10 # to account for lagging reward (frames)
reward_discount = 0.99
obs_discount = 0.8 # discount last frame to account for velocity of objects (used in running_frame)
training_batch_size = 5 # number of games to play before performing one optimization step
neurons = 32 # single hidden layer with that many neurons
input_size = 3600 # number of pixels fed into the nn: prep_observation returns 50 x 72 = 3600 values
actions = [0, 1, 2] # true env actions are 1, 2, 3; these indices are used for sampling and labelling
session_name = 'LR %.4f, RS %d, RD %.2f, TB %d, N %d' % (learning_rate, reward_shift, reward_discount, training_batch_size, neurons) # filename for saving
# initialize tf placeholders
x = tf.placeholder(shape=[None, input_size], dtype=tf.float32)
y = tf.placeholder(shape=[None, len(actions)], dtype=tf.float32)
gradient_loss = tf.placeholder(shape=[None, 1], dtype=tf.float32)
# initialize weights using xavier initialization:
# http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
# https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
O = {'w1': tf.Variable(tf.truncated_normal(shape=[input_size, neurons], mean=0, stddev=1/np.sqrt(input_size), dtype=tf.float32)),
'w2': tf.Variable(tf.truncated_normal(shape=[neurons, len(actions)], mean=0, stddev=1/np.sqrt(neurons), dtype=tf.float32))}
# Run this to load old weights if resuming training
# Trained_weights = np.load('Breakout_v1_trained_weights' + session_name + '.npy').item()
# O['w1'] = tf.Variable(Trained_weights['w1'])
# O['w2'] = tf.Variable(Trained_weights['w2'])
# initialize other variables
x_batch, y_batch, reward_batch, games_played, running_reward, running_reward_hist = [], [], [], 0, None, []
pl.ioff() # turn off interactive plotting in the IDE (save figures to a file instead); pl.ion() to reverse
yprob = forward_propagation(x, O)
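# shape check (with the hyperparameters above): x is [None, 3600], w1 is [3600, 32], w2 is [32, 3],
# so yprob is a [None, 3] row of softmax probabilities, one per action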
cost = tf.nn.l2_loss(y - yprob)
# initialize optimizer
# https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients = optimizer.compute_gradients(loss=cost, var_list=tf.trainable_variables(), grad_loss=gradient_loss) # gradients computed with adjusted last layer's errors (using rewards)
train_step = optimizer.apply_gradients(gradients)
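# note (my understanding of the TF1 API): grad_loss is forwarded as the initial backprop error (grad_ys),
# so each row of gradient_loss scales that frame's error (y - yprob) before the weight update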
# initialize gamespace
env = gym.make("Breakout-v0")
observation = env.reset()
running_observation = prep_observation(observation, zeros_and_ones=True)
# plot_pixels(prep_observation(observation)) # run this to see how single frame fed to nn looks like
# initialize tf session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# tf.reset_default_graph() # run this to clear all global variables
# playing / learning loop
while True:
    # env.render()
    # preprocess observation
    modified_observation = prep_observation(observation, zeros_and_ones=True) # alter image: crop, downsample, flatten etc.
    # create running frame (input to nn)
    running_observation = modified_observation + running_observation * obs_discount
    running_observation[running_observation < 0.5] = 0 ; running_observation[running_observation > 1.0] = 1 # trim
    # plot_pixels(running_observation) ; pl.pause(1e-10) # run this to plot the running frame on every frame. Beware of lag.
    # sample an action from the network's policy (using the running frame built from preprocessed observations)
    output = sess.run(yprob, feed_dict={x: np.stack([running_observation])})[0]
    action = np.random.choice(a=len(actions), p=output)
    # step the environment and get a new observation
    observation, reward, done, info = env.step(action + 1) # +1 for actions 1, 2, 3 instead of 0, 1, 2
    # create the 'correct' nn output (the correct one is always the one that was sampled)
    ylabel = np.zeros_like(output) ; ylabel[action] = 1
    # record game history
    x_batch.append(running_observation)
    y_batch.append(ylabel)
    reward_batch.append(reward)
    if done: # end of collecting data for 1 game
        # reset game and observations
        observation = env.reset() ; running_observation = prep_observation(observation, zeros_and_ones=True)
        games_played += 1
        if games_played % training_batch_size == 0: # perform 1 optimization step after playing x games (a single batch)
            # optimization
            gradient_loss_batch = discount_and_normalize(reward_batch, discount_const=reward_discount, normalized=True, frame_shift=reward_shift)
            sess.run([train_step], feed_dict={x: np.stack(x_batch),
                                              y: np.stack(y_batch),
                                              gradient_loss: np.stack([gradient_loss_batch]).T})
            # print intermediate results
            average_reward = np.divide(np.sum(reward_batch), training_batch_size) # average reward over a batch of games
            running_reward = 0.95 * running_reward + 0.05 * average_reward if running_reward is not None else average_reward
            running_reward_hist.append(running_reward)
            print('Running reward: %.2f Average batch reward: %.2f Games played: %d' % (running_reward, average_reward, games_played))
            # reset batch memory and save
            x_batch, y_batch, reward_batch = [], [], []
            # save stuff
            pl.plot(running_reward_hist)
            pl.savefig('Running_reward_hist' + session_name + '.png') # look at the progress by opening this file
            # np.save('Breakout_v1_trained_weights' + session_name, sess.run(O))
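            # To resume training later: uncomment the np.save line above so the weights are stored every batch,
            # then uncomment the 'load old weights' block near the top of the file before re-running this script.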