Durivaux - final submission #36

Open · wants to merge 4 commits into base: master

4 changes: 2 additions & 2 deletions .gitignore
@@ -1,2 +1,2 @@
__pycache__/

**/__pycache__
Durivaux/.ipynb_checkpoints
35 changes: 35 additions & 0 deletions Durivaux/FlappyAgent.py
@@ -0,0 +1,35 @@
import numpy as np
from collections import deque
from skimage import color, transform
from keras.models import load_model

stackedX = []
call = 0
actions = [119, None]
dqn = load_model('dqn-925k.h5')
# A new action is chosen only every REPEAT calls and repeated in between
REPEAT = 2
lastAction = None

def processScreen(screen):
    """ Crop the playing area, convert to grayscale and resize to 80x80 """
    return 255*transform.resize(color.rgb2gray(screen[60:,25:310,:]), (80,80))

def FlappyPolicy(state, screen):
    global stackedX, call, actions, dqn, lastAction

    screenX = processScreen(screen)

    if call == 0:
        # First call: initialize the frame stack with 4 copies of the current screen
        stackedX = deque([screenX]*4, maxlen=4)
        x = np.stack(stackedX, axis=-1)
    else:
        stackedX.append(screenX)
        x = np.stack(stackedX, axis=-1)

    Q = dqn.predict(np.array([x]))

    # Pick a new action only every REPEAT calls; otherwise repeat the last one
    if call % REPEAT == 0 or REPEAT == 1:
        lastAction = actions[np.argmax(Q)]
    call += 1
    return lastAction
87 changes: 87 additions & 0 deletions Durivaux/Learning_curves.ipynb

Large diffs are not rendered by default.

61 changes: 61 additions & 0 deletions Durivaux/MemoryBuffer.py
@@ -0,0 +1,61 @@
import numpy as np
from collections import deque

class MemoryBuffer:
    """
    An experience replay buffer using numpy arrays
    """
    def __init__(self, length, screen_shape, action_shape):
        self.length = length
        self.screen_shape = screen_shape
        self.action_shape = action_shape
        shape = (length,) + screen_shape
        self.screens_x = np.zeros(shape, dtype=np.uint8)  # starting states
        self.screens_y = np.zeros(shape, dtype=np.uint8)  # resulting states
        shape = (length,) + action_shape
        self.actions = np.zeros(shape, dtype=np.uint8)  # actions
        self.rewards = np.zeros((length,1), dtype=np.float64)  # rewards (was uint8)
        self.terminals = np.zeros((length,1), dtype=bool)  # true if the resulting state is terminal
        self.terminals[-1] = True
        self.index = 0  # points one position past the last inserted element
        self.size = 0  # current size of the buffer

    def append(self, screenx, a, r, screeny, d):
        """ Insert one transition (s, a, r, s', done) at the current write position """
        self.screens_x[self.index] = screenx
        self.actions[self.index] = a
        self.rewards[self.index] = r
        self.screens_y[self.index] = screeny
        self.terminals[self.index] = d
        self.index = (self.index+1) % self.length
        self.size = np.min([self.size+1, self.length])

    def stacked_frames_x(self, index):
        """ Stack the 4 starting frames ending at index, without crossing an episode boundary """
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_x[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if not self.terminals[test_pos]:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def stacked_frames_y(self, index):
        """ Stack the 4 resulting frames ending at index, without crossing an episode boundary """
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_y[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if not self.terminals[test_pos]:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def minibatch(self, size):
        """ Sample a random minibatch of stacked transitions """
        indices = np.random.choice(self.size, size=size, replace=False)
        x = np.zeros((size,)+self.screen_shape+(4,))
        y = np.zeros((size,)+self.screen_shape+(4,))
        for i in range(size):
            x[i] = self.stacked_frames_x(indices[i])
            y[i] = self.stacked_frames_y(indices[i])
        return x, self.actions[indices], self.rewards[indices], y, self.terminals[indices]
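
For context only (this is not part of the submission), a minimal usage sketch of the buffer above; the buffer length is an arbitrary assumption, and the (80, 80) screen shape matches `processScreen` in FlappyAgent.py.

```python
import numpy as np
from MemoryBuffer import MemoryBuffer

# Illustrative sizes; the actual training script is not included in this pull request
buffer = MemoryBuffer(length=100000, screen_shape=(80, 80), action_shape=(1,))

# Store one transition: 80x80 uint8 frames, an action index, a reward and a terminal flag
frame = np.zeros((80, 80), dtype=np.uint8)
next_frame = np.ones((80, 80), dtype=np.uint8)
buffer.append(frame, 0, 1.0, next_frame, False)

# Sample a minibatch of stacked (80, 80, 4) states once enough transitions are stored
x, a, r, y, d = buffer.minibatch(1)
print(x.shape, y.shape)  # (1, 80, 80, 4) (1, 80, 80, 4)
```
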
24 changes: 24 additions & 0 deletions Durivaux/README.md
@@ -0,0 +1,24 @@
# Deep Q-learning for FlappyBird agent

Implementation of a deep Q-learning method for a pixel-based agent with no prior knowledge.

This work is based on Emmanuel Rachelson's Machine Learning classes (ISAE-Supaéro 2017-2018), alongside [this article](https://www.nature.com/articles/nature14236).

This particular implementation has the following features (a minimal sketch of the corresponding update step is given after this list):
* the agent chooses a new action only every 2 frames and repeats it on the following frame
* two neural networks are used: the online network and a target network, with the online weights copied to the target network every 2500 frames
* the network is trained only every 5 frames, to speed up training
* training is done on minibatches of size 32
* replay memory (unlimited)
* an initial pure-exploration phase, followed by epsilon-greedy actions with a decreasing epsilon
* regular backups of the network, so that the best one can be selected afterwards based on the learning curves
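
As a rough illustration of the epsilon-greedy action choice and the target-network update described above, here is a minimal sketch of what such an update step could look like. It is not taken from the actual training script (which is not part of this pull request); the names `dqn`, `dqn_target` and `buffer` (a `MemoryBuffer`), the discount factor and the assumption that actions are stored as indices are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.99           # discount factor (assumed, not specified in this README)
BATCH_SIZE = 32        # minibatch size, as listed above
TRANSFER_EVERY = 2500  # frames between online -> target weight copies

def epsilon_greedy(dqn, x, epsilon, actions=(119, None)):
    """Random action with probability epsilon, otherwise the greedy action of the online network."""
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    Q = dqn.predict(np.array([x]))
    return actions[np.argmax(Q)]

def train_step(dqn, dqn_target, buffer, frame):
    """One Q-learning update on a minibatch, bootstrapping with the target network."""
    x, a, r, y, d = buffer.minibatch(BATCH_SIZE)
    # Bellman target: r + gamma * max_a' Q_target(s', a'), zeroed on terminal transitions
    q_next = dqn_target.predict(y).max(axis=1, keepdims=True)
    targets = dqn.predict(x)
    targets[np.arange(BATCH_SIZE), a.ravel()] = (r + GAMMA * (1.0 - d) * q_next).ravel()
    dqn.train_on_batch(x, targets)
    if frame % TRANSFER_EVERY == 0:
        dqn_target.set_weights(dqn.get_weights())
```
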

# Results

Depending on the parameters, the target score of 15 can be reached in fewer than 200k frames. The solution proposed here was trained for 925k frames but reaches a much higher average score.

![learning](./learning.png)

Computation time, including lengthy evaluation periods: 6.5 hours (i7-4790K, GTX770, 16 GiB of RAM)

Over 100 games: average of 116.16, with a maximum of 466.
Binary file added Durivaux/dqn-925k.h5
Binary file not shown.
Binary file added Durivaux/learning.png
File renamed without changes.
File renamed without changes.
File renamed without changes.
31 changes: 31 additions & 0 deletions Durivaux/run_with_display.py
@@ -0,0 +1,31 @@
from ple.games.flappybird import FlappyBird
from ple import PLE
import numpy as np
from FlappyAgent import FlappyPolicy
import time
game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
# Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for learning, just for display purposes.

p.init()
reward = 0.0

nb_games = 100
cumulated = np.zeros((nb_games))

for i in range(nb_games):
    p.reset_game()

    while not p.game_over():
        state = game.getGameState()
        screen = p.getScreenRGB()
        action = FlappyPolicy(state, screen)

        reward = p.act(action)
        cumulated[i] = cumulated[i] + reward
    print("{}\t{}\t{:.1f}".format(i, int(cumulated[i]), np.mean(cumulated[:i+1])))

average_score = np.mean(cumulated)
max_score = np.max(cumulated)
print()
print("Average: {:.2f}\t Max: {:.0f}".format(average_score, max_score))