omitedy authored Jun 21, 2020 · 1 parent 45c14ef · commit 38617e4 · Showing 7 changed files with 231 additions and 0 deletions.
### Policy Gradient

#### Parameterized Policies

The policy itself can also be parameterized; the parameterized policy may be deterministic or stochastic.

#### Policy-Based Reinforcement Learning

Advantages

- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Able to learn stochastic policies

Disadvantages

- Usually converges to a local rather than the global optimum
- Evaluating a policy is typically inefficient and has high variance

#### Policy Gradient

Once we know the direction in which the policy improves, we can update the parameters toward it.

#### Policy Gradient in a One-Step MDP

The expected value of the policy.

#### Likelihood Ratio

The expected value of the policy can be rewritten using the likelihood ratio.
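As a reference, the likelihood-ratio (log-derivative) trick in the one-step case can be written as follows, where $J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[r(s, a)]$ is the expected value of the policy; the notation $\pi_\theta$ and $r(s, a)$ is assumed here rather than taken from the notes:

$$
\nabla_\theta J(\theta) = \nabla_\theta \sum_a \pi_\theta(a \mid s)\, r(s, a)
= \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)
= \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)\big]
$$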
#### Policy Gradient Theorem
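The statement of the theorem is not reproduced in the notes; the standard form, with $Q^{\pi_\theta}(s, a)$ denoting the action-value function of the current policy (notation assumed), is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]
$$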
#### Monte Carlo Policy Gradient (REINFORCE)

Update the parameters by stochastic gradient ascent.

Use the policy gradient theorem.

The cumulative return $G_t$ can be used as an unbiased estimate of the action value.
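A minimal REINFORCE update sketch in PyTorch; the framework, network shape, and hyperparameters are assumptions, not taken from the notes:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Softmax policy over discrete actions, parameterized by a small MLP.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One gradient-ascent step on a single sampled episode.

    states: list of 1-D float tensors; actions: list of ints; rewards: list of floats.
    """
    # Discounted returns G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    logits = policy(torch.stack(states))                       # (T, num_actions)
    log_probs = Categorical(logits=logits).log_prob(torch.tensor(actions))

    # Maximize E[G_t * log pi(a_t | s_t)]  ->  minimize the negative.
    loss = -(returns * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```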
#### Puck World Example
#### Softmax Stochastic Policy
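This heading is left empty in the notes; a standard softmax (Boltzmann) policy with a linear score over features $\phi(s, a)$ (the feature notation is an assumption) and its score function are:

$$
\pi_\theta(a \mid s) = \frac{\exp\big(\phi(s, a)^\top \theta\big)}{\sum_{a'} \exp\big(\phi(s, a')^\top \theta\big)}, \qquad
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \mathbb{E}_{a' \sim \pi_\theta}\big[\phi(s, a')\big]
$$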
### Actor-Critic

Algorithms that combine value functions with policy gradients.

#### Problems with REINFORCE

- High variance during training (the most serious flaw)
- Low data efficiency
- Restricted to episodic tasks

#### The Actor-Critic Idea

The REINFORCE policy gradient method uses Monte Carlo sampling to directly estimate the value $G_t$ of $(s_t, a_t)$.

Why not build a trainable value function $Q_\phi$ to carry out this estimation instead?

Actor ($\pi_\theta(a \mid s)$): the policy, which takes actions that satisfy the critic.

Critic ($Q_\phi(s, a)$): a value function that learns to accurately estimate the value of the actions taken by the actor's policy.
#### Actor-Critic Training

Critic

- Learns to accurately estimate the action values under the current actor policy

Actor

- Learns to take actions that satisfy the critic
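A minimal sketch of one actor-critic update on a single transition, again in PyTorch; the networks, sizes, and hyperparameters are assumptions rather than part of the notes:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_obs, n_actions = 4, 2                                                             # assumed sizes
actor = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))    # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))   # Q_phi(s, .)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_update(s, a, r, s_next, a_next, done):
    """One update on a single transition; s, s_next are 1-D float tensors, a, a_next are ints."""
    # Critic: regress Q_phi(s, a) toward the one-step TD target.
    q_sa = critic(s)[a]
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)[a_next] * (1.0 - done)
    critic_loss = (td_target - q_sa).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the log-probability of actions the critic scores highly.
    log_prob = Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -critic(s)[a].detach() * log_prob
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```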
#### A2C: Advantage Actor-Critic

Idea

- Standardize the critic's score by subtracting a baseline function
- This gives more informative guidance: it lowers the probability of worse actions and raises the probability of better ones

Advantage function

The state-action value function and the state value function.

Fit the state value function, and use it to form the advantage function.
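Concretely (standard identities, with notation assumed from context): the advantage compares the action value against the state-value baseline, and a single learned $V_\phi$ suffices in practice because the one-step TD error estimates it:

$$
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s), \qquad A^{\pi}(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
$$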
### Deep Reinforcement Learning

#### Value and Policy Approximation

Approximate the state value function and the state-action value function.

#### End-to-End Reinforcement Learning

End-to-end: deep reinforcement learning dispenses with hand-crafted feature selection and directly outputs, for example, a probability distribution over actions.

Deep neural networks are used to approximate value functions and policies.
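For instance, a minimal neural function approximator of the kind meant here (a PyTorch MLP; the framework and layer sizes are assumptions):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, n_obs: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)   # shape: (..., n_actions)
```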
#### Key Changes Brought by Deep RL

- Value functions and policies are now deep neural networks
- A very high-dimensional parameter space
- Hard to train stably
- Prone to overfitting
- Requires large amounts of data
- Requires more computing power
- Balancing CPU and GPU: CPUs collect experience data, GPUs train the networks

#### A Taxonomy of Deep RL Methods

- Value-based methods
- Methods based on stochastic policies
- Methods based on deterministic policies
### Deep Q-Network

#### Deep Q-Network

Intuitive idea

- Use a neural network to approximate $Q_\theta(s, a)$
- The resulting algorithm is unstable

Remedies

- Experience replay
- A two-network structure: an evaluation (online) network and a target network
#### Experience Replay

Store every step encountered during training in a buffer, and sample from it uniformly (a replay-buffer sketch follows below).

Prioritized experience replay

- A priority measure for each transition
- The probability of being sampled
- Importance sampling
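A minimal uniform replay buffer sketch (plain Python; the capacity and field names are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions with uniform sampling."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, k: int):
        batch = random.sample(self.buffer, k)  # uniform, without replacement
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```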
#### Two-Network Structure

Target network $Q_{\theta^-}(s, a)$

- Uses older parameters; synchronized with the training network's parameters every C steps
#### Algorithm

1. Collect data: explore with an $\epsilon$-greedy policy and put the resulting transitions into the replay buffer
2. Sample: draw k transitions from the buffer
3. Update the network
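A sketch of step 3, the network update against a target network, continuing the PyTorch and replay-buffer assumptions from the earlier sketches:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, buffer, k=32, gamma=0.99):
    """One DQN gradient step on a uniformly sampled minibatch."""
    states, actions, rewards, next_states, dones = buffer.sample(k)
    states      = torch.as_tensor(states, dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)

    # Q_theta(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions).squeeze(1)

    # TD target computed with the (frozen) target network.
    with torch.no_grad():
        max_q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_q_next * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C steps: target_net.load_state_dict(q_net.state_dict())
```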
#### Experimental Results in the Atari Environment
#### Double DQN

Introduced to address DQN's overestimation problem, in which Q-values sometimes become very large.

The max operator drives the Q-function's values ever higher, even above the true values.

Double DQN uses different networks for action selection and value estimation.
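In the usual formulation, the online network $Q_\theta$ selects the action and the target network $Q_{\theta^-}$ evaluates it (notation as in the two-network section above):

$$
y = r + \gamma\, Q_{\theta^-}\big(s',\, \arg\max_{a'} Q_\theta(s', a')\big)
$$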
#### An Overestimation Example and Results in the Atari Environment
#### Dueling DQN

#### Network Structure
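The notes do not spell the structure out; a common sketch of the dueling head, which splits the shared representation into a state-value stream $V$ and an advantage stream $A$ and recombines them (layer sizes are assumptions), is:

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Shared trunk, then separate value and advantage streams."""
    def __init__(self, n_obs: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # Subtract the mean advantage so that V and A are identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```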
#### Advantages

- Handles states in which the choice of action matters little
- Learning the state value function is more effective: one state value function corresponds to multiple advantage values

#### Experimental Results in the Atari Environment
### A3C: The Asynchronous A2C Method

A3C stands for Asynchronous Advantage Actor-Critic.

- Asynchronous: the algorithm runs a set of environments in parallel
- Advantage: the policy gradient update uses the advantage function
- Actor-Critic: it is an actor-critic method, updating the policy with the help of a learned state value function

Network architecture diagram

The A3C algorithm
### Deterministic Policy Gradient

#### Stochastic vs. Deterministic Policies

Stochastic policies

- For discrete actions
- For continuous actions

Deterministic policies

- For discrete actions
- For continuous actions

#### Deterministic Policy Gradient

A critic module that estimates the state-action value.

A deterministic policy.
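The deterministic policy gradient, stated here in the standard form with deterministic policy $\mu_\theta$ and critic $Q_\phi$ (notation assumed), differentiates through the critic at the action chosen by the policy:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)}\Big]
$$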
#### Deterministic Policy Gradient: Experimental Results
### Deep Deterministic Policy Gradient

In practice, actor-critic methods with neural-network function approximators are unstable on challenging problems.

Deep Deterministic Policy Gradient (DDPG) builds on the deterministic policy gradient and addresses this with:

- Experience replay (off-policy)
- Target networks (a soft-update sketch follows below)
- Normalizing the Q-network before the action input
- Adding continuous noise for exploration
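A minimal sketch of two of these ingredients, the soft (Polyak) target update and exploration noise, in PyTorch; the update rate tau and noise scale are assumed values, and Gaussian noise is used here as a simpler substitute for the Ornstein-Uhlenbeck noise of the original DDPG paper:

```python
import torch

def soft_update(target_net, online_net, tau: float = 0.005):
    """Polyak-average the online parameters into the target network."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

def noisy_action(actor, state, noise_std: float = 0.1, low: float = -1.0, high: float = 1.0):
    """Deterministic action from the actor plus Gaussian exploration noise, clipped to bounds."""
    with torch.no_grad():
        action = actor(state)
    action = action + noise_std * torch.randn_like(action)
    return action.clamp(low, high)
```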
#### DDPG Training Pseudocode
#### DDPG Experiments

Target networks are essential.