【Intro to RL】Action Optimization in the Model-Free Setting
RL agents
The previous section solved the problem of learning to evaluate and predict returns (how to make agents most accurately estimate the value function of a given policy). This section implements an agent that solves the control problem. Two changes are needed:
- Estimate the action-value function Q instead of the state-value function V
- Use the resulting Q values for policy improvement (a minimal sketch follows this list)
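As a tiny illustration of the second change, a greedy policy can be read directly off a Q-table; the array shape below (nine states including terminals, two actions) is an assumption made for the SWS environment used later:

```python
import numpy as np

# Hypothetical Q-table for SWS: 9 states (7 walkable + 2 terminals) x 2 actions (left/right)
Q = np.zeros((9, 2))

# Policy improvement: act greedily with respect to the current Q estimates
greedy_policy = np.argmax(Q, axis=1)   # one action index per state
```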
Most agents gather experience samples
This corresponds to the sampling aspect.
Most agents estimate something
This corresponds to the estimation aspect.
Most agents improve a policy
Do they improve a value function/model, or the policy directly?
Generalized policy iteration
Generalized policy iteration (GPI) is the pattern in which policy evaluation and policy improvement interact repeatedly until the policy becomes optimal. Most RL algorithms follow this pattern.
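A minimal sketch of the GPI pattern, with `evaluate` and `improve` left as abstract callables (both names are placeholders, not taken from any particular library):

```python
def generalized_policy_iteration(pi, evaluate, improve, n_iterations=100):
    """Alternate policy evaluation and policy improvement (the GPI pattern).

    evaluate(pi) -> Q  : any evaluation method (MC, TD, ...)
    improve(Q)   -> pi : e.g. an (epsilon-)greedy policy derived from Q
    """
    Q = None
    for _ in range(n_iterations):
        Q = evaluate(pi)   # policy evaluation: estimate values under the current policy
        pi = improve(Q)    # policy improvement: make the policy greedier w.r.t. Q
    return pi, Q
```

Monte Carlo control, Sarsa, and Q-Learning below are all instances of this loop; they differ only in how the evaluation step estimates Q.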
Improve policies of behavior
This section uses the Slippery Walk Seven (SWS) environment, a stochastic walk with seven non-terminal states, for the discussion.
Monte Carlo control: Improving policies after each episode
In the policy-evaluation phase, use first-visit Monte Carlo (FVMC) prediction; in the policy-improvement phase, use a decaying epsilon-greedy strategy. With this, we have a model-free RL control algorithm.
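A minimal sketch of this MC control loop, assuming a gymnasium-style discrete environment; the function name and hyperparameter values are illustrative, not from the original:

```python
import numpy as np

def fvmc_control(env, gamma=0.99, alpha=0.1, eps=1.0, eps_min=0.05,
                 eps_decay=0.995, n_episodes=3000):
    """First-visit MC control with a decaying epsilon-greedy behavior policy."""
    nS, nA = env.observation_space.n, env.action_space.n
    Q = np.zeros((nS, nA))
    for _ in range(n_episodes):
        # roll out one full episode with the current epsilon-greedy policy
        trajectory, done = [], False
        s, _ = env.reset()
        while not done:
            a = np.random.randint(nA) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            trajectory.append((s, a, r))
            s, done = s_next, terminated or truncated
        # compute returns backwards, then update only the first visit of each (s, a)
        G, returns = 0.0, []
        for _, _, r in reversed(trajectory):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (s, a, _), G in zip(trajectory, returns):
            if (s, a) not in seen:
                seen.add((s, a))
                Q[s, a] += alpha * (G - Q[s, a])
        # decaying epsilon: the behavior policy becomes greedier as training proceeds
        eps = max(eps_min, eps * eps_decay)
    return Q, np.argmax(Q, axis=1)
```

Because the full return G of an episode is needed, updates can only happen after the episode ends, which is exactly what the next section improves on.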
SARSA: Improving policies after each step
Replace MC prediction with the TD method; the policy-improvement phase still uses a decaying epsilon-greedy strategy.
The original temporal-difference method updates the state-value function, $V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$; Sarsa applies the same bootstrapped update to the action-value function $Q(s_t, a_t)$, using the next action $a_{t+1}$ actually chosen by the policy.
- Sarsa is a one-step update algorithm: after every action it performs one update of the value estimates and the policy.
- n-step Sarsa: update only after taking $n$ steps of actions (a sketch of the $n$-step target follows this list).
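A sketch of the n-step Sarsa target, assuming a buffer `rewards` holding the n rewards collected since (s_t, a_t); termination handling is omitted for brevity, and all names are illustrative:

```python
def n_step_sarsa_target(rewards, gamma, Q, s_n, a_n):
    """n-step target: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * Q(s_{t+n}, a_{t+n})."""
    G = 0.0
    for r in reversed(rewards):       # discounted sum of the n intermediate rewards
        G = r + gamma * G
    return G + (gamma ** len(rewards)) * Q[s_n, a_n]
```

The stored pair (s_t, a_t) is then updated exactly as in one-step Sarsa: `Q[s_t, a_t] += alpha * (target - Q[s_t, a_t])`.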
Q-Learning
- Off-policy: the policy being learned differs from the policy that generates the experience.
- Target policy: greedy with respect to the current $Q$, i.e. $\pi(s) = \arg\max_a Q(s, a)$.
- Behavior policy: here an ε-greedy policy is used (see the action-selection sketch below).
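The split between the two policies shows up in action selection. Below is a sketch of an epsilon-greedy selector with a greedy (deterministic) mode, mirroring the `deterministic` flag used in `evaluate_policy` further down; the function itself is an assumption, not code from the reference repo:

```python
import numpy as np

def select_action(Q, s, epsilon, deterministic=False):
    """Greedy action (target policy) when deterministic=True,
    epsilon-greedy action (behavior policy) otherwise."""
    if deterministic or np.random.rand() >= epsilon:
        return int(np.argmax(Q[s]))            # exploit: greedy w.r.t. the current Q
    return int(np.random.randint(Q.shape[1]))  # explore: uniformly random action
```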
Simulation comparison (Sarsa/Q-Learning)
Tested on the gym CliffWalking environment (CliffWalking-v0).
- Test results
Orange is Q-Learning, purple is Sarsa.
Output
......
EnvName: CliffWalking-v0, Seed: 901, Steps: 20000, Episode reward:-13, terminated:True
Q-Learning: model/Q_learning_table.npy saved.
......
EnvName: CliffWalking-v0, Seed: 1681, Steps: 20000, Episode reward:-17, terminated:True
Sarsa: model/Sarsa_Q_table.npy saved.
- Q-value training
# QLearningAgent():
def train(self, s, a, r, s_next, a_next, terminated):
    '''Update Q table (off-policy target: greedy over next-state actions)
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    a_next is accepted for interface parity with SarsaAgent but is not used here.
    '''
    Q_sa = self.Q[s, a]
    # (1 - terminated) zeroes the bootstrap term at terminal states
    target_Q = r + (1 - terminated) * self.gamma * self.Q[s_next].max()
    self.Q[s, a] += self.lr * (target_Q - Q_sa)
# SarsaAgent:
def train(self, s, a, r, s_next, a_next, terminated):
    '''Update Q table (on-policy target: the next action actually taken)
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    '''
    Q_sa = self.Q[s, a]
    # bootstrap from a_next, the action the behavior policy will really execute
    target_Q = r + (1 - terminated) * self.gamma * self.Q[s_next, a_next]
    self.Q[s, a] += self.lr * (target_Q - Q_sa)
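For context, here is a minimal interaction loop that can drive either agent; `env`, `agent`, and `total_steps` are assumed to exist, and the loop shape follows the gymnasium API used in `evaluate_policy` below rather than the reference repo's exact training script:

```python
s, info = env.reset()
a = agent.select_action(s, deterministic=False)
for step in range(total_steps):
    s_next, r, terminated, truncated, info = env.step(a)
    # the behavior policy also picks the next action: Sarsa uses it in its TD target,
    # while the Q-Learning update bootstraps from max_a' Q(s', a') and ignores a_next
    a_next = agent.select_action(s_next, deterministic=False)
    agent.train(s, a, r, s_next, a_next, terminated)
    if terminated or truncated:
        s, info = env.reset()
        a = agent.select_action(s, deterministic=False)
    else:
        s, a = s_next, a_next
```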
- Policy evaluation
def evaluate_policy(env, agent):
    s, info = env.reset()
    done, ep_r, steps = False, 0, 0
    while not done:
        # Take deterministic actions at test time
        a = agent.select_action(s, deterministic=True)
        s_next, r, terminated, truncated, info = env.step(a)
        done = (terminated or truncated)
        ep_r += r
        steps += 1
        s = s_next
    return ep_r, terminated
- Analysis
Because Sarsa is on-policy, it must balance exploration and exploitation with the very policy it is learning, which makes it somewhat "timid" during training. On the cliff-walking task it keeps as far from the cliff edge as possible, so that even a small exploratory step still leaves it in the safe region. To reduce losses during training it exaggerates the danger of the "cliff", and at test time it "dares not" approach the states near the edge, so it struggles to reach the optimal path in this environment...
- The code mainly references Github - DRL-Pytorch, many thanks!