[RL Intro] Action Optimization in the Model-free Setting

RL agents

The previous note in this series addressed the prediction problem of learning to evaluate and predict returns (how to make agents most accurately estimate the value function of a given policy). This section implements agents that can solve the control problem, which requires two changes:

  1. Estimate the action-value function Q(s, a) instead of the state-value function V
  2. Use the resulting Q estimates for policy improvement (a sketch follows this list)
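
As a concrete illustration of the second change, here is a minimal sketch (not from the original note) of how a tabular Q estimate induces both the greedy policy used for improvement and the decaying epsilon-greedy policy used for exploration; the array shapes, function names, and the decay schedule are assumptions.

```python
import numpy as np

def greedy_policy(Q):
    """Policy improvement: act greedily w.r.t. the current Q table
    (Q has shape [n_states, n_actions])."""
    return np.argmax(Q, axis=1)

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Behavior policy: random action with probability epsilon, greedy otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[state]))            # exploit

def decay_epsilon(epsilon, min_epsilon=0.01, decay_rate=0.995):
    """Exponential schedule for the 'decaying epsilon-greedy' strategy."""
    return max(min_epsilon, epsilon * decay_rate)
```
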
Most agents gather experience samples

This is the sampling aspect.

Most agents estimate something

This is the estimation aspect.

Most agents improve a policy
Should we optimize a value function or a model, or optimize the policy directly?

Generalized policy iteration

Generalized policy iteration (GPI) is the pattern in which policy evaluation and policy improvement interact repeatedly until the policy is optimal. Most RL algorithms follow this pattern.
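
To make the pattern concrete, here is a minimal, hypothetical sketch of the GPI loop; `evaluate` and `improve` stand in for whatever concrete choices an algorithm makes (MC or TD prediction, greedy or epsilon-greedy improvement).

```python
import numpy as np

def generalized_policy_iteration(evaluate, improve, Q, pi, max_iterations=100):
    """Sketch of GPI: alternate (possibly partial) policy evaluation and
    policy improvement until the policy stops changing."""
    for _ in range(max_iterations):
        Q = evaluate(pi, Q)             # policy evaluation (can be partial/noisy)
        new_pi = improve(Q)             # policy improvement, e.g. greedy in Q
        if np.array_equal(new_pi, pi):  # stable policy -> stop
            break
        pi = new_pi
    return Q, pi
```
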

Improve policies of behavior

This section uses the Slippery Walk Seven (SWS) environment for the discussion.
img-20241120160859277|500

Monte Carlo control: Improving policies after each episode

In the policy-evaluation phase we use first-visit Monte Carlo (FVMC) prediction; in the policy-improvement phase we use a decaying epsilon-greedy strategy. Together these give a complete model-free RL algorithm.
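
As a sketch of what one episode of this scheme looks like (names such as `generate_trajectory` and the incremental-mean update are illustrative assumptions, not the book's code):

```python
import numpy as np

def mc_control_episode(Q, counts, generate_trajectory, epsilon, gamma=0.99):
    """First-visit MC control for one episode: evaluate Q from the completed
    episode; the epsilon-greedy policy derived from Q is thereby improved
    for the next episode."""
    trajectory = generate_trajectory(Q, epsilon)      # list of (s, a, r) tuples
    # Index of the first visit of each (s, a) pair in this episode.
    first_visit = {}
    for t, (s, a, _) in enumerate(trajectory):
        first_visit.setdefault((s, a), t)
    G = 0.0
    for t in reversed(range(len(trajectory))):        # accumulate returns backwards
        s, a, r = trajectory[t]
        G = r + gamma * G
        if first_visit[(s, a)] == t:                  # update only at the first visit
            counts[s, a] += 1
            Q[s, a] += (G - Q[s, a]) / counts[s, a]   # incremental mean
    return Q, counts
```
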

SARSA: Improving policies after each step

TD prediction replaces MC prediction, while the policy-improvement phase still uses a decaying epsilon-greedy strategy.
The temporal-difference update that previously targeted V now targets Q; because the bootstrap action a_{t+1} is the one actually chosen by the behavior policy, SARSA is on-policy.
img-20241120161435164|500

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
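
A minimal sketch of one SARSA episode under these assumptions (a Gymnasium-style environment API and an inline epsilon-greedy behavior policy); note that $a_{t+1}$ is sampled from the same behavior policy before the update, which is what makes the method on-policy.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.5, gamma=0.99, epsilon=0.1, rng=None):
    """Run one on-policy SARSA episode, updating Q after every step."""
    rng = rng or np.random.default_rng()

    def act(s):  # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    s, _ = env.reset()
    a = act(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        a_next = act(s_next)                          # action actually taken next
        target = r + (0.0 if terminated else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
        done = terminated or truncated
    return Q
```
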

Q-Learning

Q-learning keeps the same step-wise update but replaces $Q(s_{t+1}, a_{t+1})$ in the TD target with the greedy value $\max_a Q(s_{t+1}, a)$, which makes it an off-policy algorithm:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

img-20241211160330680|300

Simulation comparison (SARSA vs. Q-learning)

The two agents are compared on the gym CliffWalking environment.

import numpy as np

# QLearningAgent:
def train(self, s, a, r, s_next, a_next, terminated):
	'''Update Q table (off-policy target)
	Q <-- Q + alpha*[r + gamma*max_a' Q(s',a') - Q]
	a_next is ignored; it is kept only so both agents share the same interface.
	'''
	Q_sa = self.Q[s, a]
	target_Q = r + (1 - terminated) * self.gamma * np.max(self.Q[s_next])
	self.Q[s, a] += self.lr * (target_Q - Q_sa)

# SarsaAgent:
def train(self, s, a, r, s_next, a_next, terminated):
	'''Update Q table (on-policy target)
	Q <-- Q + alpha*[r + gamma*Q(s',a') - Q]
	'''
	Q_sa = self.Q[s, a]
	target_Q = r + (1 - terminated) * self.gamma * self.Q[s_next, a_next]
	self.Q[s, a] += self.lr * (target_Q - Q_sa)

# Evaluation helper:
def evaluate_policy(env, agent):
	'''Roll out one greedy (deterministic) episode; return its total reward
	and whether it ended in a terminal state.'''
	s, info = env.reset()
	done, ep_r, steps = False, 0, 0
	while not done:
	    # Take deterministic actions at test time
	    a = agent.select_action(s, deterministic=True)
	    s_next, r, terminated, truncated, info = env.step(a)
	    done = (terminated or truncated)
	
	    ep_r += r
	    steps += 1
	    s = s_next
	return ep_r, terminated
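
The `train` methods above are only the update rules; a driver loop is needed to generate experience and feed it to them. Below is a hypothetical sketch of such a loop for the CliffWalking comparison, assuming the Gymnasium API and that both agents expose the `select_action(s, deterministic=...)` method used in `evaluate_policy`; the constructor details and episode count are illustrative.

```python
import gymnasium as gym

def train_agent(agent, env_id="CliffWalking-v0", n_episodes=500):
    """Drive SarsaAgent / QLearningAgent: collect (s, a, r, s', a') transitions
    with the agent's exploratory policy and pass them to agent.train()."""
    env = gym.make(env_id)
    for _ in range(n_episodes):
        s, _ = env.reset()
        a = agent.select_action(s, deterministic=False)        # exploratory action
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            a_next = agent.select_action(s_next, deterministic=False)
            agent.train(s, a, r, s_next, a_next, terminated)
            s, a = s_next, a_next
            done = terminated or truncated
    env.close()
    return agent
```

The same driver works for both agents: the on-policy/off-policy difference lives entirely inside each agent's `train` method, i.e. whether the target uses `a_next` or the greedy maximum.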
