【RL Basics】Policy Evaluation in the Model-free Setting
Learning to estimate the value of policies
The most direct idea: repeatedly run the policy in the environment, then average the returns obtained.
Reward: the one-step reward signal the agent receives.
Return: the total discounted reward.
Value function: the expectation (or average) of the return. What the agent ultimately tries to optimize is the expected total discounted reward, i.e., the value function.
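In symbols (standard notation; the original formulas are not reproduced in these notes):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
$$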
The random walk environment
- In the random walk (RW) environment, the agent moves left with 50% probability and right with 50% probability, regardless of the action it takes.
Alternatively, the RW environment can be viewed as deterministic, with the agent choosing actions uniformly at random; this is the policy we evaluate.
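For reference, here is a minimal Python sketch of such a random walk environment; the number of non-terminal states (5) and the +1 reward at the rightmost terminal state are assumptions, not details given above.

```python
import random

class RandomWalk:
    """Minimal random walk environment: a chain of states where the two
    ends are terminal and the agent drifts left/right with probability
    0.5 each, regardless of the chosen action."""

    def __init__(self, n_states=7):        # 7 = 5 non-terminal + 2 terminal (assumed size)
        self.n_states = n_states
        self.reset()

    def reset(self):
        self.state = self.n_states // 2    # start in the middle state
        return self.state

    def step(self, action=None):
        # The action is ignored: 50% left, 50% right.
        self.state += 1 if random.random() < 0.5 else -1
        done = self.state in (0, self.n_states - 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0  # +1 only at the right end (assumed)
        return self.state, reward, done
```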
The Monte Carlo algorithm
Run the policy for several episodes to collect multiple trajectories, then estimate the value function by averaging, for each state, the returns obtained from that state.
- experience tuple: (S_t, A_t, R_{t+1}, S_{t+1})
- trajectory: a sequence of experiences
How should the returns be counted? (A single trajectory may contain multiple visits to the same state.)
- First-visit Monte Carlo (FVMC): for each state, use only the return following the first visit.
- Every-visit Monte Carlo (EVMC): average the returns following every visit to the state.
The more “standard” version is FVMC, and its convergence properties are easy to justify because each first-visit return is an IID sample of the true value v_π(s). Fortunately, EVMC has also been proven to converge given infinite samples.
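A minimal sketch of first-visit Monte Carlo prediction in Python, assuming a tabular setting and the `reset()`/`step()` interface of the random-walk sketch above (the helper names are mine, not from the original):

```python
from collections import defaultdict

def fvmc_prediction(env, policy, gamma=0.99, n_episodes=500):
    """First-visit Monte Carlo prediction: for every episode, average the
    return that follows the FIRST visit to each state."""
    V = defaultdict(float)
    visit_counts = defaultdict(int)

    for _ in range(n_episodes):
        # Roll out one full trajectory of (state, reward) pairs.
        trajectory, state, done = [], env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            trajectory.append((state, reward))
            state = next_state

        # Walk backwards, accumulating the discounted return from each step.
        G, first_visit_return = 0.0, {}
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            first_visit_return[state] = G   # overwritten until the earliest visit remains

        # Running-average update of each state's value with its first-visit return.
        for state, G in first_visit_return.items():
            visit_counts[state] += 1
            V[state] += (G - V[state]) / visit_counts[state]
    return V
```

For the random walk above any policy will do, since actions are ignored, e.g. `V = fvmc_prediction(RandomWalk(), policy=lambda s: 0)`. Every-visit MC would instead average G at every occurrence of the state rather than keeping only the earliest one.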
Incremental methods: methods that iteratively refine their estimates. Dynamic programming is an incremental method, but it does not interact with the environment; RL methods are also incremental.
Sequential methods: methods that learn in environments with more than one non-terminal state. Dynamic programming is a sequential method; bandits are not.
Trial-and-error methods: methods that learn from interaction with the environment. Dynamic programming is not trial-and-error learning; bandits and RL are.
Temporal-difference learning: Improving estimates after each step
Although the actual return used by MC is a very accurate (unbiased) estimate, it has high variance and is sample-inefficient. The core idea of the TD algorithm: bootstrapping, i.e., using a one-step return.
- recursive-style return: G_t = R_{t+1} + γ·G_{t+1}
【TD vs. DP】
DP methods bootstrap on the one-step expectation while TD methods bootstrap on a sample of the one-step expectation.
- The TD target, R_{t+1} + γ·V(S_{t+1}), is a biased estimate of the true state-value function v_π(S_t), but it also has a much lower variance than the actual return.
- The TD algorithm is the forerunner of many algorithms, such as SARSA, Q-learning, double Q-learning, and DQN.
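A minimal TD(0) prediction sketch under the same assumed interface; the TD target r + γ·V(s') replaces the full return:

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.99, alpha=0.05, n_episodes=500):
    """TD(0) prediction: after every single step, nudge V(s) toward the
    TD target r + gamma * V(s'), bootstrapping on the current estimate."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + gamma * V[next_state] * (not done)  # no bootstrapping past terminal states
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```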
MC vs. TD
The comparison shows only first-visit Monte Carlo prediction (FVMC) and temporal-difference learning (TD).
- MC estimates are noisy; TD estimates are off-target
Learning to estimate from multiple steps
N-step TD learning
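N-step TD sits between TD(0) and MC: collect n real rewards, then bootstrap. In standard notation (an assumption here, since no formula is given above), the n-step return used as the update target is:

$$
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
$$

With n = 1 this is exactly the TD(0) target, and letting n run to the end of the episode recovers the MC return.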
Forward-view TD(λ): Improving estimates of all visited states
Forward-view TD(λ) combines all the n-step returns into a single update target, weighting each n-step return by an exponentially decaying factor of λ.
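Written out (again standard notation, assumed rather than quoted), the λ-return that the forward view targets is:

$$
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_{t:t+n}
$$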
TD(λ): Improving estimates of all visited states after each step
Keep track of which states are eligible for an update at each step (via eligibility traces), and quantify how much each of them should be updated.
When λ = 0, TD(λ) reduces to one-step TD, i.e., TD(0); when λ = 1, TD(1) is (in theory) equivalent to MC.
In reality, TD(1) is equal to MC only with offline updates, that is, when the updates are accumulated and applied at the end of the episode.
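A minimal sketch of backward-view TD(λ) with accumulating eligibility traces, under the same tabular assumptions and hypothetical `reset()`/`step()` interface as the earlier sketches:

```python
from collections import defaultdict

def td_lambda_prediction(env, policy, gamma=0.99, alpha=0.05,
                         lambda_=0.5, n_episodes=500):
    """Backward-view TD(lambda): an eligibility trace marks how 'eligible'
    each visited state is for the current TD error."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        traces = defaultdict(float)            # eligibility trace per state
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            traces[state] += 1.0               # accumulating trace for the visited state
            for s in traces:                   # every eligible state shares the update
                V[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lambda_   # decay eligibility over time
            state = next_state
    return V
```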
Simulation
Russell and Norvig’s Gridworld environment
Russell and Norvig’s Gridworld (RNG) is a 3 × 4 grid world in which the agent starts at the bottom-left corner and has to reach the top-right corner.