[RL Intro] Model-Free Policy Evaluation

In machine learning, the saying goes, “The model is only as good as the data.” In RL, I say, “The policy is only as good as the estimates,” or, in more detail, “The improvement of a policy is only as good as the accuracy and precision of its estimates.”

Learning to estimate the value of policies

The most straightforward idea: run the policy in the environment repeatedly, then average the returns obtained.

The random walk environment
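The note does not spell the environment out at this point, but the usual random-walk chain (as in Sutton and Barto) is a single row of states with a start in the middle, transitions that go left or right with equal probability regardless of the action, and a +1 reward only on reaching the right terminal state. The sketch below assumes exactly that; the class name and the chain length are illustrative, not taken from this note.

```python
import numpy as np

class RandomWalk:
    """A minimal random-walk chain (illustrative assumptions, not from this note):
    states 0..n_states-1 laid out in a line, with 0 and n_states-1 terminal and the
    start in the middle; actions are ignored and the walker drifts left or right with
    probability 0.5 each; only reaching the rightmost terminal state pays a reward of +1."""

    def __init__(self, n_states=7, seed=0):
        self.n_states = n_states
        self.start = n_states // 2
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.start
        return self.s

    def step(self, action=None):
        self.s += self.rng.choice((-1, 1))              # the action has no effect
        reward = 1.0 if self.s == self.n_states - 1 else 0.0
        done = self.s in (0, self.n_states - 1)
        return self.s, reward, done, {}
```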

The Monte Carlo algorithm

Run the policy for several episodes to collect multiple trajectories, then estimate the value function by averaging the returns observed for each state.

$$V_{T}(S_t) = V_{T-1}(S_t) + \alpha_t \Big[ \underbrace{\overbrace{G_{t:T}}^{\text{MC target}} - V_{T-1}(S_t)}_{\text{MC error}} \Big]$$

img-20241119143523624|500
img-20241119144125288|500

The more “standard” version is first-visit MC (FVMC), and its convergence is easy to justify because the first visits to a state yield IID samples of $v_\pi(s)$. Fortunately, every-visit MC (EVMC) has also been proven to converge given infinite samples.
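To make the update above concrete, here is a minimal FVMC prediction sketch in Python (my own illustration, not the book's code). It assumes a tabular, episodic, Gym-style environment whose `reset()` returns an integer state and whose `step(a)` returns `(next_state, reward, done, info)`, a policy given as a callable `pi(s) -> action`, and a constant step size `alpha` (in practice the step size is usually decayed).

```python
import numpy as np

def generate_trajectory(pi, env, max_steps=200):
    """Roll out one episode with policy pi; return a list of (state, action, reward) tuples."""
    trajectory, done = [], False
    s = env.reset()
    while not done and len(trajectory) < max_steps:
        a = pi(s)
        s_next, r, done, _ = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

def fvmc_prediction(pi, env, n_states, gamma=1.0, alpha=0.1, n_episodes=500):
    """First-visit Monte Carlo prediction with a constant step size (a simplification)."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        trajectory = generate_trajectory(pi, env)
        # Compute the return G_t at every step by sweeping the episode backwards.
        G, returns = 0.0, [0.0] * len(trajectory)
        for t in reversed(range(len(trajectory))):
            _, _, r = trajectory[t]
            G = r + gamma * G
            returns[t] = G
        # Update each state only on its first visit within this episode.
        visited = set()
        for t, (s, _, _) in enumerate(trajectory):
            if s in visited:
                continue
            visited.add(s)
            V[s] = V[s] + alpha * (returns[t] - V[s])   # MC error: G_t - V(S_t)
    return V
```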

Temporal-difference learning: Improving estimates after each step

Although the actual return used by MC is an accurate (unbiased) estimate, it has high variance and poor sample efficiency. The idea behind the TD algorithm is bootstrapping: update the estimate using the one-step reward $R_{t+1}$ plus the existing estimate of the value function.

$$V_{t+1}(S_t) = V_t(S_t) + \alpha_t \Big[ \underbrace{\overbrace{R_{t+1} + \gamma V_t(S_{t+1})}^{\text{TD target}} - V_t(S_t)}_{\text{TD error}} \Big]$$

img-20241119154204732|500
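A minimal TD(0) sketch under the same assumptions as the FVMC sketch above; the difference is that the estimate is updated inside the episode, right after every transition, using the bootstrapped TD target.

```python
import numpy as np

def td_prediction(pi, env, n_states, gamma=1.0, alpha=0.05, n_episodes=500):
    """TD(0) prediction: move V(S_t) toward R_{t+1} + gamma * V(S_{t+1}) after every step."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = pi(s)
            s_next, r, done, _ = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)  # no bootstrap from a terminal state
            td_error = td_target - V[s]
            V[s] = V[s] + alpha * td_error
            s = s_next
    return V
```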

【TD vs. DP】
DP methods bootstrap on the one-step expectation while TD methods bootstrap on a sample of the one-step expectation.

MC vs. TD

The comparison below shows only first-visit Monte Carlo prediction (FVMC) and temporal-difference learning (TD).
img-20241119154254665|500

Learning to estimate from multiple steps

img-20241119154358959|300

N-step TD learning

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha_t \Big[ \underbrace{\overbrace{R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})}^{n\text{-step target}} - V_{t+n-1}(S_t)}_{n\text{-step error}} \Big]$$
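A sketch of n-step TD under the same assumptions as the earlier sketches. A small buffer holds the most recent transitions so that, once `n_step` rewards have been observed (or the episode ends early), the oldest buffered state is updated with the n-step target.

```python
import numpy as np

def ntd_prediction(pi, env, n_states, gamma=1.0, alpha=0.05, n_step=3, n_episodes=500):
    """n-step TD prediction: update V(S_t) toward the sum of the next n discounted rewards
    plus the discounted value estimate of the state reached n steps later."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        path = []                       # buffer of (state, reward) pairs, at most n_step long
        while not done or path:
            # Fill the buffer up to n_step transitions (or until the episode ends).
            while not done and len(path) < n_step:
                a = pi(s)
                s_next, r, done, _ = env.step(a)
                path.append((s, r))
                s = s_next
            # Build the target for the oldest state in the buffer.
            n = len(path)
            target = sum(gamma**i * r_i for i, (_, r_i) in enumerate(path))
            if not done:
                target += gamma**n * V[s]   # bootstrap from the state n steps ahead
            s_update, _ = path.pop(0)
            V[s_update] = V[s_update] + alpha * (target - V[s_update])
    return V
```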

Forward-view TD(λ): Improving estimates of all visited states

Forward-view TD(λ) is a prediction method that combines multiple n-step returns into a single update.
img-20241119154727012|500

$$G_{t:T}^{\lambda} = \underbrace{(1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_{t:t+n}}_{\text{sum of weighted returns of the 1 to } T-1 \text{ steps}} + \underbrace{\lambda^{T-t-1}\, G_{t:T}}_{\text{weighted final return } (T)}$$

$$V_{T}(S_t) = V_{T-1}(S_t) + \alpha_t \Big[ \underbrace{\overbrace{G_{t:T}^{\lambda}}^{\lambda\text{-return}} - V_{T-1}(S_t)}_{\lambda\text{-error}} \Big]$$
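The λ-return is a convex mixture of all the n-step returns of the episode. The helper below (an illustration) computes $G_t^\lambda$ for the first state of an already-collected episode, given the list of rewards $R_{t+1},\dots,R_T$ and the bootstrapped values of the states that followed each reward; the offline forward-view update then applies $V(S_t) \leftarrow V(S_t) + \alpha_t[G_t^\lambda - V(S_t)]$ at the end of the episode.

```python
def lambda_return(rewards, next_values, gamma=1.0, lambd=0.8):
    """Compute G_t^lambda for the first state of one episode.
    rewards[k] is R_{t+k+1}; next_values[k] is the current estimate V(S_{t+k+1})."""
    T = len(rewards)
    # n-step returns G_{t:t+n} for n = 1..T (the last one is the full MC return, no bootstrap).
    discounted, n_step_returns = 0.0, []
    for n in range(1, T + 1):
        discounted += gamma**(n - 1) * rewards[n - 1]
        bootstrap = gamma**n * next_values[n - 1] if n < T else 0.0
        n_step_returns.append(discounted + bootstrap)
    # (1 - lambda) * sum_{n=1}^{T-1} lambda^{n-1} G_{t:t+n}  +  lambda^{T-1} G_{t:T}
    weighted = (1 - lambd) * sum(lambd**(n - 1) * g_n
                                 for n, g_n in enumerate(n_step_returns[:-1], start=1))
    return weighted + lambd**(T - 1) * n_step_returns[-1]
```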

TD(λ): Improving estimates of all visited states after each step

At every step, keep track of which states are eligible for an update, and quantify how much each of them should be updated.
img-20241119155555606|500

In reality, TD(λ) with λ = 1 is equal to MC only assuming offline updates, that is, when the updates are accumulated and applied at the end of the episode.
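A minimal backward-view TD(λ) sketch with accumulating eligibility traces, under the same assumptions as the earlier sketches: at every step the single TD error is broadcast to all previously visited states in proportion to their decaying traces.

```python
import numpy as np

def td_lambda_prediction(pi, env, n_states, gamma=1.0, alpha=0.05, lambd=0.8, n_episodes=500):
    """TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        E = np.zeros(n_states)           # eligibility traces, reset at the start of each episode
        s, done = env.reset(), False
        while not done:
            a = pi(s)
            s_next, r, done, _ = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)
            td_error = td_target - V[s]
            E[s] += 1.0                   # accumulating trace (a replacing trace would set E[s] = 1)
            V += alpha * td_error * E     # update every eligible state at once
            E *= gamma * lambd            # decay all traces
            s = s_next
    return V
```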

Simulate

Russell and Norvig’s Gridworld environment

Russell and Norvig’s Gridworld (RNG) is a 3 x 4 grid world in which the agent starts at the bottom-left corner and has to reach the top-right corner.
img-20241119160135210|500
img-20241119160142046|500

FVMC, TD, n-step TD, and TD(λ) in the RNG environment

img-20241119160229697|500
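The sketches above can be compared side by side on any tabular, episodic environment. The glue code below simply reuses the earlier illustrative functions; the commented-out constructors (`make_rng_env`, `some_policy`) are hypothetical placeholders, since this note does not show how the RNG environment or the evaluated policy are built.

```python
import numpy as np

def compare_methods(env, pi, n_states, gamma=1.0):
    """Run each estimator defined above on the same environment and policy, then print the results."""
    estimates = {
        "FVMC":      fvmc_prediction(pi, env, n_states, gamma=gamma),
        "TD":        td_prediction(pi, env, n_states, gamma=gamma),
        "3-step TD": ntd_prediction(pi, env, n_states, gamma=gamma, n_step=3),
        "TD(0.8)":   td_lambda_prediction(pi, env, n_states, gamma=gamma, lambd=0.8),
    }
    for name, V in estimates.items():
        print(f"{name:10s} {np.round(V, 2)}")

# Runs as-is on the RandomWalk sketch from earlier in this note (actions are ignored there):
# compare_methods(RandomWalk(), pi=lambda s: 0, n_states=7)
# Hypothetical usage on the 3 x 4 RNG grid (12 states); the constructors are placeholders:
# compare_methods(make_rng_env(), some_policy, n_states=12, gamma=1.0)
```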

