【RL Basics】Policy Evaluation in the Model-free Setting
Learning to estimate the value of policies
The most direct idea: repeatedly run the policy in the environment, then average the returns obtained.
Reward: the one-step reward signal the agent receives.
Return: the total discounted reward.
Value function: the expectation (or average) of the return. What the agent ultimately tries to optimize is the expected total discounted reward, i.e., the value function.
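In symbols (standard notation; the original formulas are not reproduced in these notes):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
$$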
The random walk environment
- In the random walk (RW) environment, the agent moves left with 50% probability and right with 50% probability, regardless of the action it takes.
Alternatively, the RW environment can be viewed as deterministic, with the agent choosing actions uniformly at random; this is the policy we evaluate.
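For reference, here is a minimal Python sketch of such a random walk environment; the number of non-terminal states (5) and the +1 reward at the rightmost terminal state are assumptions, not details given above.

```python
import random

class RandomWalk:
    """Minimal random walk environment: a chain of states where the two
    ends are terminal and the agent drifts left/right with probability
    0.5 each, regardless of the chosen action."""

    def __init__(self, n_states=7):        # 7 = 5 non-terminal + 2 terminal (assumed size)
        self.n_states = n_states
        self.reset()

    def reset(self):
        self.state = self.n_states // 2    # start in the middle state
        return self.state

    def step(self, action=None):
        # The action is ignored: 50% left, 50% right.
        self.state += 1 if random.random() < 0.5 else -1
        done = self.state in (0, self.n_states - 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0  # +1 only at the right end (assumed)
        return self.state, reward, done
```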
The Monte Carlo algorithm
Run the policy for several episodes to collect multiple trajectories, then estimate the value function by averaging, for each state, the returns obtained from that state.
- experience tuple: (S_t, A_t, R_{t+1}, S_{t+1})
- trajectory: a sequence of experiences
How should the returns be counted? (A single trajectory may contain multiple visits to the same state.)
- First-visit Monte Carlo (FVMC): for each state, use only the return following the first visit.
- Every-visit Monte Carlo (EVMC): average the returns following every visit to the state.
The more “standard” version is FVMC, and its convergence properties are easy to justify because each first-visit return is an IID sample of the true value v_π(s). Fortunately, EVMC has also been proven to converge given infinite samples.
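A minimal sketch of first-visit Monte Carlo prediction in Python, assuming a tabular setting and the `reset()`/`step()` interface of the random-walk sketch above (the helper names are mine, not from the original):

```python
from collections import defaultdict

def fvmc_prediction(env, policy, gamma=0.99, n_episodes=500):
    """First-visit Monte Carlo prediction: for every episode, average the
    return that follows the FIRST visit to each state."""
    V = defaultdict(float)
    visit_counts = defaultdict(int)

    for _ in range(n_episodes):
        # Roll out one full trajectory of (state, reward) pairs.
        trajectory, state, done = [], env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            trajectory.append((state, reward))
            state = next_state

        # Walk backwards, accumulating the discounted return from each step.
        G, first_visit_return = 0.0, {}
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            first_visit_return[state] = G   # overwritten until the earliest visit remains

        # Running-average update of each state's value with its first-visit return.
        for state, G in first_visit_return.items():
            visit_counts[state] += 1
            V[state] += (G - V[state]) / visit_counts[state]
    return V
```

For the random walk above any policy will do, since actions are ignored, e.g. `V = fvmc_prediction(RandomWalk(), policy=lambda s: 0)`. Every-visit MC would instead average G at every occurrence of the state rather than keeping only the earliest one.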
Incremental methods: methods that iteratively refine their estimates. Dynamic programming is an incremental method, but it does not interact with the environment; RL methods are also incremental.
Sequential methods: methods that learn in environments with more than one non-terminal state. Dynamic programming is a sequential method; bandits are not.
Trial-and-error methods: methods that learn from interaction with the environment. Dynamic programming is not trial-and-error learning; bandits and RL are.
Temporal-difference learning: Improving estimates after each step
Although the actual return used by MC is a very accurate (unbiased) estimate, it has high variance and is sample-inefficient. The core idea of the TD algorithm: bootstrapping, i.e., using a one-step return.
- recursive-style return: G_t = R_{t+1} + γ·G_{t+1}
【TD vs. DP】
DP methods bootstrap on the one-step expectation while TD methods bootstrap on a sample of the one-step expectation.
- The TD target, R_{t+1} + γ·V(S_{t+1}), is a biased estimate of the true state-value function v_π(S_t), but it also has a much lower variance than the actual return.
- The TD algorithm is the forerunner of many algorithms, such as SARSA, Q-learning, double Q-learning, and DQN.
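A minimal TD(0) prediction sketch under the same assumed interface; the TD target r + γ·V(s') replaces the full return:

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.99, alpha=0.05, n_episodes=500):
    """TD(0) prediction: after every single step, nudge V(s) toward the
    TD target r + gamma * V(s'), bootstrapping on the current estimate."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + gamma * V[next_state] * (not done)  # no bootstrapping past terminal states
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```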
MC vs. TD
The comparison shows only first-visit Monte Carlo prediction (FVMC) and temporal-difference learning (TD).
- MC estimates are noisy; TD estimates are off-target
Learning to estimate from multiple steps
N-step TD learning
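N-step TD sits between TD(0) and MC: collect n real rewards, then bootstrap. In standard notation (an assumption here, since no formula is given above), the n-step return used as the update target is:

$$
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
$$

With n = 1 this is exactly the TD(0) target, and letting n run to the end of the episode recovers the MC return.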
Forward-view TD(λ): Improving estimates of all visited states
Forward-view TD(λ) combines all the n-step returns into a single update target, weighting each n-step return by an exponentially decaying factor of λ.
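Written out (again standard notation, assumed rather than quoted), the λ-return that the forward view targets is:

$$
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_{t:t+n}
$$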
TD(λ): Improving estimates of all visited states after each step
Keep track of which states are eligible for an update at each step (via eligibility traces), and quantify how much each of them should be updated.
When λ = 0, TD(λ) reduces to one-step TD, i.e., TD(0); when λ = 1, TD(1) is (in theory) equivalent to MC.
In reality, TD(1) is equal to MC only with offline updates, that is, when the updates are accumulated and applied at the end of the episode.
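A minimal sketch of backward-view TD(λ) with accumulating eligibility traces, under the same tabular assumptions and hypothetical `reset()`/`step()` interface as the earlier sketches:

```python
from collections import defaultdict

def td_lambda_prediction(env, policy, gamma=0.99, alpha=0.05,
                         lambda_=0.5, n_episodes=500):
    """Backward-view TD(lambda): an eligibility trace marks how 'eligible'
    each visited state is for the current TD error."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        traces = defaultdict(float)            # eligibility trace per state
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            traces[state] += 1.0               # accumulating trace for the visited state
            for s in traces:                   # every eligible state shares the update
                V[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lambda_   # decay eligibility over time
            state = next_state
    return V
```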
Simulation
Russell and Norvig’s Gridworld environment
Russell and Norvig’s Gridworld (RNG) is a 3 × 4 grid world in which the agent starts at the bottom-left corner and has to reach the top-right corner.