[Intro to RL] Solving MDPs with Dynamic Programming

In DRL, the agent learns from feedback that is simultaneously sequential (as opposed to one-shot), evaluative (as opposed to supervised), and sampled (as opposed to exhaustive).

This post focuses only on the sequential aspect of that feedback: the environment's transition probabilities and the "true" rewards are assumed to be known.

The objective of the agent

Maximize the expected return:

$$\max_\pi \ \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\right]$$
The Slippery Walk Five (SWF) environment

Slippery Walk Five (SWF) is a one-row grid-world environment (a walk) with stochastic transitions.
img-20241114215047210|500
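As a concrete reference for the rest of this post, below is a minimal sketch of how the SWF MDP could be encoded, assuming a gym-style transition dict in which `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples. The slip probabilities used (50% forward, 1/3 stay in place, 1/6 backward) are illustrative assumptions, not values stated above.

```python
# A sketch of the SWF MDP, assuming a gym-style transition dict:
# P[state][action] -> list of (prob, next_state, reward, done) tuples.
# The slip probabilities (0.50 forward, 1/3 stay, 1/6 backward) are
# illustrative assumptions, not values taken from this post.

N_STATES = 7            # states 0..6; 0 is the hole, 6 is the GOAL (+1 reward)
LEFT, RIGHT = 0, 1

def swf_mdp(p_fwd=0.50, p_stay=1/3, p_back=1/6):
    P = {}
    for s in range(N_STATES):
        P[s] = {}
        terminal = s in (0, N_STATES - 1)
        for a in (LEFT, RIGHT):
            if terminal:
                # terminal states self-loop with zero reward
                P[s][a] = [(1.0, s, 0.0, True)]
                continue
            step = -1 if a == LEFT else 1
            outcomes = []
            for prob, move in ((p_fwd, step), (p_stay, 0), (p_back, -step)):
                ns = min(max(s + move, 0), N_STATES - 1)
                reward = 1.0 if ns == N_STATES - 1 else 0.0
                done = ns in (0, N_STATES - 1)
                outcomes.append((prob, ns, reward, done))
            P[s][a] = outcomes
    return P

P = swf_mdp()
```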

Policies: Per-state action prescriptions

img-20241117143729956|500

State-value function: What to expect from here?

We are looking for the expectation of the return!

Example

img-20241117144103060|500
$$v_\pi(14) = \tfrac{1}{3}\,v_\pi(10) + \tfrac{1}{3}\,v_\pi(14) + \tfrac{1}{3}\bigl(v_\pi(15) + 1\bigr)$$

$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \forall s \in \mathcal{S}$$

img-20241117144513202|500

The state-value function is often referred to as the value function, or even the V-function, or more simply $V^\pi(s)$.

Action-value function: What should I expect from here if I do this?

To select an optimal policy we also need to consider actions: beyond the value of a state alone, we want to quantify the payoff of taking action $a$ in state $s$.

$$q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s)$$

In matrix-vector form, the Bellman expectation equation reads $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

img-20241117145828810|500
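A sketch of the one-step lookahead this equation describes, assuming the hypothetical `P` dict from the SWF sketch above and an estimate `V` of $v_\pi$:

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.99):
    """One-step lookahead: q_pi(s, a) for every action a, given an estimate V of v_pi."""
    Q = np.zeros(len(P[s]))
    for a in range(len(P[s])):
        for prob, next_s, reward, done in P[s][a]:
            # do not bootstrap past terminal transitions
            Q[a] += prob * (reward + gamma * V[next_s] * (not done))
    return Q
```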

Action-advantage function: How much better if I do that?

The action-advantage function measures how much better it is to take action $a$ in state $s$ than to simply follow the policy: $a_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$.

Bellman optimality equation

In this section, $V$, $v$, and $V_\pi$ (and likewise $Q$, $q$, and $Q_\pi$) are used interchangeably; they denote the same quantities.

$$\pi(a|s) = \begin{cases} 1, & a = \arg\max_{a \in \mathcal{A}} Q(s,a) \\ 0, & \text{otherwise} \end{cases}$$

This yields the Bellman optimality equation:

$$V(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V(s')\bigr] = \sum_{s',r} p(s',r|s,a^*)\bigl[r + \gamma V(s')\bigr] = Q_\pi(s,a^*) = \max_a Q_\pi(s,a)$$

where $a^* = \arg\max_a Q_\pi(s,a)$ is the action picked by the greedy policy above.

The Bellman optimality equation says that the value of a state under an optimal policy must equal the expected return of taking the best action from that state.

$$v(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v(s')\bigr]$$

$$q(s,a) = \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma \max_{a'} q(s',a')\bigr]$$

Planning optimal sequences of actions: iterative algorithms

Policy iteration

1|400

Policy evaluation: Rating policies

Policy evaluation computes the V-function of a given policy by sweeping over the state space and iteratively improving the estimate.

This technique of calculating an estimate from an estimate is referred to as bootstrapping, and it’s a widely used technique in RL (including DRL).

Comparing the values obtained around the terminal GOAL state shows that the Careful policy evaluates as slightly better.
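A minimal sketch of iterative policy evaluation under the same assumptions (the hypothetical `P` dict from above, with a deterministic policy passed as a callable `pi(s) -> a`); each sweep applies the Bellman expectation backup to every state until the largest change drops below `theta`:

```python
import numpy as np

def policy_evaluation(pi, P, gamma=0.99, theta=1e-10):
    """Apply the Bellman expectation backup over all states until the
    value estimate changes by less than theta."""
    V = np.zeros(len(P))
    while True:
        prev_V = V.copy()
        for s in range(len(P)):
            v = 0.0
            for prob, next_s, reward, done in P[s][pi(s)]:
                v += prob * (reward + gamma * prev_V[next_s] * (not done))
            V[s] = v
        if np.max(np.abs(prev_V - V)) < theta:
            return V
```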

Policy improvement: Using ratings to get better

We improve an existing policy directly, instead of exhaustively enumerating policies and then comparing their evaluations.
img-20241117160943704|500

Policy Improvement

If $\pi_{k+1} = \arg\max_\pi \left(r_\pi + \gamma P_\pi v_{\pi_k}\right)$, then $v_{\pi_{k+1}} \geq v_{\pi_k}$.
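A sketch of the improvement step under the same assumptions: compute the Q-function implied by the current value estimate `V` (via the hypothetical `P` dict from above) and return the policy that acts greedily with respect to it.

```python
import numpy as np

def policy_improvement(V, P, gamma=0.99):
    """Return the policy that is greedy with respect to the Q-function implied by V."""
    Q = np.zeros((len(P), len(P[0])))
    for s in range(len(P)):
        for a in range(len(P[s])):
            for prob, next_s, reward, done in P[s][a]:
                Q[s][a] += prob * (reward + gamma * V[next_s] * (not done))
    greedy = {s: int(a) for s, a in enumerate(np.argmax(Q, axis=1))}
    return lambda s: greedy[s]
```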

Policy iteration: Improving upon improved behaviors

Policy iteration = policy evaluation + policy improvement, alternated until convergence (the policy no longer changes).
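Putting the two together, a sketch of the full loop that reuses the hypothetical `policy_evaluation` and `policy_improvement` sketches above, starting from a randomly initialized deterministic policy:

```python
import numpy as np

def policy_iteration(P, gamma=0.99, theta=1e-10, seed=0):
    """Alternate evaluation and improvement until the greedy policy stops changing."""
    rng = np.random.default_rng(seed)
    actions = {s: int(rng.integers(len(P[s]))) for s in range(len(P))}
    pi = lambda s: actions[s]                        # random initial policy
    while True:
        old_actions = {s: pi(s) for s in range(len(P))}
        V = policy_evaluation(pi, P, gamma, theta)   # rate the current policy
        pi = policy_improvement(V, P, gamma)         # act greedily w.r.t. its values
        if {s: pi(s) for s in range(len(P))} == old_actions:
            return V, pi
```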

Value iteration

img-20241117200736998|500
Policy iteration only updates the V-function and the policy after a complete run of iterative policy evaluation, which is slow. However, even if policy evaluation is truncated after a single iteration, we can still greedily improve the initial policy via the Q-function after just one sweep over the state space.

$$v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_k(s')\bigr]$$

img-20241117202447823|500
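A sketch of value iteration under the same assumptions: each sweep computes $q_k(s,a)$ for every state-action pair, takes the max over actions as $v_{k+1}$, and extracts the greedy policy once the values stop changing.

```python
import numpy as np

def value_iteration(P, gamma=0.99, theta=1e-10):
    """Truncate policy evaluation to a single greedy sweep: v_{k+1}(s) = max_a q_k(s, a)."""
    V = np.zeros(len(P))
    while True:
        Q = np.zeros((len(P), len(P[0])))
        for s in range(len(P)):
            for a in range(len(P[s])):
                for prob, next_s, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * V[next_s] * (not done))
        new_V = np.max(Q, axis=1)
        if np.max(np.abs(new_V - V)) < theta:
            break
        V = new_V
    greedy = {s: int(a) for s, a in enumerate(np.argmax(Q, axis=1))}
    return V, (lambda s: greedy[s])                  # values and the greedy policy
```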

The VI and PI algorithms in this post both assume that the agent has full access to the MDP: there is no uncertainty, and expected values can be computed directly. This means there is no need to explore, no need to interact, and hence no need for trial-and-error learning (a core part of RL). As a result, they are hard to apply to many practical problems.
Although VI and PI are two different algorithms, from a more general perspective they are two instances of generalized policy iteration (GPI). GPI is a general concept in RL in which a policy is improved using its value-function estimate, while the value-function estimate is improved toward the actual value function of the current policy.

Why does iteration work?

Proofs that PI and VI converge to optimality

Contraction mapping theorem

Contraction mapping

The function $f$ is a contraction mapping (or contractive function) if there exists $\gamma \in (0,1)$ such that

$$\|f(x_1) - f(x_2)\| \leq \gamma \|x_1 - x_2\|$$

for any $x_1, x_2 \in \mathbb{R}^d$, where $\|\cdot\|$ denotes a vector or matrix norm.
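As a quick numerical illustration of this definition in the RL setting (a sketch reusing the hypothetical `P` dict from the SWF section), the Bellman optimality backup used by value iteration satisfies the contraction inequality in the infinity norm, with the discount factor playing the role of $\gamma$:

```python
import numpy as np

def bellman_optimality_backup(V, P, gamma=0.99):
    """Apply f(v)(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')] once."""
    new_V = np.zeros(len(P))
    for s in range(len(P)):
        q = np.zeros(len(P[s]))
        for a in range(len(P[s])):
            for prob, next_s, reward, done in P[s][a]:
                q[a] += prob * (reward + gamma * V[next_s] * (not done))
        new_V[s] = q.max()
    return new_V

gamma = 0.99
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=len(P)), rng.normal(size=len(P))
lhs = np.max(np.abs(bellman_optimality_backup(v1, P, gamma)
                    - bellman_optimality_backup(v2, P, gamma)))
# The backup is a gamma-contraction in the infinity norm.
assert lhs <= gamma * np.max(np.abs(v1 - v2)) + 1e-12
```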

One can show that the value-iteration update $v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_k(s')\bigr]$ is a contraction mapping. Applying the contraction mapping theorem to the Bellman equation for the optimal value function then gives the following result.

Existence, uniqueness, and algorithm of optimal policy

For $v = \max_{\pi \in \Pi}\left(r_\pi + \gamma P_\pi v\right)$, there always exists a unique solution $v^*$, which can be solved for iteratively by

$$v_{k+1} = f(v_k) = \max_{\pi \in \Pi}\left(r_\pi + \gamma P_\pi v_k\right), \quad k = 0, 1, 2, \ldots$$

The value of $v_k$ converges to $v^*$ exponentially fast as $k \to \infty$, given any initial guess $v_0$.

This establishes both the convergence and the optimality of the value iteration algorithm.
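As a small empirical companion to this result (a sketch reusing the hypothetical `P` dict and the `bellman_optimality_backup` function from the contraction-mapping illustration above), the distance to $v^*$ can be checked to shrink by at least a factor of $\gamma$ per sweep:

```python
import numpy as np

gamma = 0.99

v_star = np.zeros(len(P))
for _ in range(5000):                    # near-exact v* for reference
    v_star = bellman_optimality_backup(v_star, P, gamma)

v = np.ones(len(P))                      # arbitrary initial guess v_0
errors = []
for _ in range(20):
    v = bellman_optimality_backup(v, P, gamma)
    errors.append(np.max(np.abs(v - v_star)))

# Each sweep shrinks the distance to v* by at least a factor of gamma.
for prev, cur in zip(errors, errors[1:]):
    assert cur <= gamma * prev + 1e-10
```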

Why can the PI algorithm eventually find an optimal policy?

Convergence of policy iteration

The state-value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.

