[Intro to RL] Solving MDPs with Dynamic Programming

In DRL, the agent learns from feedback that is simultaneously sequential (as opposed to one-shot), evaluative (as opposed to supervised), and sampled (as opposed to exhaustive).

This post focuses only on the sequential aspect of that feedback: the environment's transition probabilities and the "true" rewards are assumed to be known.

The objective of the agent

Maximize the expected return:

$$\max_\pi \ \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\right]$$
The Slippery Walk Five (SWF) environment

Slippery Walk Five (SWF) is a one-row grid-world environment (a walk) with stochastic transitions.
img-20241114215047210|500
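As a concrete reference for the rest of this post, below is a minimal sketch of how the SWF MDP could be encoded, assuming a gym-style transition dict in which `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples. The slip probabilities used (50% forward, 1/3 stay in place, 1/6 backward) are illustrative assumptions, not values stated above.

```python
# A sketch of the SWF MDP, assuming a gym-style transition dict:
# P[state][action] -> list of (prob, next_state, reward, done) tuples.
# The slip probabilities (0.50 forward, 1/3 stay, 1/6 backward) are
# illustrative assumptions, not values taken from this post.

N_STATES = 7            # states 0..6; 0 is the hole, 6 is the GOAL (+1 reward)
LEFT, RIGHT = 0, 1

def swf_mdp(p_fwd=0.50, p_stay=1/3, p_back=1/6):
    P = {}
    for s in range(N_STATES):
        P[s] = {}
        terminal = s in (0, N_STATES - 1)
        for a in (LEFT, RIGHT):
            if terminal:
                # terminal states self-loop with zero reward
                P[s][a] = [(1.0, s, 0.0, True)]
                continue
            step = -1 if a == LEFT else 1
            outcomes = []
            for prob, move in ((p_fwd, step), (p_stay, 0), (p_back, -step)):
                ns = min(max(s + move, 0), N_STATES - 1)
                reward = 1.0 if ns == N_STATES - 1 else 0.0
                done = ns in (0, N_STATES - 1)
                outcomes.append((prob, ns, reward, done))
            P[s][a] = outcomes
    return P

P = swf_mdp()
```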

Policies: Per-state action prescriptions

img-20241117143729956|500

State-value function: What to expect from here?

We are looking for the expectation of the return!

Example

img-20241117144103060|500
$$v_\pi(14) = \tfrac{1}{3}\,v_\pi(10) + \tfrac{1}{3}\,v_\pi(14) + \tfrac{1}{3}\bigl(v_\pi(15) + 1\bigr)$$

$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \forall s \in \mathcal{S}$$

img-20241117144513202|500

The state-value function is often referred to as the value function, or even the V-function, or more simply $V^\pi(s)$.

Action-value function: What should I expect from here if I do this?

To select an optimal policy we also need to consider actions: beyond the value of a state alone, we want to quantify the payoff of taking action $a$ in state $s$.

$$q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s)$$

In matrix-vector form, the Bellman expectation equation reads $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

img-20241117145828810|500
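A sketch of the one-step lookahead this equation describes, assuming the hypothetical `P` dict from the SWF sketch above and an estimate `V` of $v_\pi$:

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.99):
    """One-step lookahead: q_pi(s, a) for every action a, given an estimate V of v_pi."""
    Q = np.zeros(len(P[s]))
    for a in range(len(P[s])):
        for prob, next_s, reward, done in P[s][a]:
            # do not bootstrap past terminal transitions
            Q[a] += prob * (reward + gamma * V[next_s] * (not done))
    return Q
```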

Action-advantage function: How much better if I do that?

The action-advantage function measures how much better it is to take action $a$ in state $s$ than to simply follow the policy: $a_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$.

Bellman optimality equation

In this section, $V$, $v$, and $V_\pi$ (and likewise $Q$, $q$, and $Q_\pi$) are used interchangeably; they denote the same quantities.

$$\pi(a|s) = \begin{cases} 1, & a = \arg\max_{a \in \mathcal{A}} Q(s,a) \\ 0, & \text{otherwise} \end{cases}$$

This yields the Bellman optimality equation:

$$V(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V(s')\bigr] = \sum_{s',r} p(s',r|s,a^*)\bigl[r + \gamma V(s')\bigr] = Q_\pi(s,a^*) = \max_a Q_\pi(s,a)$$

where $a^* = \arg\max_a Q_\pi(s,a)$ is the action picked by the greedy policy above.

The Bellman optimality equation says that the value of a state under an optimal policy must equal the expected return of taking the best action from that state.

$$v(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v(s')\bigr]$$

$$q(s,a) = \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma \max_{a'} q(s',a')\bigr]$$

Planning optimal sequences of actions: iterative algorithms

Policy iteration

1|400

Policy evaluation: Rating policies

Policy evaluation computes the V-function of a given policy by sweeping over the state space and iteratively improving the estimate.

This technique of calculating an estimate from an estimate is referred to as bootstrapping, and it’s a widely used technique in RL (including DRL).

Comparing the values obtained around the terminal GOAL state shows that the Careful policy evaluates as slightly better.
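A minimal sketch of iterative policy evaluation under the same assumptions (the hypothetical `P` dict from above, with a deterministic policy passed as a callable `pi(s) -> a`); each sweep applies the Bellman expectation backup to every state until the largest change drops below `theta`:

```python
import numpy as np

def policy_evaluation(pi, P, gamma=0.99, theta=1e-10):
    """Apply the Bellman expectation backup over all states until the
    value estimate changes by less than theta."""
    V = np.zeros(len(P))
    while True:
        prev_V = V.copy()
        for s in range(len(P)):
            v = 0.0
            for prob, next_s, reward, done in P[s][pi(s)]:
                v += prob * (reward + gamma * prev_V[next_s] * (not done))
            V[s] = v
        if np.max(np.abs(prev_V - V)) < theta:
            return V
```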

Policy improvement: Using ratings to get better

We improve an existing policy directly, instead of exhaustively enumerating policies and then comparing their evaluations.
img-20241117160943704|500

Policy Improvement

If $\pi_{k+1} = \arg\max_\pi \left(r_\pi + \gamma P_\pi v_{\pi_k}\right)$, then $v_{\pi_{k+1}} \geq v_{\pi_k}$.
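A sketch of the improvement step under the same assumptions: compute the Q-function implied by the current value estimate `V` (via the hypothetical `P` dict from above) and return the policy that acts greedily with respect to it.

```python
import numpy as np

def policy_improvement(V, P, gamma=0.99):
    """Return the policy that is greedy with respect to the Q-function implied by V."""
    Q = np.zeros((len(P), len(P[0])))
    for s in range(len(P)):
        for a in range(len(P[s])):
            for prob, next_s, reward, done in P[s][a]:
                Q[s][a] += prob * (reward + gamma * V[next_s] * (not done))
    greedy = {s: int(a) for s, a in enumerate(np.argmax(Q, axis=1))}
    return lambda s: greedy[s]
```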

Policy iteration: Improving upon improved behaviors

Policy iteration = policy evaluation + policy improvement, alternated until convergence (the policy no longer changes).
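Putting the two together, a sketch of the full loop that reuses the hypothetical `policy_evaluation` and `policy_improvement` sketches above, starting from a randomly initialized deterministic policy:

```python
import numpy as np

def policy_iteration(P, gamma=0.99, theta=1e-10, seed=0):
    """Alternate evaluation and improvement until the greedy policy stops changing."""
    rng = np.random.default_rng(seed)
    actions = {s: int(rng.integers(len(P[s]))) for s in range(len(P))}
    pi = lambda s: actions[s]                        # random initial policy
    while True:
        old_actions = {s: pi(s) for s in range(len(P))}
        V = policy_evaluation(pi, P, gamma, theta)   # rate the current policy
        pi = policy_improvement(V, P, gamma)         # act greedily w.r.t. its values
        if {s: pi(s) for s in range(len(P))} == old_actions:
            return V, pi
```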

Value iteration

img-20241117200736998|500
Policy iteration only updates the V-function and the policy after a complete run of iterative policy evaluation, which is slow. However, even if policy evaluation is truncated after a single iteration, we can still greedily improve the initial policy via the Q-function after just one sweep over the state space.

$$v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_k(s')\bigr]$$

img-20241117202447823|500
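A sketch of value iteration under the same assumptions: each sweep computes $q_k(s,a)$ for every state-action pair, takes the max over actions as $v_{k+1}$, and extracts the greedy policy once the values stop changing.

```python
import numpy as np

def value_iteration(P, gamma=0.99, theta=1e-10):
    """Truncate policy evaluation to a single greedy sweep: v_{k+1}(s) = max_a q_k(s, a)."""
    V = np.zeros(len(P))
    while True:
        Q = np.zeros((len(P), len(P[0])))
        for s in range(len(P)):
            for a in range(len(P[s])):
                for prob, next_s, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * V[next_s] * (not done))
        new_V = np.max(Q, axis=1)
        if np.max(np.abs(new_V - V)) < theta:
            break
        V = new_V
    greedy = {s: int(a) for s, a in enumerate(np.argmax(Q, axis=1))}
    return V, (lambda s: greedy[s])                  # values and the greedy policy
```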

The VI and PI algorithms in this post both assume that the agent has full access to the MDP: there is no uncertainty, and expected values can be computed directly. This means there is no need to explore, no need to interact, and hence no need for trial-and-error learning (a core part of RL). As a result, they are hard to apply to many practical problems.
Although VI and PI are two different algorithms, from a more general perspective they are two instances of generalized policy iteration (GPI). GPI is a general concept in RL in which a policy is improved using its value-function estimate, while the value-function estimate is improved toward the actual value function of the current policy.

Why does iteration work?

Proofs that PI and VI converge to optimality

Contraction mapping theorem

Contraction mapping

The function $f$ is a contraction mapping (or contractive function) if there exists $\gamma \in (0,1)$ such that

$$\|f(x_1) - f(x_2)\| \leq \gamma \|x_1 - x_2\|$$

for any $x_1, x_2 \in \mathbb{R}^d$, where $\|\cdot\|$ denotes a vector or matrix norm.
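As a quick numerical illustration of this definition in the RL setting (a sketch reusing the hypothetical `P` dict from the SWF section), the Bellman optimality backup used by value iteration satisfies the contraction inequality in the infinity norm, with the discount factor playing the role of $\gamma$:

```python
import numpy as np

def bellman_optimality_backup(V, P, gamma=0.99):
    """Apply f(v)(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')] once."""
    new_V = np.zeros(len(P))
    for s in range(len(P)):
        q = np.zeros(len(P[s]))
        for a in range(len(P[s])):
            for prob, next_s, reward, done in P[s][a]:
                q[a] += prob * (reward + gamma * V[next_s] * (not done))
        new_V[s] = q.max()
    return new_V

gamma = 0.99
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=len(P)), rng.normal(size=len(P))
lhs = np.max(np.abs(bellman_optimality_backup(v1, P, gamma)
                    - bellman_optimality_backup(v2, P, gamma)))
# The backup is a gamma-contraction in the infinity norm.
assert lhs <= gamma * np.max(np.abs(v1 - v2)) + 1e-12
```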

One can show that the value-iteration update $v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma v_k(s')\bigr]$ is a contraction mapping. Applying the contraction mapping theorem to the Bellman equation for the optimal value function then gives the following result.

Existence, uniqueness, and algorithm of optimal policy

For $v = \max_{\pi \in \Pi}\left(r_\pi + \gamma P_\pi v\right)$, there always exists a unique solution $v^*$, which can be solved for iteratively by

$$v_{k+1} = f(v_k) = \max_{\pi \in \Pi}\left(r_\pi + \gamma P_\pi v_k\right), \quad k = 0, 1, 2, \ldots$$

The value of $v_k$ converges to $v^*$ exponentially fast as $k \to \infty$, given any initial guess $v_0$.

This establishes both the convergence and the optimality of the value iteration algorithm.
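As a small empirical companion to this result (a sketch reusing the hypothetical `P` dict and the `bellman_optimality_backup` function from the contraction-mapping illustration above), the distance to $v^*$ can be checked to shrink by at least a factor of $\gamma$ per sweep:

```python
import numpy as np

gamma = 0.99

v_star = np.zeros(len(P))
for _ in range(5000):                    # near-exact v* for reference
    v_star = bellman_optimality_backup(v_star, P, gamma)

v = np.ones(len(P))                      # arbitrary initial guess v_0
errors = []
for _ in range(20):
    v = bellman_optimality_backup(v, P, gamma)
    errors.append(np.max(np.abs(v - v_star)))

# Each sweep shrinks the distance to v* by at least a factor of gamma.
for prev, cur in zip(errors, errors[1:]):
    assert cur <= gamma * prev + 1e-10
```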

Why can the PI algorithm eventually find an optimal policy?

Convergence of policy iteration

The state-value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.

