In DRL, the agent learns from feedback that is simultaneously sequential (as opposed to one-shot), evaluative (as opposed to supervised), and sampled (as opposed to exhaustive).
This part focuses only on the sequential nature of the feedback; that is, the environment's transition probabilities and the "true" reward function are assumed to be known.
The objective of the agent
Maximize the return
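For reference, the return $G_t$ is the (possibly discounted) sum of the rewards collected from time $t$ onward; the discount factor $\gamma \in [0, 1]$ below is the standard convention rather than anything specific to these notes:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$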
The Slippery Walk Five (SWF) environment
Slippery Walk Five (SWF) is a one-row grid-world environment (a walk) with five non-terminal states between two terminal ones; it is stochastic, so an action may move the agent in the intended direction, leave it in place, or even push it backward. The leftmost terminal state yields no reward, while reaching the rightmost terminal state (the goal) yields +1.
【Example of Reward】
An episode in the SWF environment went this way: State 3 (0 reward), state 4 (0 reward), state 5 (0 reward), state 4 (0 reward), state 5 (0 reward), state 6 (+1 reward).
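A few lines of Python make the arithmetic explicit; the discount factor used here is an arbitrary choice for illustration:

```python
# Rewards collected along the episode above (one entry per transition).
rewards = [0, 0, 0, 0, 0, 1]

gamma = 0.99  # assumed discount factor; with gamma = 1.0 the return is exactly 1

# Return from the start of the episode: G = sum_t gamma^t * r_t
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # ~0.951 with gamma = 0.99
```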
Policies: Per-state action prescriptions
Plan: a sequence of actions from the START state to the GOAL state.
Is a plan from the start state to the goal state enough by itself?
Not in a stochastic environment, because the agent may end up in states the plan never visits.
Policy: a universal plan; it covers all possible states and can be stochastic or deterministic.
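Concretely, a deterministic policy can be stored as a per-state lookup table. The sketch below assumes a hypothetical encoding of SWF with states 0–6 (0 and 6 terminal) and actions 0 = Left, 1 = Right:

```python
LEFT, RIGHT = 0, 1

# A deterministic policy: one action prescribed for every state,
# including states the agent only reaches because of stochasticity.
go_right = {s: RIGHT for s in range(7)}          # "always go right"
careless = {0: LEFT, 1: RIGHT, 2: RIGHT, 3: LEFT,
            4: RIGHT, 5: RIGHT, 6: LEFT}         # an arbitrary alternative

print(go_right[3])   # action prescribed in state 3 -> 1 (Right)
```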
How to compare policies? → The state-value function
State-value function: What to expect from here?
Look for the expectation of the return!
Example
The value of a state $s$ under policy $\pi$, written $v_\pi(s)$, is the expected return if the agent starts from state $s$ and follows policy $\pi$ thereafter: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.

Bellman equation:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$
The state-value function is often referred to as the value function, or even the V-function, or more simply $V(s)$.
Action-value function: What should I expect from here if I do this?
Choosing an optimal policy also requires reasoning about actions: besides the value of a state, we also want to quantify the payoff of taking a particular action $a$ in state $s$.
Q-function, or $q_\pi(s, a)$: the expected return if the agent takes action $a$ in state $s$ and follows policy $\pi$ thereafter, $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
In matrix-vector form, the Bellman equation reads $v_\pi = r_\pi + \gamma P_\pi v_\pi$, where $r_\pi$ is the vector of expected one-step rewards and $P_\pi$ the state-transition matrix under $\pi$ (this notation reappears in the convergence proof below).
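As a sanity check on the matrix-vector form, $v_\pi$ can be obtained in closed form as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. The tiny 3-state numbers below are made up purely for illustration:

```python
import numpy as np

# Bellman equation in matrix-vector form: v_pi = r_pi + gamma * P_pi @ v_pi.
P_pi = np.array([[0.9, 0.1, 0.0],     # P_pi[s, s'] under a fixed policy pi
                 [0.1, 0.8, 0.1],
                 [0.0, 0.0, 1.0]])    # state 2 is absorbing (terminal)
r_pi = np.array([0.0, 0.5, 0.0])      # expected one-step reward under pi
gamma = 0.9                           # assumed discount factor

# Solve (I - gamma * P_pi) v = r_pi instead of inverting explicitly.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)
```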
Action-advantage function: How much better if I do that?
A-function, the action-advantage function: $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$, i.e. how much extra return is expected from taking action $a$ in state $s$ instead of acting according to $\pi$.
Calculating the state-value, action-value, and action-advantage functions
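A minimal sketch of how Q and A fall out once V is known, assuming the MDP dynamics are stored gym-style as `P[s][a] = [(prob, next_state, reward, done), ...]` (the format used in the rest of these sketches); V itself can come from the linear solve above or the policy-evaluation sweep further below:

```python
import numpy as np

def q_from_v(P, V, gamma=0.99):
    """Action-value function q_pi(s, a) derived from a state-value function V."""
    n_states, n_actions = len(P), len(P[0])
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s_next, r, done in P[s][a]:
                Q[s, a] += prob * (r + gamma * V[s_next] * (not done))
    return Q

def advantage(Q, V):
    """Action-advantage A_pi(s, a) = q_pi(s, a) - v_pi(s)."""
    return Q - V.reshape(-1, 1)
```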
Bellman optimality equation
In this section, $v^*$ and $v_{\pi^*}$ (and likewise $q^*$ and $q_{\pi^*}$) are interchangeable; they denote the same quantities.
Optimal state-value function: $v^*(s) = \max_\pi v_\pi(s)$ for all $s$.

Optimal action-value function: $q^*(s, a) = \max_\pi q_\pi(s, a)$ for all $s, a$.

Optimal policy: a policy $\pi^*$ such that $v_{\pi^*}(s) \ge v_\pi(s)$ for every policy $\pi$ and every state $s$.

The actions prescribed by an optimal policy and the optimal value functions must satisfy $\pi^*(s) = \arg\max_a q^*(s, a)$ and $v^*(s) = \max_a q^*(s, a)$.

Hence we obtain the Bellman optimality equations.

The Bellman optimality equation states that the value of a state under an optimal policy must equal the expected return of taking the best action in that state.

Bellman equation for the optimal state-value function:

$$v^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v^*(s')\big]$$

Bellman equation for the optimal action-value function:

$$q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q^*(s', a')\big]$$
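The Bellman optimality equation can be applied repeatedly as an update rule; that is exactly the value iteration algorithm, which also reappears in the convergence proof below. A sketch under the same assumed gym-style `P` format:

```python
import numpy as np

def value_iteration(P, gamma=0.99, theta=1e-10):
    """Repeat the Bellman optimality backup v(s) <- max_a q(s, a) until convergence."""
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, r, done in P[s][a]:
                    Q[s, a] += prob * (r + gamma * V[s_next] * (not done))
        new_V = Q.max(axis=1)
        converged = np.max(np.abs(new_V - V)) < theta
        V = new_V
        if converged:
            break
    greedy_pi = {s: int(a) for s, a in enumerate(Q.argmax(axis=1))}  # greedy w.r.t. q*
    return V, greedy_pi
```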
Planning optimal sequences of actions: iterative algorithms

Policy iteration
Policy evaluation: Rating policies
Policy evaluation computes the V-function of a given policy by sweeping through the state space and iteratively improving the estimate.

Iteratively compute

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$

until convergence ($v_k \to v_\pi$ as $k \to \infty$).
Example: policy evaluation in the SWF environment
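A sketch of the evaluation sweep on SWF. The hand-coded dynamics below (intended move 50%, stay ~33%, slip backward ~17%, terminals at states 0 and 6, +1 only for reaching state 6) are my recollection of the environment and should be treated as an assumption:

```python
import numpy as np

LEFT, RIGHT = 0, 1

def build_swf():
    """Gym-style dynamics P[s][a] = [(prob, next_state, reward, done), ...]."""
    P = {}
    for s in range(7):
        P[s] = {}
        for a in (LEFT, RIGHT):
            if s in (0, 6):                          # terminal states self-loop
                P[s][a] = [(1.0, s, 0.0, True)]
                continue
            step = -1 if a == LEFT else 1
            outcomes = [(0.5, s + step), (1/3, s), (1/6, s - step)]
            P[s][a] = [(p, ns, float(ns == 6), ns in (0, 6)) for p, ns in outcomes]
    return P

def policy_evaluation(pi, P, gamma=1.0, theta=1e-10):
    """Iterate v_{k+1}(s) = sum_{s'} p(s'|s, pi(s)) [r + gamma v_k(s')] until convergence."""
    V = np.zeros(len(P))
    while True:
        new_V = np.zeros(len(P))
        for s in P:
            for prob, s_next, r, done in P[s][pi[s]]:
                new_V[s] += prob * (r + gamma * V[s_next] * (not done))
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V

P = build_swf()
go_right = {s: RIGHT for s in range(7)}
print(policy_evaluation(go_right, P))   # estimated V-function of the go-right policy
```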
This technique of calculating an estimate from an estimate is referred to as bootstrapping, and it’s a widely used technique in RL (including DRL).
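For completeness, here is roughly what the improvement step and the full policy iteration loop look like; this is a sketch under the same assumed gym-style `P` format, not a definitive implementation:

```python
import numpy as np

def evaluate(pi, P, gamma=0.99, theta=1e-10):
    """Policy evaluation: V-function of the deterministic policy pi."""
    V = np.zeros(len(P))
    while True:
        new_V = np.zeros(len(P))
        for s in P:
            for prob, s_next, r, done in P[s][pi[s]]:
                new_V[s] += prob * (r + gamma * V[s_next] * (not done))
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V

def improve(V, P, gamma=0.99):
    """Policy improvement: the policy that is greedy with respect to V."""
    Q = np.zeros((len(P), len(P[0])))
    for s in P:
        for a in P[s]:
            for prob, s_next, r, done in P[s][a]:
                Q[s, a] += prob * (r + gamma * V[s_next] * (not done))
    return {s: int(a) for s, a in enumerate(Q.argmax(axis=1))}

def policy_iteration(P, gamma=0.99):
    """Alternate evaluation and improvement until the policy stops changing."""
    pi = {s: 0 for s in P}                 # arbitrary initial policy
    while True:
        V = evaluate(pi, P, gamma)
        new_pi = improve(V, P, gamma)
        if new_pi == pi:                   # stable policy -> optimal
            return V, pi
        pi = new_pi
```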
Why can the PI algorithm eventually find an optimal policy?
Convergence of policy iteration
The state-value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.
Proof
The idea of the proof is to show that the policy iteration algorithm converges faster than the value iteration algorithm.
In particular, to prove the convergence of $\{v_{\pi_k}\}$, we introduce another sequence $\{v_k\}$ generated by

$$v_{k+1} = \max_\pi \big(r_\pi + \gamma P_\pi v_k\big).$$

This iterative algorithm is exactly the value iteration algorithm. We already know that $v_k$ converges to $v^*$ given any initial value $v_0$.

For $k = 0$, we can always find a $v_0$ such that $v_{\pi_0} \ge v_0$ for any $\pi_0$.

We next show that $v_{\pi_k} \ge v_k$ for all $k$ by induction.

For $k \ge 0$, suppose that $v_{\pi_k} \ge v_k$.

For $k + 1$, letting $\pi'_{k+1} = \arg\max_\pi \big(r_\pi + \gamma P_\pi v_k\big)$, we have

$$\begin{aligned}
v_{\pi_{k+1}} - v_{k+1}
&= \big(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\big) - \max_\pi \big(r_\pi + \gamma P_\pi v_k\big) \\
&\ge \big(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_k}\big) - \max_\pi \big(r_\pi + \gamma P_\pi v_k\big) \\
&\ge \big(r_{\pi'_{k+1}} + \gamma P_{\pi'_{k+1}} v_{\pi_k}\big) - \big(r_{\pi'_{k+1}} + \gamma P_{\pi'_{k+1}} v_k\big) \\
&= \gamma P_{\pi'_{k+1}} \big(v_{\pi_k} - v_k\big),
\end{aligned}$$

where the first inequality holds because $v_{\pi_{k+1}} \ge v_{\pi_k}$ (policy improvement) and $P_{\pi_{k+1}}$ is nonnegative, and the second because $\pi_{k+1} = \arg\max_\pi \big(r_\pi + \gamma P_\pi v_{\pi_k}\big)$.

Since $v_{\pi_k} - v_k \ge 0$ and $P_{\pi'_{k+1}}$ is nonnegative, we have $\gamma P_{\pi'_{k+1}}\big(v_{\pi_k} - v_k\big) \ge 0$ and hence $v_{\pi_{k+1}} \ge v_{k+1}$. Therefore, we can show by induction that $v_{\pi_k} \ge v_k$ for any $k \ge 0$. Since $v_k$ converges to $v^*$ and $v^* \ge v_{\pi_k} \ge v_k$, $v_{\pi_k}$ also converges to $v^*$.