[Intro to RL] Mathematical Foundations
Using the mathematical framework of Markov decision processes (MDPs) to model complex sequential decision making under uncertainty.
Components of reinforcement learning
- The reinforcement learning interaction loop (main components: agent, environment, action space, and observations)
- Game AI (e.g., Go or chess)
    - Agent: the game-playing AI.
    - Environment: the current state of the board and pieces.
    - Action Space: all legal moves of the pieces.
    - Observations: the current board layout, including the positions of the agent's and the opponent's pieces.
- Stock trading
    - Agent: the trading algorithm.
    - Environment: the stock market, including prices, trading volumes, and so on.
    - Action Space: buy, sell, or hold a stock.
    - Observations: historical prices, trading volumes, market news, and other information.
- Energy management (e.g., smart grids)
    - Agent: the grid management system.
    - Environment: the dynamic system of power supply and demand, including weather conditions, user demand, and so on.
    - Action Space: adjust power generation and allocate electricity resources.
    - Observations: current power demand, generation costs, weather forecasts, and so on.
- Path planning for autonomous vehicles
    - Agent: the vehicle's decision-making system.
    - Environment: the road network, traffic conditions, weather, and so on.
    - Action Space: choose among different routes and speeds.
    - Observations: the vehicle's position and speed, and the positions of surrounding vehicles and obstacles.
As the proverb "塞翁失马,焉知非福" (a setback may turn out to be a blessing) suggests, the long-term consequences that unfold over time are hard to grasp in real situations. RL assumes that the actions we take and the events they later cause are related, even though humans find it difficult to connect them reliably; RL helps tackle exactly these problems.
Learning from feedback that is simultaneously evaluative, sequential, and sampled is hard, and deep reinforcement learning is one approach to this class of problems.
Practical deployment remains an unavoidable issue for reinforcement learning.
The agent: The decision maker
The environment: Everything else
- A common way to represent decision-making processes in RL is by modeling the problem using a mathematical framework known as Markov decision processes (MDPs).
- State space: the set of variables relevant to describing the environment
- State: the values those variables take at a given moment; it can also be defined as the agent's status with respect to the environment
- Observation: the agent may be unable to perceive the actual environment state directly, so the observation is not necessarily equal to the true state
State and observation are terms used interchangeably in the RL community. This is because often agents are allowed to see the internal state of the environment, but this isn’t always the case.
- Action space: the set of actions the agent can choose from
- Transition probability/function
- Reward function
- Environment model = transition function + reward function
The bandit walk environment
Bandit walk (BW) is a simple grid-world (GW) environment.
- Environments that have a single non-terminal state are called “bandit” environments. “Bandit” here is an analogy to slot machines, which are also known as “one-armed bandits.”
- BW action dynamics: a three-cell walk with a single non-terminal start state in the middle, a hole on the left, and a goal on the right; Left and Right move the agent deterministically, and reaching the goal yields a +1 reward (a dictionary sketch of this MDP follows the questions below).
- Some Questions
- Why do the terminal states have actions that transition back to themselves? It seems wasteful.
- What if the environment is stochastic?
See the BSW environment, where state transitions are stochastic.
- What exactly is an environment that is “stochastic”?
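As a concrete illustration, here is a minimal sketch of the BW MDP as a Python dictionary in the gym-style convention P[state][action] = list of (probability, next state, reward, done) tuples. The state indices (0 = hole, 1 = start, 2 = goal) and action indices (0 = Left, 1 = Right) are assumed labels for illustration.

```python
# Bandit walk (BW): 3 states, 2 actions, deterministic transitions.
# Assumed labeling: states 0 = hole (terminal), 1 = start, 2 = goal (terminal);
#                   actions 0 = Left, 1 = Right.
P = {
    0: {  # hole: terminal state, every action loops back with no reward
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)],
    },
    1: {  # start: the single non-terminal state
        0: [(1.0, 0, 0.0, True)],   # Left  -> hole, reward 0, episode ends
        1: [(1.0, 2, 1.0, True)],   # Right -> goal, reward +1, episode ends
    },
    2: {  # goal: terminal state, every action loops back with no reward
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)],
    },
}
```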
The bandit slippery walk environment
Bandit slippery walk (BSW): the surface of the walk is slippery and each action has a 20% chance of sending the agent backwards.
In most practical applications it is hard to obtain the transition function directly, but some agents, such as model-based RL agents, do have access to (or learn) a transition function.
Implementations of these environments can be found through the gym library.
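To see how stochasticity shows up in the same representation, here is a sketch of the start state of the BSW MDP, using the same assumed state/action labeling as the BW sketch above: each action succeeds 80% of the time and slips backwards 20% of the time.

```python
# Bandit slippery walk (BSW): same layout as BW, but the surface is slippery,
# so every action has a 20% chance of moving the agent in the opposite direction.
P_bsw_start = {
    1: {  # start state (index 1, as in the BW sketch)
        0: [(0.8, 0, 0.0, True),    # Left succeeds  -> hole, reward 0
            (0.2, 2, 1.0, True)],   # Left slips     -> goal, reward +1
        1: [(0.8, 2, 1.0, True),    # Right succeeds -> goal, reward +1
            (0.2, 0, 0.0, True)],   # Right slips    -> hole, reward 0
    },
}
```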
Agent-environment interaction cycle
- Reward design
    - The denser the reward signal, the more supervision the agent receives and the faster it learns, but more bias is injected into the solution and unexpected behaviors become less likely.
    - Should all rewards be negative, or all positive?
      In fact, it is the relative reward values rather than their absolute values that determine encouragement or discouragement, because optimal policies are invariant under positive affine transformations of the rewards (see the short derivation after this list).
- Task duration
    - Episodic tasks: tasks that have a natural ending, such as a game
    - Continuing tasks: tasks without a natural ending, such as learning forward motion
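As a quick check of the reward-invariance claim above, here is a short derivation. It assumes an infinite-horizon discounted return with discount factor $\gamma < 1$ (defined in the Discount section below) and a positive scaling factor $a > 0$; transform every reward as $R' = a\,R + b$:

$$
G'_t \;=\; \sum_{k=0}^{\infty} \gamma^k \bigl(a\,R_{t+k+1} + b\bigr)
\;=\; a \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;+\; b \sum_{k=0}^{\infty} \gamma^k
\;=\; a\,G_t + \frac{b}{1-\gamma}
$$

Since $a > 0$ and $b/(1-\gamma)$ is the same constant for every policy, the ordering of policies by expected return, and hence the optimal policy, is unchanged.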
MDPs: The engine of the environment
MDPs can be represented as Python dictionaries built from the descriptions of the problems.
MDP representation: a nested dictionary mapping states to actions to lists of (probability, next state, reward, done) tuples, as in the BW and BSW sketches above.
The frozen lake environment
Frozen lake (FL) is a simple grid-world (GW) environment.
- If the agent chooses to move down, there's a 33.3% chance it moves down, a 33.3% chance it moves left, and a 33.3% chance it moves right (the frozen surface is slippery).
- If the agent tries to move out of the grid world, it bounces back to the cell it tried to move from.
- Reaching the goal state yields a +1 reward; all other transitions yield 0.
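These stochastic transitions can be inspected directly in the classic gym implementation of the environment. A minimal sketch, assuming the `gym` package with the `FrozenLake-v1` registration is installed (the dictionary lives on the unwrapped environment):

```python
import gym

# FrozenLake-v1 is slippery by default (is_slippery=True).
env = gym.make('FrozenLake-v1')

# The underlying MDP is stored as a nested dictionary:
#   P[state][action] = [(probability, next_state, reward, done), ...]
P = env.unwrapped.P

# Taking action 1 (Down) in state 0 (top-left corner) has three equally likely
# outcomes: the intended move plus the two perpendicular directions.
for prob, next_state, reward, done in P[0][1]:
    print(f"prob={prob:.3f} next_state={next_state} reward={reward} done={done}")
```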
States: Specific configurations of the environment
$\mathcal{S}$: the state space, which can be finite or infinite
- Markov property: the probability of the next state depends only on the current state and action, not on the earlier history, i.e., $P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots)$
  The Markov assumption simplifies the problem and is widely made, but not every problem fits it.
- Absorbing/terminal state: a state in which every available action transitions, with probability 1, back to the state itself, and these transitions provide no reward
  If a terminal state's self-transitions carried a reward, staying in that state forever would accumulate infinite reward, which is not what we want.
The BW, BSW, and FL environments are episodic tasks, because they have terminal states.
Transition function: Consequences of agent actions
$T(s, a, s')$: the transition function, the probability of reaching state $s'$ after taking action $a$ in state $s$
From the transition function we can see that the BW environment is deterministic, while the BSW and FL environments are stochastic.
A key assumption of many RL (and DRL) algorithms is that this probability distribution is stationary: although the transitions are stochastic, the distribution itself does not change during training.
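A quick way to verify that a dictionary-based MDP such as the sketches above is well formed is to check that, for every state-action pair, the transition probabilities sum to 1. A minimal sketch (the helper name `check_mdp` is illustrative):

```python
def check_mdp(P, tol=1e-6):
    """Verify that every state-action pair's transition probabilities sum to 1."""
    for s, actions in P.items():
        for a, transitions in actions.items():
            total = sum(prob for prob, _, _, _ in transitions)
            assert abs(total - 1.0) < tol, f"P[{s}][{a}] sums to {total}"
    print("All transition probabilities sum to 1.")

# Example usage with the BW dictionary sketched earlier:
# check_mdp(P)
```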
Actions: A mechanism to influence the environment
- $\mathcal{A}(s)$: the actions available depend on the agent's current state
- $\mathcal{A}$: the action space, which may be finite or infinite
- All available actions are known in advance, and the agent may select actions deterministically or stochastically.
  This is different from the environment reacting to actions deterministically or stochastically, which was discussed earlier: the latter shows up in the transition probabilities $T(s, a, s')$, while the former shows up in how the agent selects its actions.
Reward signal
$R(s, a, s')$: the reward; it can also be represented as $r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{s'} T(s, a, s')\, R(s, a, s')$ (taking the expectation over the next state $s'$)
Reward design is great fun!
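The expectation above can be computed directly from the dictionary representation by weighting each transition's reward by its probability. A minimal sketch (the function name `expected_reward` is illustrative):

```python
def expected_reward(P, s, a):
    """r(s, a) = sum over s' of T(s, a, s') * R(s, a, s')."""
    return sum(prob * reward for prob, _, reward, _ in P[s][a])

# With the BSW sketch above, going Right (action 1) from the start state (state 1):
# expected_reward(P_bsw_start, 1, 1) == 0.8 * 1.0 + 0.2 * 0.0 == 0.8
```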
Horizon: Time changes what’s optimal
- Finite horizon
    - The agent knows the task will terminate in a finite number of time steps.
    - The BW and BSW both have a greedy planning horizon: the episode terminates immediately after a single interaction. In fact, all bandit environments have greedy horizons.
- Infinite horizon
- when the agent doesn’t have a predetermined time step limit, so the agent plans for an infinite number of time steps.
Discount: The future is uncertain, value it less
To tell the agent that getting a +1 reward sooner is better than getting it later.
$\gamma$: the discount factor. It is part of the MDP definition, but it is also used as a hyperparameter for reducing variance, and is therefore often left for the agent to tune.
- Return (gain): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Recursive form: $G_t = R_{t+1} + \gamma\, G_{t+1}$
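Both forms of the return give the same value; here is a minimal sketch computing a finite-horizon discounted return both ways from a list of rewards (function names are illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed as a direct sum."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]  # a +1 reward received three steps from now
print(discounted_return(rewards, 0.99))            # ~0.9801
print(discounted_return_recursive(rewards, 0.99))  # ~0.9801
```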
Other MDP frameworks
POMDP
Partially observable Markov decision process (POMDP): When the agent cannot fully observe the environment state
Continuous Time/Action/State MDP
When time, actions, states, or any combination of them are continuous
SMDP
Semi-Markov decision process (SMDP): Allows the inclusion of abstract actions that can take multiple time steps to complete
MMDP
Multi-agent Markov decision process (MMDP): Allows the inclusion of multiple agents in the same environment
Dec-MDP
Decentralized Markov decision process (Dec-MDP): Allows for multiple agents to collaborate and maximize a common reward
FMDP
Factored Markov decision process (FMDP): Allows the representation of the transition and reward function more compactly so that we can represent large MDPs
RMDP
Relational Markov decision process (RMDP): Allows the combination of probabilistic and relational knowledge