[RL Basics] Mathematical Foundations

Use the mathematical framework of Markov decision processes (MDPs) to model complex sequential decision-making under uncertainty.

Components of reinforcement learning

Examples of RL

  • Game AI (e.g., Go or chess)
    • Agent: the game AI.
    • Environment: the current state of the board and pieces.
    • Action Space: all legal moves of the pieces.
    • Observations: the current board layout, including the positions of the agent's and the opponent's pieces.
  • Stock trading
    • Agent: the trading algorithm.
    • Environment: the stock market, including prices, trading volume, and so on.
    • Action Space: buy, sell, or hold a stock.
    • Observations: historical prices, trading volume, market news, and other information.
  • Energy management (e.g., smart grids)
    • Agent: the grid management system.
    • Environment: the dynamic system of electricity supply and demand, including weather conditions, user demand, etc.
    • Action Space: adjusting power generation and allocating electricity resources.
    • Observations: current electricity demand, generation costs, weather forecasts, etc.
  • Path planning for autonomous vehicles
    • Agent: the vehicle's decision-making system.
    • Environment: the road network, traffic conditions, weather, etc.
    • Action Space: choosing among different routes and speeds.
    • Observations: the vehicle's position and speed, and the positions of surrounding vehicles and obstacles.

"A blessing in disguise" (塞翁失马, 焉知非福): in real-world situations it is hard to grasp the long-term consequences that unfold over time. RL assumes that the actions taken and the events that later follow are connected, even when humans struggle to make that connection; RL helps tackle exactly this kind of problem.
Learning is difficult when the feedback is simultaneously evaluative, sequential, and sampled; deep reinforcement learning is one approach to this class of problems.
Practical deployment challenges that reinforcement learning cannot sidestep

The agent: The decision maker

img-20241114162338377|500

The environment: Everything else

"Observation": the agent may not have direct access to the actual environment state, so an observation is not necessarily equal to the true state.
State and observation are terms used interchangeably in the RL community. This is because often agents are allowed to see the internal state of the environment, but this isn’t always the case.

The bandit walk environment

Bandit walk (BW) is a simple grid-world (GW) environment.
img-20241114163322871|500

The bandit slippery walk environment

Bandit slippery walk (BSW): the surface of the walk is slippery and each action has a 20% chance of sending the agent backwards.
img-20241114163837165|500

In real applications it is usually hard to obtain the transition function directly, but in settings such as model-based RL the agent can access (a model of) the transition function.

Agent-environment interaction cycle

img-20241114164331344|500

MDPs: The engine of the environment

Python dictionaries representing MDPs from descriptions of the problems.

MDP representation: MDP(S, A, T, R, Sθ, γ, H), where S is the state space, A the action space, T the transition function, R the reward signal, Sθ the initial state distribution, γ the discount factor, and H the horizon.
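As an example of the dictionary representation mentioned above, here is a minimal sketch of the bandit slippery walk (BSW) MDP. It assumes the Gym-style convention P[state][action] = [(probability, next_state, reward, done), ...], a state/action numbering chosen purely for illustration, and that the 20% slip chance described earlier leaves an 80% chance of moving in the intended direction, with a +1 reward on reaching the right-hand goal cell.

```python
# Minimal sketch of the BSW MDP as a nested Python dict (for illustration only).
# Convention: P[state][action] = [(probability, next_state, reward, done), ...]
# States: 0 = left terminal (hole), 1 = start, 2 = right terminal (goal, +1 on arrival)
# Actions: 0 = left, 1 = right; each action is assumed to work 80% of the time
# and to send the agent in the opposite direction 20% of the time.
P = {
    0: {  # terminal states transition to themselves with zero reward
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)],
    },
    1: {  # starting state
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],  # move left
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)],  # move right
    },
    2: {
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)],
    },
}
```

Note that the terminal states (0 and 2) loop back to themselves with zero reward, which matches the discussion of episodic tasks below.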

The frozen lake environment

Frozen lake (FL) is a simple grid-world (GW) environment.

States: Specific configurations of the environment

The Markov property simplifies the problem and is a widely used assumption, but not every problem fits this assumption.
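For reference, the Markov property is usually written as the following conditional-independence statement (standard formulation, added here for clarity): the next state depends only on the current state and action, not on the full history.

$$
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)
$$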

If transitioning from a terminal state to itself carried a reward, the agent could collect infinite reward simply by staying there, which is not what we want; terminal-state self-transitions therefore yield zero reward.
The BW, BSW, and FL environments are episodic tasks because they have terminal states.
img-20241114165952878|500

Transition function: Consequences of agent actions

From the transition function we can see that the BW environment is deterministic, while the BSW and FL environments are stochastic.

img-20241114171604279|500

A key assumption in many RL (and DRL) algorithms is that the probability distribution p(·|s, a) is stationary: although transitions are stochastic, the underlying distribution does not change during training.
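A quick sketch of what "stochastic but stationary" means in practice, sampling from the BSW dictionary P defined in the sketch above (the helper function and the number of samples are assumptions made for illustration):

```python
import random

def sample_transition(P, state, action):
    """Sample (next_state, reward, done) from a fixed transition model P.

    Individual outcomes are random, but because P itself never changes,
    the distribution p(. | s, a) stays the same throughout training.
    """
    transitions = P[state][action]
    weights = [prob for prob, _, _, _ in transitions]
    _, next_state, reward, done = random.choices(transitions, weights=weights)[0]
    return next_state, reward, done

# Take "right" (action 1) from the start state (1) of the BSW sketch above:
# roughly 80% of the samples should land in state 2 with a +1 reward.
samples = [sample_transition(P, 1, 1) for _ in range(1000)]
success_rate = sum(1 for s, _, _ in samples if s == 2) / len(samples)
print(f"empirical p(s'=2 | s=1, a=right) ~ {success_rate:.2f}")
```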

Actions: A mechanism to influence the environment

This is different from the earlier point about the environment reacting to actions deterministically or stochastically: that is captured by the transition probability T(s, a, s′), whereas this concerns the set of actions A(s) available for selection in each state.

Reward signal

img-20241114172116764|500
img-20241114172138891|500

Horizon: Time changes what’s optimal

Discount: The future is uncertain, value it less

The discount tells the agent that getting +1 sooner is better than getting it later.
img-20241114185502579|500
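Concretely, the discount enters through the return: a reward received k steps in the future is weighted by γ^k (this is the standard definition of the discounted return, added here for reference).

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$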

γ is part of the MDP definition, but it is also used as a hyperparameter for reducing variance, and is therefore often left for the agent designer to tune.
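A small sketch of how different values of γ change the worth of the same reward sequence (the reward sequence itself is made up purely for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A made-up episode: a single +1 reward that only arrives at the 10th step.
rewards = [0.0] * 9 + [1.0]
for gamma in (1.0, 0.99, 0.9, 0.5):
    print(f"gamma={gamma}: return = {discounted_return(rewards, gamma):.4f}")
# The smaller gamma is, the less the same (late) +1 is worth to the agent.
```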

Other MDP frameworks

POMDP

Partially observable Markov decision process (POMDP): When the agent cannot fully observe the environment state
img-20241114190254928|500
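One common way to formalize this (stated here in the same tuple style as the MDP above; treat the extra symbols as an assumption of this note) is to extend the MDP with an observation space O and an emission (observation) probability function E that defines which observations the agent is likely to receive in each state:

POMDP(S, A, T, R, Sθ, γ, H, O, E)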

Continuous Time/Action/State MDP

When time, the action space, the state space, or any combination of them is continuous

SMDP

Semi-Markov decision process (SMDP): Allows the inclusion of abstract actions that can take multiple time steps to complete

MMDP

Multi-agent Markov decision process (MMDP): Allows the inclusion of multiple agents in the same environment

Dec-MDP

Decentralized Markov decision process (Dec-MDP): Allows for multiple agents to collaborate and maximize a common reward

FMDP

Factored Markov decision process (FMDP): Allows the representation of the transition and reward function more compactly so that we can represent large MDPs

RMDP

Relational Markov decision process (RMDP): Allows the combination of probabilistic and relational knowledge

