【Intro to RL】Exploration-Exploitation Algorithms

The challenge of interpreting evaluative feedback

Key intuition: exploration builds the knowledge that allows for effective exploitation, and maximum exploitation is the ultimate goal of any decision maker.
img-20241118205042560|300

MAB problem

img-20241118205254132|300
The problem is named after casino slot machines: each machine is a "one-armed bandit", and pulling its arm yields a random reward. The MAB problem is to find a strategy that balances trying different machines (exploration) against sticking with the machine you currently believe is best (exploitation), so that you collect as much reward as possible within a limited number of pulls.
img-20241118205601509|500

Regret: The cost of exploration

$$\mathcal{T}=\sum_{e=1}^{E}\mathbb{E}\big[v_*-q_*(A_e)\big]$$
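
As a concrete illustration, here is a minimal sketch of how total regret accumulates, assuming made-up true action values `q_star` and a hypothetical action sequence (all numbers are illustrative):

```python
import numpy as np

# Hypothetical two-armed bandit: true action values q_*(a).
q_star = np.array([0.3, 0.7])
v_star = q_star.max()            # value of the optimal action

# A hypothetical sequence of actions A_e taken over E = 8 episodes.
actions = [0, 1, 0, 1, 1, 1, 0, 1]

# Total regret: the value given up by not always selecting the optimal arm.
regret = sum(v_star - q_star[a] for a in actions)
print(regret)   # 3 pulls of the suboptimal arm: 3 * (0.7 - 0.3) ≈ 1.2
```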

Approaches to solving MAB

There are three main kinds of approaches.

Below, the BSW environment is used to illustrate the different exploration strategies.

Greedy: Always exploiting

Baseline: the greedy strategy, or pure-exploitation strategy.
img-20241118211119869|500
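
A minimal sketch of the selection rule, assuming a NumPy array `Q` holding the current action-value estimates (the array name and values are illustrative):

```python
import numpy as np

def greedy(Q):
    """Pure exploitation: always select the arm with the highest current estimate."""
    return int(np.argmax(Q))

# Whatever arm happened to look best early on keeps getting selected,
# even if its lead is just noise from a few lucky pulls.
print(greedy(np.array([0.2, 0.5, 0.1])))   # -> 1
```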

Random: Always explore

img-20241118211307260|500
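
The corresponding sketch for the pure-exploration baseline, again assuming a `Q` array of estimates (which this strategy simply ignores):

```python
import numpy as np

def pure_exploration(Q):
    """Pure exploration: ignore the estimates and pick an arm uniformly at random."""
    return int(np.random.randint(len(Q)))
```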

There is only one way to exploit (maximize the estimated reward), but there are many ways to explore: based on confidence, based on uncertainty, and so on.

ϵ-greedy: Almost always greedy and sometimes random

With probability $\epsilon$, choose a random action; with probability $1-\epsilon$, choose the greedy action.

Notice that the “exploration step” in epsilon-greedy includes the greedy action.
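
A sketch of the rule, assuming the same `Q` array of estimates; note that the random branch draws uniformly over all arms, so it can land on the greedy arm too:

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon take the exploration step (uniform over all arms,
    greedy arm included); otherwise exploit the current estimates."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(Q)))
    return int(np.argmax(Q))
```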

Decaying ϵ-greedy: First maximize exploration, then exploitation

Start from a high $\epsilon$ (close to 1) and gradually decay it as the number of steps grows. (The exact decay schedule can vary: linear decay, exponential decay, and so on.)
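
Two illustrative schedules (parameter names and defaults are made up for this sketch); either one can feed the `epsilon` argument of the epsilon-greedy rule above:

```python
import numpy as np

def linear_epsilon(episode, init_eps=1.0, min_eps=0.01, decay_episodes=500):
    """Linearly anneal epsilon from init_eps down to min_eps, then hold it there."""
    frac = min(episode / decay_episodes, 1.0)
    return init_eps + frac * (min_eps - init_eps)

def exponential_epsilon(episode, init_eps=1.0, min_eps=0.01, decay_rate=0.01):
    """Exponentially anneal epsilon toward min_eps."""
    return min_eps + (init_eps - min_eps) * np.exp(-decay_rate * episode)

# Early episodes explore almost always; later episodes are almost always greedy.
print(linear_epsilon(0), linear_epsilon(250), linear_epsilon(1000))   # ~1.0, ~0.5, 0.01
```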

Optimistic initialization

Treat actions that have not yet been explored enough as if they were the best possible actions: an optimistic exploration strategy.

In practice, however, a suitably high initial value (Q = 1 here) is hard to know in advance; this is discussed in more detail under the UCB algorithm below.
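
A minimal sketch, assuming Q = 1 is the optimistic guess and a fake initial count controls how slowly the optimism fades (both numbers are illustrative):

```python
import numpy as np

def optimistic_init(n_arms, optimistic_q=1.0, initial_count=10):
    """Start every estimate at an optimistic value with a fake pull count;
    acting greedily on these estimates forces every arm to be tried."""
    Q = np.full(n_arms, optimistic_q, dtype=float)
    N = np.full(n_arms, float(initial_count))
    return Q, N

Q, N = optimistic_init(n_arms=2)
action = int(np.argmax(Q))                     # greedy on the optimistic estimates
reward = 0.0                                   # suppose this pull paid nothing
N[action] += 1
Q[action] += (reward - Q[action]) / N[action]  # incremental mean update
print(Q)                                       # the pulled arm drops only slightly below 1.0
```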

Simulation

Two-armed Bernoulli bandit environment

img-20241118213016778|500
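
A sketch of such an environment, matching the setup in the caption below (two arms, pay probabilities drawn uniformly at random); the class and method names are made up for illustration:

```python
import numpy as np

class TwoArmedBernoulliBandit:
    """Two arms; arm i pays reward 1 with probability p[i], else 0.
    The pay probabilities (alpha, beta) are drawn uniformly at random."""
    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.p = self.rng.uniform(size=2)

    def pull(self, arm):
        return int(self.rng.random() < self.p[arm])

env = TwoArmedBernoulliBandit(seed=0)
print(env.p, env.pull(0), env.pull(1))
```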

Performance

img-20241118213100036|500

The pure exploitation and exploration baselines on five two-armed Bernoulli bandit environments with probabilities α and β initialized uniformly at random, and five seeds. Results are means across 25 runs.

Strategic exploration

The epsilon-greedy strategy (and its decaying versions) is still the most popular exploration strategy in use today, perhaps because it performs well, perhaps because of its simplicity. Maybe it’s because most reinforcement learning environments today live inside a computer, and there are very few safety concerns with the virtual world.

Softmax: Select actions randomly in proportion to their estimates

$$\pi(a)=\frac{\exp\left(Q(a)/\tau\right)}{\sum_{b=0}^{B}\exp\left(Q(b)/\tau\right)}$$
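
A sketch of the sampling rule, assuming a `Q` array of estimates; subtracting the max is only the usual numerical-stability trick and does not change the probabilities:

```python
import numpy as np

def softmax_action(Q, tau=0.1):
    """Sample an arm with probability proportional to exp(Q(a) / tau).
    Large tau -> nearly uniform (more exploration); tau -> 0 approaches greedy."""
    prefs = (Q - Q.max()) / tau                   # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(Q), p=probs))
```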

UCB: It's not about optimism, it's about realistic optimism

The simple optimistic-initialization strategy has two obvious drawbacks: the system's maximum reward is not always known in advance, and the initial "count" value has to be tuned and set by hand.

$$A_e=\operatorname*{argmax}_a\left[Q_e(a)+c\sqrt{\frac{\ln e}{N_e(a)}}\right]$$

Here $e$ is the episode index, $N_e(a)$ is the number of times action $a$ has been selected so far, and $c$ is a hyperparameter that controls the scale of the exploration bonus.
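
A sketch of the rule, assuming `Q` and `N` arrays plus the current episode index `e`; untried arms are selected first so the square root never divides by zero:

```python
import numpy as np

def ucb_action(Q, N, e, c=2.0):
    """Estimate plus an uncertainty bonus that shrinks as an arm gets pulled more."""
    if (N == 0).any():
        return int(np.argmax(N == 0))   # try every arm at least once
    bonus = c * np.sqrt(np.log(e) / N)
    return int(np.argmax(Q + bonus))
```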

Suppose a bandit's true reward probability is $p$, it has been pulled $n$ times, and $n_r$ of those pulls were rewarded, so its estimated reward probability is $\tilde{p}=n_r/n$. If, from our pulling experience, we can find a $\delta$ such that $p \le \tilde{p}+\delta$ always holds, then $\tilde{p}+\delta$ is an upper bound on the true probability $p$; in other words, $\tilde{p}+\delta$ pins down the range in which $p$ can lie. With $\tilde{p}$ fixed, the larger $\delta$ is, the larger that range, so $\delta$ reflects, to some extent, the bandit's uncertainty. How do we find such a $\delta$? Hoeffding's inequality.[1]

Hoeffding's inequality gives $P\left(p \le \tilde{p}+\delta\right) \ge 1-e^{-2n\delta^2}$. We want this event to hold with a probability high enough that it can be treated as essentially certain. The total number of pulls so far, $N$, can be used to define "high enough": $(N-1)/N$ looks high enough and also stays below 1. Setting $1-e^{-2n\delta^2}=\frac{N-1}{N}$ and solving gives $\delta=\sqrt{\frac{\ln N}{2n}}$.
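
Spelling out the algebra behind that last step:

$$
1-e^{-2n\delta^2}=\frac{N-1}{N}
\;\Longrightarrow\;
e^{-2n\delta^2}=\frac{1}{N}
\;\Longrightarrow\;
2n\delta^2=\ln N
\;\Longrightarrow\;
\delta=\sqrt{\frac{\ln N}{2n}}
$$

Up to the factor of 2 (absorbed into the hyperparameter $c$), this is exactly the $\sqrt{\ln e / N_e(a)}$ bonus in the UCB selection rule above.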

Thompson sampling: Balancing reward and risk

Thompson sampling strategy is a sample-based probability matching strategy that allows us to use Bayesian techniques to balance the exploration and exploitation trade-off.
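
A minimal sketch of the idea for Bernoulli arms (Beta posteriors); the Gaussian bandit simulated below would use Gaussian posteriors instead, but the probability-matching logic is the same:

```python
import numpy as np

def thompson_action(successes, failures):
    """Keep a Beta posterior per arm, draw one sample from each posterior,
    and select the arm with the largest draw (probability matching)."""
    samples = np.random.beta(successes + 1, failures + 1)   # Beta(1, 1) prior
    return int(np.argmax(samples))

# Arm 0 looks better (7/10 vs 1/3), but arm 1 still gets selected sometimes
# because its posterior is wide, i.e., its estimate is still uncertain.
print(thompson_action(np.array([7, 1]), np.array([3, 2])))
```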

Simulation

10-armed Gaussian bandit environment

img-20241119111936478|300

Performance

img-20241119104349478|500


  1. 【强化学习】多臂老虎机的上置信界算法(全网最通俗) ↩︎

