【Intro to RL】Exploration-Exploitation Algorithms

The challenge of interpreting evaluative feedback

Key intuition: exploration builds the knowledge that allows for effective exploitation, and maximum exploitation is the ultimate goal of any decision maker.
img-20241118205042560|300

MAB problem

img-20241118205254132|300
The problem is named after casino slot machines: each machine is a "one-armed bandit", and pulling its arm yields a random reward. The MAB problem is to find a strategy that balances trying different machines (exploration) against sticking with the machine you currently believe is best (exploitation), so that you collect as much reward as possible within a limited number of pulls.
img-20241118205601509|500

Regret: The cost of exploration

$$\mathcal{T}=\sum_{e=1}^{E}\mathbb{E}\big[v_*-q_*(A_e)\big]$$
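
As a concrete illustration, here is a minimal sketch of how total regret accumulates, assuming made-up true action values `q_star` and a hypothetical action sequence (all numbers are illustrative):

```python
import numpy as np

# Hypothetical two-armed bandit: true action values q_*(a).
q_star = np.array([0.3, 0.7])
v_star = q_star.max()            # value of the optimal action

# A hypothetical sequence of actions A_e taken over E = 8 episodes.
actions = [0, 1, 0, 1, 1, 1, 0, 1]

# Total regret: the value given up by not always selecting the optimal arm.
regret = sum(v_star - q_star[a] for a in actions)
print(regret)   # 3 pulls of the suboptimal arm: 3 * (0.7 - 0.3) ≈ 1.2
```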

Approaches to solving MAB

There are three main kinds of approaches.

Below, the BSW environment is used to illustrate the different exploration strategies.

Greedy: Always exploiting

Baseline: the greedy strategy, or pure-exploitation strategy.
img-20241118211119869|500
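
A minimal sketch of the selection rule, assuming a NumPy array `Q` holding the current action-value estimates (the array name and values are illustrative):

```python
import numpy as np

def greedy(Q):
    """Pure exploitation: always select the arm with the highest current estimate."""
    return int(np.argmax(Q))

# Whatever arm happened to look best early on keeps getting selected,
# even if its lead is just noise from a few lucky pulls.
print(greedy(np.array([0.2, 0.5, 0.1])))   # -> 1
```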

Random: Always explore

img-20241118211307260|500
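
The corresponding sketch for the pure-exploration baseline, again assuming a `Q` array of estimates (which this strategy simply ignores):

```python
import numpy as np

def pure_exploration(Q):
    """Pure exploration: ignore the estimates and pick an arm uniformly at random."""
    return int(np.random.randint(len(Q)))
```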

There is only one way to exploit (maximize the estimated reward), but there are many ways to explore: based on confidence, based on uncertainty, and so on.

ϵ-greedy: Almost always greedy and sometimes random

With probability $\epsilon$, choose a random action; with probability $1-\epsilon$, choose the greedy action.

Notice that the “exploration step” in epsilon-greedy includes the greedy action.
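
A sketch of the rule, assuming the same `Q` array of estimates; note that the random branch draws uniformly over all arms, so it can land on the greedy arm too:

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon take the exploration step (uniform over all arms,
    greedy arm included); otherwise exploit the current estimates."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(Q)))
    return int(np.argmax(Q))
```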

Decaying ϵ-greedy: First maximize exploration, then exploitation

Start from a high $\epsilon$ (close to 1) and gradually decay it as the number of steps grows. (The exact decay schedule can vary: linear decay, exponential decay, and so on.)
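
Two illustrative schedules (parameter names and defaults are made up for this sketch); either one can feed the `epsilon` argument of the epsilon-greedy rule above:

```python
import numpy as np

def linear_epsilon(episode, init_eps=1.0, min_eps=0.01, decay_episodes=500):
    """Linearly anneal epsilon from init_eps down to min_eps, then hold it there."""
    frac = min(episode / decay_episodes, 1.0)
    return init_eps + frac * (min_eps - init_eps)

def exponential_epsilon(episode, init_eps=1.0, min_eps=0.01, decay_rate=0.01):
    """Exponentially anneal epsilon toward min_eps."""
    return min_eps + (init_eps - min_eps) * np.exp(-decay_rate * episode)

# Early episodes explore almost always; later episodes are almost always greedy.
print(linear_epsilon(0), linear_epsilon(250), linear_epsilon(1000))   # ~1.0, ~0.5, 0.01
```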

Optimistic initialization

Treat actions that have not yet been explored enough as if they were the best possible actions: an optimistic exploration strategy.

In practice, however, a suitably high initial value (Q = 1 here) is hard to know in advance; this is discussed in more detail under the UCB algorithm below.
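
A minimal sketch, assuming Q = 1 is the optimistic guess and a fake initial count controls how slowly the optimism fades (both numbers are illustrative):

```python
import numpy as np

def optimistic_init(n_arms, optimistic_q=1.0, initial_count=10):
    """Start every estimate at an optimistic value with a fake pull count;
    acting greedily on these estimates forces every arm to be tried."""
    Q = np.full(n_arms, optimistic_q, dtype=float)
    N = np.full(n_arms, float(initial_count))
    return Q, N

Q, N = optimistic_init(n_arms=2)
action = int(np.argmax(Q))                     # greedy on the optimistic estimates
reward = 0.0                                   # suppose this pull paid nothing
N[action] += 1
Q[action] += (reward - Q[action]) / N[action]  # incremental mean update
print(Q)                                       # the pulled arm drops only slightly below 1.0
```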

Simulation

Two-armed Bernoulli bandit environment

img-20241118213016778|500
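
A sketch of such an environment, matching the setup in the caption below (two arms, pay probabilities drawn uniformly at random); the class and method names are made up for illustration:

```python
import numpy as np

class TwoArmedBernoulliBandit:
    """Two arms; arm i pays reward 1 with probability p[i], else 0.
    The pay probabilities (alpha, beta) are drawn uniformly at random."""
    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.p = self.rng.uniform(size=2)

    def pull(self, arm):
        return int(self.rng.random() < self.p[arm])

env = TwoArmedBernoulliBandit(seed=0)
print(env.p, env.pull(0), env.pull(1))
```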

Performance

img-20241118213100036|500

The pure exploitation and exploration baselines on five two-armed Bernoulli bandit environments with probabilities α and β initialized uniformly at random, and five seeds. Results are means across 25 runs.

Strategic exploration

The epsilon-greedy strategy (and its decaying versions) is still the most popular exploration strategy in use today, perhaps because it performs well, perhaps because of its simplicity. Maybe it’s because most reinforcement learning environments today live inside a computer, and there are very few safety concerns with the virtual world.

Softmax: Select actions randomly in proportion to their estimates

$$\pi(a)=\frac{\exp\left(Q(a)/\tau\right)}{\sum_{b=0}^{B}\exp\left(Q(b)/\tau\right)}$$
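
A sketch of the sampling rule, assuming a `Q` array of estimates; subtracting the max is only the usual numerical-stability trick and does not change the probabilities:

```python
import numpy as np

def softmax_action(Q, tau=0.1):
    """Sample an arm with probability proportional to exp(Q(a) / tau).
    Large tau -> nearly uniform (more exploration); tau -> 0 approaches greedy."""
    prefs = (Q - Q.max()) / tau                   # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(Q), p=probs))
```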

UCB: It's not about optimism, it's about realistic optimism

The simple optimistic-initialization strategy has two obvious drawbacks: the system's maximum reward is not always known in advance, and the initial "count" value has to be tuned and set by hand.

$$A_e=\operatorname*{argmax}_a\left[Q_e(a)+c\sqrt{\frac{\ln e}{N_e(a)}}\right]$$

Here $e$ is the episode index, $N_e(a)$ is the number of times action $a$ has been selected so far, and $c$ is a hyperparameter that controls the scale of the exploration bonus.
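
A sketch of the rule, assuming `Q` and `N` arrays plus the current episode index `e`; untried arms are selected first so the square root never divides by zero:

```python
import numpy as np

def ucb_action(Q, N, e, c=2.0):
    """Estimate plus an uncertainty bonus that shrinks as an arm gets pulled more."""
    if (N == 0).any():
        return int(np.argmax(N == 0))   # try every arm at least once
    bonus = c * np.sqrt(np.log(e) / N)
    return int(np.argmax(Q + bonus))
```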

Suppose a bandit's true reward probability is $p$, it has been pulled $n$ times, and $n_r$ of those pulls were rewarded, so its estimated reward probability is $\tilde{p}=n_r/n$. If, from our pulling experience, we can find a $\delta$ such that $p \le \tilde{p}+\delta$ always holds, then $\tilde{p}+\delta$ is an upper bound on the true probability $p$; in other words, $\tilde{p}+\delta$ pins down the range in which $p$ can lie. With $\tilde{p}$ fixed, the larger $\delta$ is, the larger that range, so $\delta$ reflects, to some extent, the bandit's uncertainty. How do we find such a $\delta$? Hoeffding's inequality.[1]

Hoeffding's inequality gives $P\left(p \le \tilde{p}+\delta\right) \ge 1-e^{-2n\delta^2}$. We want this event to hold with a probability high enough that it can be treated as essentially certain. The total number of pulls so far, $N$, can be used to define "high enough": $(N-1)/N$ looks high enough and also stays below 1. Setting $1-e^{-2n\delta^2}=\frac{N-1}{N}$ and solving gives $\delta=\sqrt{\frac{\ln N}{2n}}$.
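
Spelling out the algebra behind that last step:

$$
1-e^{-2n\delta^2}=\frac{N-1}{N}
\;\Longrightarrow\;
e^{-2n\delta^2}=\frac{1}{N}
\;\Longrightarrow\;
2n\delta^2=\ln N
\;\Longrightarrow\;
\delta=\sqrt{\frac{\ln N}{2n}}
$$

Up to the factor of 2 (absorbed into the hyperparameter $c$), this is exactly the $\sqrt{\ln e / N_e(a)}$ bonus in the UCB selection rule above.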

Thompson sampling: Balancing reward and risk

Thompson sampling strategy is a sample-based probability matching strategy that allows us to use Bayesian techniques to balance the exploration and exploitation trade-off.
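
A minimal sketch of the idea for Bernoulli arms (Beta posteriors); the Gaussian bandit simulated below would use Gaussian posteriors instead, but the probability-matching logic is the same:

```python
import numpy as np

def thompson_action(successes, failures):
    """Keep a Beta posterior per arm, draw one sample from each posterior,
    and select the arm with the largest draw (probability matching)."""
    samples = np.random.beta(successes + 1, failures + 1)   # Beta(1, 1) prior
    return int(np.argmax(samples))

# Arm 0 looks better (7/10 vs 1/3), but arm 1 still gets selected sometimes
# because its posterior is wide, i.e., its estimate is still uncertain.
print(thompson_action(np.array([7, 1]), np.array([3, 2])))
```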

Simulation

10-armed Gaussian bandit environment

img-20241119111936478|300

Performance

img-20241119104349478|500


  1. 【强化学习】多臂老虎机的上置信界算法(全网最通俗) ↩︎

