[Intro to RL] Exploration-Exploitation Algorithms
The challenge of interpreting evaluative feedback
Key intuition: exploration builds the knowledge that allows for effective exploitation, and maximum exploitation is the ultimate goal of any decision maker.
MAB problem
The problem is named after slot machines in a casino: each machine (a "bandit") has an arm, and pulling the arm produces a random reward. The MAB problem is to find a strategy that balances trying different machines (exploration) against sticking with the machine you currently believe is best (exploitation), so as to collect the most reward within a limited number of pulls.
Regret: The cost of exploration
- Regret: the difference between the optimal value of the MAB and the true Q-value of the action actually selected at that step
- Total regret: the per-step regret summed over all steps (in symbols below)
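In symbols (the notation is assumed here, following the usual bandit convention: $v_*$ is the optimal expected reward and $A_t$ is the action selected at step $t$):

$$\Delta_t = v_* - q_*(A_t), \qquad \text{Total regret after } T \text{ steps: } \sum_{t=1}^{T} \Delta_t$$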
Approaches to solving MAB
There are three main approaches:
- Randomized exploration strategies: inject randomness into the action-selection process (e.g., use a threshold below which a uniformly random action is taken), as in the random policy, epsilon-greedy, and decaying epsilon-greedy. Open questions: how should the threshold be set, and should the amount of randomness stay constant?
- Optimistic exploration strategies: quantify the uncertainty in the decision problem and favor states with high uncertainty (i.e., assume that what has not yet been experienced is good). As exploration proceeds, the estimates gradually approach the true values.
- Information state-space exploration strategies: model the agent's information state as part of the environment, encoding the uncertainty into the state. This, however, enlarges the state space and therefore increases complexity.
Greedy: Always exploiting
Baseline: the greedy strategy, or pure exploitation strategy.
Random: Always explore
There is only one way to exploit (maximize reward), but there are many ways to explore: based on confidence, based on uncertainty, and so on.
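A rough sketch of the two baselines (the function names are my own), assuming Q is a NumPy array of current action-value estimates:

```python
import numpy as np

rng = np.random.default_rng()

def pure_exploitation(Q):
    """Greedy baseline: always pick the action with the highest estimate."""
    return int(np.argmax(Q))

def pure_exploration(Q):
    """Random baseline: ignore the estimates and pick uniformly at random."""
    return int(rng.integers(len(Q)))
```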
Epsilon-greedy: Almost always greedy and sometimes random
With probability $\epsilon$, select among all actions uniformly at random; otherwise act greedily with respect to the current estimates.
Notice that the “exploration step” in epsilon-greedy includes the greedy action.
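A minimal sketch, assuming Q is a NumPy array of current action-value estimates:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon take a uniformly random action (which may
    coincide with the greedy one); otherwise exploit the current estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```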
Decaying epsilon-greedy: First maximize exploration, then exploitation
Start from a high $\epsilon$ (e.g., 1) so that exploration dominates at first, then decay it over time so the strategy gradually shifts toward exploitation.
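One common decay schedule, sketched below (exponential decay is an assumption here; linear decay is also widely used):

```python
import numpy as np

def decaying_epsilon(t, init_eps=1.0, min_eps=0.01, decay_rate=0.01):
    """Exponentially decay epsilon from init_eps toward min_eps as step t grows."""
    return min_eps + (init_eps - min_eps) * np.exp(-decay_rate * t)
```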
Optimistic initialization
Treat actions that have not yet been sufficiently explored as the best they could possibly be; this is an optimistic exploration strategy.
- initialize the Q-function to a high value and act greedily using these estimates.
- initialize the counts to a value higher than 1 (to keep Q from changing too quickly)
In practice, however, the right "high value" for the initialization (e.g., Q = 1) is hard to know in advance; this is discussed in detail later under the UCB algorithm. A minimal sketch follows below.
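A minimal sketch under the assumptions above (the optimistic value of 1.0 and the initial count of 10 are illustrative choices, not prescribed values):

```python
import numpy as np

n_actions = 2
optimistic_value = 1.0   # assumed upper bound on the mean reward
initial_count = 10       # counts > 1 keep Q from leaving the optimistic value too fast

Q = np.full(n_actions, optimistic_value)
N = np.full(n_actions, initial_count, dtype=float)

def select_action():
    """Act greedily on the (optimistic) estimates."""
    return int(np.argmax(Q))

def update(action, reward):
    """Incremental sample-average update of the estimate for the pulled arm."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```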
Simulation
Two-armed Bernoulli bandit environment
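A minimal sketch of such an environment (the class name, arm probabilities, and interface are assumptions for illustration, not the book's code):

```python
import numpy as np

class TwoArmedBernoulliBandit:
    """Each arm pays a reward of 1 with its own probability, otherwise 0."""
    def __init__(self, probs=(0.3, 0.7), seed=None):
        self.probs = np.asarray(probs)
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        """Pull arm `action` and return the sampled reward."""
        return float(self.rng.random() < self.probs[action])
```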
Performance
Strategic exploration
The epsilon-greedy strategy (and its decaying versions) is still the most popular exploration strategy in use today, perhaps because it performs well, perhaps because of its simplicity. Maybe it’s because most reinforcement learning environments today live inside a computer, and there are very few safety concerns with the virtual world.
Softmax: Select actions randomly in proportion to their estimates
- Tie the probability of randomly selecting an action to its estimated action-value, so that low-Q actions are rarely chosen
- A hyperparameter (the temperature) can be added to control how sensitive the algorithm is to differences among the Q-value estimates (see the sketch below)
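A minimal sketch of softmax (Boltzmann) action selection, assuming Q is a NumPy array of estimates; the temperature parameter is the sensitivity hyperparameter mentioned above:

```python
import numpy as np

rng = np.random.default_rng()

def softmax_action(Q, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature).
    Lower temperature -> closer to greedy; higher temperature -> closer to uniform."""
    prefs = (Q - np.max(Q)) / temperature   # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(Q), p=probs))
```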
UCB: It's not about optimism, it's about realistic optimism
The simple optimistic initialization strategy has two obvious drawbacks: the maximum reward of the system is not always known, and the "count" variable has to be tuned and set by hand.
- Upper confidence bound (UCB) strategy: use statistical techniques to compute the uncertainty of the estimates and use that uncertainty as an exploration bonus
- select the action with the highest sum of its Q-value estimate and an action-uncertainty bonus
where the action-uncertainty bonus can be derived as follows.

Suppose the true reward probability of a given bandit is $p$, it has been pulled $N$ times, and $n$ of those pulls returned a reward, so its estimated reward probability is $\hat{p} = n/N$. If, from our pulling experience, we can find a $\Delta$ such that $p < \hat{p} + \Delta$ always holds, then $\hat{p} + \Delta$ is an upper bound on the true probability $p$; in other words, $\Delta$ determines the range in which $p$ can lie. With $\hat{p}$ fixed, the larger $\Delta$ is, the wider that range, so $\Delta$ reflects, to some extent, the uncertainty about this bandit. How do we find such a $\Delta$? Hoeffding's inequality. [1]

Let $X_1, \dots, X_N$ be i.i.d. random variables taking values in $[0, 1]$, with empirical mean $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} X_i$ and true mean $p = \mathbb{E}[X]$. Hoeffding's inequality states that

$$P\left(p \ge \hat{p} + \Delta\right) \le e^{-2N\Delta^2}.$$

Since we want $p < \hat{p} + \Delta$ to hold with probability high enough that we can treat it as (approximately) certain to happen, we use the current total number of pulls $t$ to define "high enough": $1 - \frac{1}{t}$ looks high enough and also satisfies the constraint of being below 1. Setting $e^{-2N\Delta^2} = \frac{1}{t}$ and solving gives

$$\Delta = \sqrt{\frac{\ln t}{2N}},$$

which is the uncertainty bonus added to the Q-value estimate.
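A rough sketch of this selection rule (the function names and the exploration coefficient c are my own additions), assuming Q and N are NumPy arrays of per-arm estimates and pull counts and t is the total number of pulls so far:

```python
import numpy as np

def ucb_action(Q, N, t, c=1.0):
    """Pick argmax over Q + c * sqrt(ln t / (2 N)), matching the Hoeffding-based
    bound derived above. Arms that have never been pulled are tried first."""
    if np.any(N == 0):
        return int(np.argmin(N))   # force at least one pull of every arm
    bonus = c * np.sqrt(np.log(t) / (2 * N))
    return int(np.argmax(Q + bonus))
```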
Thompson sampling: Balancing reward and risk
Thompson sampling strategy is a sample-based probability matching strategy that allows us to use Bayesian techniques to balance the exploration and exploitation trade-off.
- Compared with UCB, it can incorporate prior information from expert knowledge (see the sketch below)
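A minimal sketch, assuming Bernoulli rewards and per-arm NumPy arrays of observed successes and failures (the Beta(1, 1) uniform prior here is an assumption; expert knowledge would go into a more informative prior):

```python
import numpy as np

rng = np.random.default_rng()

def thompson_action(successes, failures):
    """Beta-Bernoulli Thompson sampling: sample a plausible mean reward for each
    arm from its Beta posterior and act greedily on the sampled values."""
    samples = rng.beta(successes + 1, failures + 1)   # Beta(1, 1) prior
    return int(np.argmax(samples))
```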