Universal Reinforcement Learning
Paper link: Universal Reinforcement Learning
We consider an agent interacting with an unmodeled environment. At each time, the agent makes an observation, takes an action, and incurs a cost. Its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose a novel algorithm we call the active LZ algorithm for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, under the active LZ algorithm, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate merits of the algorithm.
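Stated in symbols (a paraphrase of the condition and guarantee in the abstract; the notation is an assumption of these notes, not quoted from the paper):

$$
\Pr\bigl(X_{t+1} \mid X_1^{t}, A_1^{t}\bigr) = \Pr\bigl(X_{t+1} \mid X_{t-K+1}^{t}, A_{t-K+1}^{t}\bigr)
\quad\Longrightarrow\quad
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t) = \lambda^*,
$$

where $g$ is the per-step cost and $\lambda^*$ is the optimal long-run average cost over stationary policies.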
Notes
Main
Contribution
- Uses the history of the previous K slots to predict the distribution of the current observation (Dirichlet-1/2 prior), thereby building a model of the environment (i.e., the transition kernel P); the (discounted) Bellman optimality equation can then be solved (see the reconstruction just below)
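A rough reconstruction of the two ingredients (notation is assumed for these notes, not copied from the paper; $s$ denotes the window of the last $K$ observations and $K-1$ actions, $N_t$ counts context visits, $g$ is the per-step cost):

$$
\hat{P}_t(x' \mid s, a) = \frac{N_t(x', s, a) + 1/2}{N_t(s, a) + |\mathbb{X}|/2},
\qquad
J_\alpha(s) = \min_{a \in \mathbb{A}} \sum_{x' \in \mathbb{X}} \hat{P}_t(x' \mid s, a)\,\bigl[\, g(x', a) + \alpha\, J_\alpha(s') \,\bigr],
$$

where $s'$ is the window obtained from $s$ by appending $x'$ and $a$ and dropping the oldest entries.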
Details
- unmodeled environment
- minimize the long-term average cost
- K-Markov property
- neither P nor even K are known to the agent
- optimal average cost over stationary policies
- [Assumption] The optimal average cost is independent of the initial state; that is, there exists a constant λ* equal to the optimal average cost from every initial state
- If the transition kernel P (and, thereby, K) were known, the optimal policy could be obtained from the Bellman optimality equation (solved in the discounted setting)
- time is parsed into intervals, or 'phrases', each phrase covering a contiguous block of time steps, as in Lempel-Ziv parsing
- estimate of the transition kernel P from context counts (Dirichlet-1/2 prior with a multinomial likelihood), where the relevant count is the number of times the current context has been visited prior to time t (a sketch follows after this list)
Figure: Performance of the active LZ algorithm on Rock-Paper-Scissors relative to the predictive LZ algorithm and the optimal policy.
- Choice of Discount Factor: a sequence of discount factors tending to 1 is used
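A minimal Python sketch of the estimator and the planning step, under simplifying assumptions: a fixed K-step window replaces the paper's LZ phrase-based variable-length contexts, a single discount factor close to 1 replaces the sequence tending to 1, and all names (`update_counts`, `p_hat`, `value_iteration`) and the Rock-Paper-Scissors cost convention are illustrative, not taken from the paper.

```python
# Illustrative sketch, not the paper's active LZ pseudocode.
from collections import defaultdict
from itertools import product

import numpy as np

n_obs, n_act, K = 3, 3, 2  # small finite alphabets (e.g. Rock-Paper-Scissors), window length K

# counts[(obs_window, act_window)] -> count vector over the next observation
counts = defaultdict(lambda: np.zeros(n_obs))

def update_counts(observations, actions):
    """Accumulate, from one trajectory, counts of the next observation given the
    last K observations and the last K actions (current action included)."""
    for t in range(K - 1, min(len(actions), len(observations) - 1)):
        ctx = (tuple(observations[t - K + 1:t + 1]), tuple(actions[t - K + 1:t + 1]))
        counts[ctx][observations[t + 1]] += 1

def p_hat(ctx):
    """Dirichlet-1/2 posterior mean: (N(x', ctx) + 1/2) / (N(ctx) + n_obs / 2).
    Unvisited contexts fall back to the uniform distribution."""
    n = counts[ctx]
    return (n + 0.5) / (n.sum() + 0.5 * n_obs)

def value_iteration(cost, alpha=0.99, iters=200):
    """Approximately solve the discounted Bellman equation on the estimated model.
    States are (last K observations, last K-1 actions); cost[x][a] is the cost of
    playing action a against observation x.  The actual algorithm lets alpha -> 1."""
    states = [(o, a) for o in product(range(n_obs), repeat=K)
              for a in product(range(n_act), repeat=K - 1)]
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for obs_w, act_w in states:
            q_values = []
            for a in range(n_act):
                p = p_hat((obs_w, act_w + (a,)))
                q = sum(p[x] * (cost[x][a]
                                + alpha * V[(obs_w[1:] + (x,), (act_w + (a,))[1:])])
                        for x in range(n_obs))
                q_values.append(q)
            V[(obs_w, act_w)] = min(q_values)  # cost minimization
    return V

# Illustrative Rock-Paper-Scissors cost: 0 = win, 0.5 = draw, 1 = loss
# (rows: opponent's play x; columns: our action a; 0 = rock, 1 = paper, 2 = scissors)
rps_cost = [[0.5, 0.0, 1.0],
            [1.0, 0.5, 0.0],
            [0.0, 1.0, 0.5]]
V = value_iteration(rps_cost)
```

The phrase-based parsing and the schedules for the discount factor and exploration probability are omitted here; the point is only that a count-based Dirichlet-1/2 estimate of P feeds a standard dynamic program.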
Memo
- 📘 unmodeled environment
- 📘 minimize the long-term average cost.
- 📘 active LZ algorithm for optimal control
- 📘 if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum
- 📗 where neither P nor even K are known to the agent
- 📗 there is a finite but unknown dependence on history.
- 📗 such a P has special structure in that, for example, it has no dependence on the player's action A_{t-1} in game t, since this is unknown to the opponent until after game t is played.
- 📘 finding the optimal strategy against an unknown, finite-memory opponent
- 📗 Bertsekas [10] for a discussion of the structural properties of average cost Markov decision problems
- 📘 If the transition kernel P (and, thereby, K) were known
- 📘 dynamic programming
- 📗 set A*_α(x^K, a^{K-1}) of α-discounted optimal actions to be the set of minimizers to the optimization program
- 📘 Blackwell optimal policy
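A hedged reading of the last two highlights (notation assumed, kept consistent with the blocks above rather than quoted from the paper):

$$
\mathcal{A}^*_\alpha(x^K, a^{K-1}) = \operatorname*{arg\,min}_{a \in \mathbb{A}} \; \sum_{x' \in \mathbb{X}} P\bigl(x' \mid x^K, a^{K-1} a\bigr)\,\bigl[\, g(x', a) + \alpha\, J^*_\alpha(s') \,\bigr],
$$

where $s'$ is the window shifted forward to include $x'$ and $a$, and $J^*_\alpha$ is the α-discounted optimal cost. A Blackwell optimal policy is one that remains α-discounted optimal for every discount factor α sufficiently close to 1.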