Universal Reinforcement Learning
Paper link: Universal Reinforcement Learning
We consider an agent interacting with an unmodeled environment. At each time, the agent makes an observation, takes an action, and incurs a cost. Its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose a novel algorithm we call the active LZ algorithm for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, under the active LZ algorithm, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate merits of the algorithm.
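Stated in symbols (a paraphrase of the condition and guarantee in the abstract; the notation is an assumption of these notes, not quoted from the paper):

$$
\Pr\bigl(X_{t+1} \mid X_1^{t}, A_1^{t}\bigr) = \Pr\bigl(X_{t+1} \mid X_{t-K+1}^{t}, A_{t-K+1}^{t}\bigr)
\quad\Longrightarrow\quad
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t) = \lambda^*,
$$

where $g$ is the per-step cost and $\lambda^*$ is the optimal long-run average cost over stationary policies.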
Notes
Main
Contribution
- Uses the history of the previous K slots to predict the distribution of the current observation (Dirichlet-1/2 prior), thereby building a model of the environment (i.e., the transition kernel P); the (discounted) Bellman optimality equation can then be solved (see the reconstruction just below)
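A rough reconstruction of the two ingredients (notation is assumed for these notes, not copied from the paper; $s$ denotes the window of the last $K$ observations and $K-1$ actions, $N_t$ counts context visits, $g$ is the per-step cost):

$$
\hat{P}_t(x' \mid s, a) = \frac{N_t(x', s, a) + 1/2}{N_t(s, a) + |\mathbb{X}|/2},
\qquad
J_\alpha(s) = \min_{a \in \mathbb{A}} \sum_{x' \in \mathbb{X}} \hat{P}_t(x' \mid s, a)\,\bigl[\, g(x', a) + \alpha\, J_\alpha(s') \,\bigr],
$$

where $s'$ is the window obtained from $s$ by appending $x'$ and $a$ and dropping the oldest entries.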
Details
- unmodeled environment
- minimize the long-term average cost
- K-Markov property
- neither P nor even K are known to the agent
- optimal average cost over stationary policies
- [Assumption] The optimal average cost is independent of the initial state; that is, there exists a constant λ* equal to the optimal average cost from every initial state
- If the transition kernel P (and, thereby, K) were known, the optimal policy could be obtained from the Bellman optimality equation (solved in the discounted setting)
- time is parsed into intervals, or 'phrases', each phrase covering a contiguous block of time steps, as in Lempel-Ziv parsing
- estimate of the transition kernel P from context counts (Dirichlet-1/2 prior with a multinomial likelihood), where the relevant count is the number of times the current context has been visited prior to time t (a sketch follows after this list)
Figure: Performance of the active LZ algorithm on Rock-Paper-Scissors relative to the predictive LZ algorithm and the optimal policy.
- Choice of Discount Factor: a sequence of discount factors tending to 1 is used
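A minimal Python sketch of the estimator and the planning step, under simplifying assumptions: a fixed K-step window replaces the paper's LZ phrase-based variable-length contexts, a single discount factor close to 1 replaces the sequence tending to 1, and all names (`update_counts`, `p_hat`, `value_iteration`) and the Rock-Paper-Scissors cost convention are illustrative, not taken from the paper.

```python
# Illustrative sketch, not the paper's active LZ pseudocode.
from collections import defaultdict
from itertools import product

import numpy as np

n_obs, n_act, K = 3, 3, 2  # small finite alphabets (e.g. Rock-Paper-Scissors), window length K

# counts[(obs_window, act_window)] -> count vector over the next observation
counts = defaultdict(lambda: np.zeros(n_obs))

def update_counts(observations, actions):
    """Accumulate, from one trajectory, counts of the next observation given the
    last K observations and the last K actions (current action included)."""
    for t in range(K - 1, min(len(actions), len(observations) - 1)):
        ctx = (tuple(observations[t - K + 1:t + 1]), tuple(actions[t - K + 1:t + 1]))
        counts[ctx][observations[t + 1]] += 1

def p_hat(ctx):
    """Dirichlet-1/2 posterior mean: (N(x', ctx) + 1/2) / (N(ctx) + n_obs / 2).
    Unvisited contexts fall back to the uniform distribution."""
    n = counts[ctx]
    return (n + 0.5) / (n.sum() + 0.5 * n_obs)

def value_iteration(cost, alpha=0.99, iters=200):
    """Approximately solve the discounted Bellman equation on the estimated model.
    States are (last K observations, last K-1 actions); cost[x][a] is the cost of
    playing action a against observation x.  The actual algorithm lets alpha -> 1."""
    states = [(o, a) for o in product(range(n_obs), repeat=K)
              for a in product(range(n_act), repeat=K - 1)]
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for obs_w, act_w in states:
            q_values = []
            for a in range(n_act):
                p = p_hat((obs_w, act_w + (a,)))
                q = sum(p[x] * (cost[x][a]
                                + alpha * V[(obs_w[1:] + (x,), (act_w + (a,))[1:])])
                        for x in range(n_obs))
                q_values.append(q)
            V[(obs_w, act_w)] = min(q_values)  # cost minimization
    return V

# Illustrative Rock-Paper-Scissors cost: 0 = win, 0.5 = draw, 1 = loss
# (rows: opponent's play x; columns: our action a; 0 = rock, 1 = paper, 2 = scissors)
rps_cost = [[0.5, 0.0, 1.0],
            [1.0, 0.5, 0.0],
            [0.0, 1.0, 0.5]]
V = value_iteration(rps_cost)
```

The phrase-based parsing and the schedules for the discount factor and exploration probability are omitted here; the point is only that a count-based Dirichlet-1/2 estimate of P feeds a standard dynamic program.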
Memo
- 📘 unmodeled environment
- 📘 minimize the long-term average cost.
- 📘 active LZ algorithm for optimal control
- 📘 if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum
- 📗 where neither P nor even K are known to the agent
- 📗 there is a finite but unknown dependence on history.
- 📗 such a P has special structure in that, for example, it has no dependence on the player's action A_{t-1} in game t, since this is unknown to the opponent until after game t is played.
- 📘 finding the optimal strategy against an unknown, finite-memory opponent
- 📗 Bertsekas [10] for a discussion of the structural properties of average cost Markov decision problems
- 📘 If the transition kernel P (and, thereby, K) were known
- 📘 dynamic programming
- 📗 set A*_α(x^K, a^{K-1}) of α-discounted optimal actions to be the set of minimizers to the optimization program
- 📘 Blackwell optimal policy
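A hedged reading of the last two highlights (notation assumed, kept consistent with the blocks above rather than quoted from the paper):

$$
\mathcal{A}^*_\alpha(x^K, a^{K-1}) = \operatorname*{arg\,min}_{a \in \mathbb{A}} \; \sum_{x' \in \mathbb{X}} P\bigl(x' \mid x^K, a^{K-1} a\bigr)\,\bigl[\, g(x', a) + \alpha\, J^*_\alpha(s') \,\bigr],
$$

where $s'$ is the window shifted forward to include $x'$ and $a$, and $J^*_\alpha$ is the α-discounted optimal cost. A Blackwell optimal policy is one that remains α-discounted optimal for every discount factor α sufficiently close to 1.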