Universal Reinforcement Learning

Paper link: Universal Reinforcement Learning

Notes

Main

Contribution

Details

The objective is the long-run average cost:

$$\limsup_{T\to\infty}\ \mathbb{E}\!\left[\frac{1}{T}\sum_{t=1}^{T} g(X_t, A_t, X_{t+1})\right]$$

The state process has $K$-th order Markov dynamics in the past states and actions:

$$\Pr\left(X_t = x_t \mid \mathcal{F}_{t-1}\right) = P\!\left(x_t \mid X_{t-K}^{t-1},\, A_{t-K}^{t-1}\right)$$

The optimal average cost from a context $(x^K, a^{K-1})$ is

$$\lambda(x^K, a^{K-1}) = \inf_{\nu}\ \limsup_{T\to\infty}\ \mathbb{E}_{\nu}\!\left[\frac{1}{T}\sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) \,\middle|\, x^K, a^{K-1}\right]$$

The discounted value recursion over contexts (the context window shifts by one symbol at each step):

$$J(x^K, a^{K-1}) = \min_{a_K}\ \sum_{x_{K+1}} P\!\left(x_{K+1} \mid x^K, a^K\right)\left[\,g(x_K, a_K, x_{K+1}) + \alpha\, J\!\left(x_2^{K+1}, a_2^K\right)\right]$$

Transition probabilities are estimated from context counts with add-$1/2$ smoothing:

$$\hat{P}\!\left(x_{\ell+1} \mid x^\ell, a^\ell\right) = \frac{N\!\left(x^{\ell+1}, a^\ell\right) + 1/2}{\sum_{x'} N\!\left((x^\ell, x'), a^\ell\right) + |\mathcal{X}|/2}$$
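One sweep of the discounted recursion can be sketched for the simplest case $K=1$, where a context is just the current state. This is a minimal illustration, not the paper's tree-based implementation; all names (`P`, `g`, `bellman_backup`) are assumptions for the sketch.

```python
def bellman_backup(J, P, g, alpha):
    """One sweep of J(s) = min_a sum_s2 P(s2|s,a) [g(s,a,s2) + alpha*J(s2)].

    J      : list of current value estimates, one per state (costs, minimised)
    P      : P[s][a][s2] = estimated transition probability
    g      : g(s, a, s2) = one-step cost
    alpha  : discount factor in (0, 1)
    """
    n_states = len(J)
    n_actions = len(P[0])
    J_new = []
    for s in range(n_states):
        # Q-value of each action, then greedy (minimising) choice
        q = [sum(P[s][a][s2] * (g(s, a, s2) + alpha * J[s2])
                 for s2 in range(n_states))
             for a in range(n_actions)]
        J_new.append(min(q))
    return J_new
```

Iterating `bellman_backup` to convergence yields the discounted fixed point; the paper applies the analogous recursion over LZ-tree contexts with the estimated $\hat{P}$.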

$N(x^{\ell+1}, a^\ell)$ is the number of times the context $(x^{\ell+1}, a^\ell)$ has been visited prior to time $t$.
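The count-based estimator $\hat{P}$ can be sketched directly from its formula: a visit counter keyed by (context, actions, next symbol), with add-$1/2$ smoothing so that an unseen context yields the uniform distribution. The storage scheme and function names here are illustrative assumptions, not the paper's data structure.

```python
from collections import defaultdict

def make_estimator(alphabet):
    """Smoothed context-count estimator matching the P-hat formula above.

    Returns (update, p_hat); contexts and action sequences are tuples.
    """
    counts = defaultdict(int)  # counts[(context, actions, next_state)]

    def update(context, actions, next_state):
        # Record one observed transition out of this context.
        counts[(context, actions, next_state)] += 1

    def p_hat(next_state, context, actions):
        # (N + 1/2) / (sum_x' N + |X|/2), as in the formula above
        num = counts[(context, actions, next_state)] + 0.5
        den = sum(counts[(context, actions, x)] for x in alphabet) \
            + len(alphabet) / 2
        return num / den

    return update, p_hat
```

With no observations, `p_hat` returns $1/|\mathcal{X}|$ for every symbol, and the estimates always sum to one over the alphabet.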

![[img-20241127212413017|600]]
![[img-20241127212329109|500]]

Performance of the active LZ algorithm on Rock-Paper-Scissors relative to the predictive LZ algorithm and the optimal policy.
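As a toy companion to the figure: the "optimal policy" baseline in a Rock-Paper-Scissors match attains the best-response value against the opponent's (predictable) play. A minimal sketch of that value for a fixed opponent mixture; the payoff convention (+1 win, -1 loss, 0 tie) is an assumption for illustration, not the paper's experiment.

```python
def best_response_value(opp_probs):
    """Expected payoff of the best response to a fixed opponent mixture
    over (rock, paper, scissors): +1 win, -1 loss, 0 tie."""
    # payoff[my_move][opp_move]
    payoff = [[0, -1, 1],   # rock
              [1, 0, -1],   # paper
              [-1, 1, 0]]   # scissors
    values = [sum(p * payoff[m][o] for o, p in enumerate(opp_probs))
              for m in range(3)]
    return max(values)
```

Against a uniform opponent the best-response value is 0, which is why any edge in the figure comes from exploiting structure in the opponent's sequence.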

Memo


© 2024 LiQ :) Powered by Obsidian & Github