Indexability of Restless Bandit Problems and Optimality of Whittle's Index for Dynamic Multichannel Access

Paper link: Indexability of Restless Bandit Problems and Optimality of Whittle's Index for Dynamic Multichannel Access

Main

Contribution

Details

Introduction

PROBLEM STATEMENT AND RESTLESS BANDIT FORMULATION

$$
\omega_i(t+1)=
\begin{cases}
p_{11}^{(i)}, & i\in U(t),\ S_i(t)=1\\
p_{01}^{(i)}, & i\in U(t),\ S_i(t)=0\\
\mathcal{T}(\omega_i(t)), & i\notin U(t)
\end{cases}
\qquad
\mathcal{T}(\omega_i(t)) \triangleq \omega_i(t)\,p_{11}^{(i)} + \big(1-\omega_i(t)\big)\,p_{01}^{(i)}
$$

Note that the problem here maximizes reward rather than minimizes cost, so the subsidy introduced later is paid to the passive arm.
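The belief update above can be sketched directly. This is an illustrative helper (the function and argument names are ours, not from the paper):

```python
def belief_update(omega, sensed, observation, p11, p01):
    """One-step belief update for a single Gilbert-Elliott channel.

    omega: current belief that the channel is in the good state (1).
    sensed: whether the channel was sensed (active arm) this slot.
    observation: observed state (1 good / 0 bad); ignored if not sensed.
    p11, p01: transition probabilities P(1->1) and P(0->1).
    """
    if sensed:
        # Sensing reveals the true state, so the next belief is the
        # corresponding one-step transition probability.
        return p11 if observation == 1 else p01
    # Unobserved arm: propagate the belief through the Markov chain,
    # T(omega) = omega*p11 + (1 - omega)*p01.
    return omega * p11 + (1 - omega) * p01
```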

INDEXABILITY AND INDEX POLICIES

An index policy assigns an index to each state of each arm, measuring how rewarding it is to activate that arm in that state.
The myopic policy is a simple example of a strongly decomposable index policy: it ignores the impact of the current action on future reward and focuses solely on maximizing the expected immediate reward.
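A minimal sketch of the myopic rule, using the current beliefs themselves as indices (the function name is ours):

```python
def myopic_policy(beliefs, k):
    """Myopic rule: sense the k arms with the largest current beliefs,
    i.e., maximize the expected immediate reward while ignoring the
    action's effect on future beliefs."""
    # Sort arm indices by belief, descending, and take the top k.
    return sorted(range(len(beliefs)), key=lambda i: -beliefs[i])[:k]
```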

$$
\begin{aligned}
V_{\beta,m}(\omega; u=0) &= m + \beta V_{\beta,m}\big(\mathcal{T}(\omega)\big),\\
V_{\beta,m}(\omega; u=1) &= \omega + \beta\big(\omega V_{\beta,m}(p_{11}) + (1-\omega) V_{\beta,m}(p_{01})\big).
\end{aligned}
$$

$$
\mathcal{P}(m)=\{\omega:\ u^*_m(\omega)=0\}=\{\omega:\ V_{\beta,m}(\omega;u=0)\ge V_{\beta,m}(\omega;u=1)\}
$$

$$
W(\omega)=\inf_m\{m:\ u^*_m(\omega)=0\}=\inf_m\{m:\ V_{\beta,m}(\omega;u=0)=V_{\beta,m}(\omega;u=1)\}
$$
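These definitions suggest a numerical sketch: approximate $V_{\beta,m}$ by value iteration on a discretized belief grid, then bisect on the subsidy $m$ for the smallest value at which staying passive is optimal at belief $\omega$. The grid size, iteration count, and the bracket $[0,1]$ for $m$ are ad hoc assumptions of this sketch, not choices from the paper:

```python
import numpy as np

def value_functions(m, p11, p01, beta=0.9, grid=1001, iters=400):
    """Approximate V_{beta,m}(.; u=0) and V_{beta,m}(.; u=1) by value
    iteration on a uniform belief grid (linear interpolation between
    grid points)."""
    w = np.linspace(0.0, 1.0, grid)
    V = np.zeros(grid)
    for _ in range(iters):
        # Passive arm: collect subsidy m, belief drifts to T(w).
        V0 = m + beta * np.interp(w * p11 + (1 - w) * p01, w, V)
        # Active arm: expected reward w, belief jumps to p11 or p01.
        V1 = w + beta * (w * np.interp(p11, w, V)
                         + (1 - w) * np.interp(p01, w, V))
        V = np.maximum(V0, V1)
    return w, V0, V1

def whittle_index(omega, p11, p01, beta=0.9, tol=1e-4):
    """Bisect for the smallest subsidy m making the passive action
    optimal at belief omega. Per-slot rewards lie in [0, 1], so we
    bracket the index in [0, 1] (an assumption of this sketch)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m = 0.5 * (lo + hi)
        w, V0, V1 = value_functions(m, p11, p01, beta)
        if np.interp(omega, w, V0) >= np.interp(omega, w, V1):
            hi = m  # passive already optimal: try a smaller subsidy
        else:
            lo = m
    return 0.5 * (lo + hi)
```

Sanity checks at the extremes: at $\omega=0$ both actions lead to belief $p_{01}$, so $W(0)=0$; at $\omega=1$ both lead to $p_{11}$, so $W(1)=1$.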

WHITTLE’S INDEX UNDER DISCOUNTED REWARD CRITERION

threshold policy

The optimal policy for the single-armed bandit process with subsidy $m$ is a threshold policy, i.e., there exists an $\omega^*_\beta(m)\in\mathbb{R}$ such that

$$
u^*_m(\omega)=
\begin{cases}
1, & \omega > \omega^*_\beta(m)\\
0, & \omega \le \omega^*_\beta(m)
\end{cases}
$$

and $V_{\beta,m}\big(\omega^*_\beta(m); u=0\big) = V_{\beta,m}\big(\omega^*_\beta(m); u=1\big)$.
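The threshold can be located numerically as the belief at which the passive and active value functions cross. A self-contained discretized sketch (grid size and iteration count are ad hoc assumptions):

```python
import numpy as np

def threshold(m, p11, p01, beta=0.9, grid=2001, iters=400):
    """Approximate omega*_beta(m): the belief at which the passive and
    active value functions cross, via value iteration on a belief grid."""
    w = np.linspace(0.0, 1.0, grid)
    V = np.zeros(grid)
    for _ in range(iters):
        # Passive: subsidy m, belief drifts to T(w) = w*p11 + (1-w)*p01.
        V0 = m + beta * np.interp(w * p11 + (1 - w) * p01, w, V)
        # Active: expected reward w, belief jumps to p11 or p01.
        V1 = w + beta * (w * np.interp(p11, w, V)
                         + (1 - w) * np.interp(p01, w, V))
        V = np.maximum(V0, V1)
    active = V1 > V0  # threshold structure: active above, passive below
    return float(w[active][0]) if active.any() else 1.0
```

A larger subsidy makes the passive action more attractive, so the threshold should increase with $m$.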

![[img-20241014151147301|400]]
![[img-20241014151238136|500]]

![[img-20241014152632072|600]]

WHITTLE’S INDEX POLICY FOR STOCHASTICALLY IDENTICAL CHANNELS

Based on the equivalence between Whittle's index policy and the myopic policy for stochastically identical arms, we can analyze Whittle's index policy through the myopic policy, which has a much simpler index form.
![[img-20241014160357252|500]]
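Since the two policies coincide for stochastically identical arms, simulating the myopic rule suffices to evaluate Whittle's index policy in this regime. The setup below (N identical channels, k sensed per slot, beliefs initialized at the stationary probability) is illustrative, not the paper's experimental configuration:

```python
import random

def simulate_myopic(N, k, p11, p01, horizon=10000, seed=0):
    """Simulate the myopic policy on N stochastically identical
    Gilbert-Elliott channels, sensing k per slot; returns the average
    per-slot reward (throughput)."""
    rng = random.Random(seed)
    state = [rng.random() < 0.5 for _ in range(N)]   # true channel states
    belief = [p01 / (p01 + 1 - p11)] * N             # stationary initial belief
    total = 0
    for _ in range(horizon):
        # Myopic (= Whittle, for identical arms): sense the k largest beliefs.
        chosen = sorted(range(N), key=lambda i: -belief[i])[:k]
        for i in range(N):
            if i in chosen:
                total += state[i]                    # reward 1 if channel good
                belief[i] = p11 if state[i] else p01
            else:
                belief[i] = belief[i] * p11 + (1 - belief[i]) * p01
            # All channels evolve, sensed or not.
            state[i] = rng.random() < (p11 if state[i] else p01)
    return total / horizon
```

For positively correlated channels ($p_{11} > p_{01}$), exploiting the channel memory this way should beat the stationary hit rate of blind sensing.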

Notes

Comment: submitted to IEEE Transactions on Information Theory
2024-09-24 | Zotero


Memo


© 2024 LiQ :) Powered by Obsidian & Github