Indexability of Restless Bandit Problems and Optimality of Whittle's Index for Dynamic Multichannel Access
We consider a class of restless multi-armed bandit problems (RMBP) that arises in dynamic multichannel access, user/server scheduling, and optimal activation in multi-agent systems. For this class of RMBP, we establish the indexability and obtain Whittle’s index in closed-form for both discounted and average reward criteria. These results lead to a direct implementation of Whittle’s index policy with remarkably low complexity. When these Markov chains are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions. Furthermore, it has a semi-universal structure that obviates the need to know the Markov transition probabilities. The optimality and the semi-universal structure result from the equivalency between Whittle’s index policy and the myopic policy established in this work. For non-identical channels, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation. The tightness of the upper bound and the near-optimal performance of Whittle’s index policy are illustrated with simulation examples.
Main Contribution
- establish the indexability and obtain Whittle’s index in closed-form
- optimality of the Whittle index (WI) policy under certain conditions (in general, only under the relaxed constraint)
- equivalency between Whittle’s index policy and the myopic policy
- robustness under stochastically identical arms (no need for the exact transition probability values; only the order of $p_{11}$ and $p_{01}$ is required)
Details
Introduction
- Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed Bandit Processes (MBP)
- Whittle's index policy is the optimal solution to RMBP under a relaxed constraint: the number of activated arms can vary over time provided that its average over the infinite horizon equals $K$.
- Dynamic Multichannel Access: probing $N$ independent Markov chains; the objective is to design an optimal policy that governs the selection of $K$ chains at each time to maximize the long-run reward (a minimal simulation sketch follows below).
  - The above general problem arises in a wide range of communication systems, including cognitive radio networks, downlink scheduling in cellular systems, opportunistic transmission over fading channels, and resource-constrained jamming and anti-jamming.
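A minimal sketch of this setup (assuming unit transmission rates and identical channels; the parameter names `p01`, `p11` and the random-selection baseline are my own, not the paper's):

```python
import random

def step(state, p01, p11):
    """Advance one Gilbert-Elliott channel by one slot (1 = good, 0 = bad)."""
    return int(random.random() < (p11 if state == 1 else p01))

def simulate(N=5, K=2, T=10_000, p01=0.3, p11=0.7):
    """Select K of N channels uniformly at random each slot; the reward is the
    number of selected channels found in the good state."""
    states = [random.randint(0, 1) for _ in range(N)]
    total = 0
    for _ in range(T):
        chosen = random.sample(range(N), K)
        total += sum(states[n] for n in chosen)
        states = [step(s, p01, p11) for s in states]
    return total / T  # average reward per slot of the random baseline

print(simulate())
```

Any sensing policy discussed in these notes (myopic, Whittle's index) would replace the random `chosen` line.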
- when arms are stochastically identical, bounds on the approximation factor of Whittle's index policy: at least 1/2 for negatively correlated channels, with further improvement for positively correlated channels under certain conditions (see Performance below)
- an efficient algorithm to compute the performance upper bound
- when arms are stochastically identical, Whittle's index policy has a semi-universal structure that obviates the need to know the Markov transition probabilities; the only required knowledge about the Markovian model is the order of $p_{11}$ and $p_{01}$. This semi-universal structure reveals the robustness of Whittle's index policy against model mismatch and variations.
PROBLEM STATEMENT AND RESTLESS BANDIT FORMULATION
- $N$ independent Gilbert-Elliott channels, each with its own transmission rate; in each slot $t$, a set of $K$ channels is chosen for sensing and access.
- policy: a function that maps the belief vector to the action (the set of channels to access) in slot $t$.
- performance measure: the expected total discounted reward over the infinite horizon; the RMBP with the average reward criterion is also considered.
- Note: the formulation here maximizes reward rather than minimizes cost, so the subsidy introduced later is paid to the passive arm.
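A sketch of the discounted objective in my own notation ($\beta$ the discount factor, $A(t)$ the set of $K$ channels chosen in slot $t$, $B_n$ the rate of channel $n$, $S_n(t)\in\{0,1\}$ its state, $\Omega(1)$ the initial belief vector; the paper's symbols may differ):

$$
\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{\,t-1}\sum_{n\in A(t)} B_n\,S_n(t)\;\middle|\;\Omega(1)\right],\qquad |A(t)| = K,
$$

with the average-reward criterion obtained by replacing the discounted sum with the long-run time average.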
INDEXABILITY AND INDEX POLICIES
An index policy assigns an index to each state of each arm to measure how rewarding it is to activate the arm at that state.
A myopic policy is a simple example of strongly decomposable index policies. This policy ignores the impact of the current action on the future reward, focusing solely on maximizing the expected immediate reward.
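A minimal sketch of the myopic rule for this problem (my own illustration: with belief $\omega_n$ that channel $n$ is good and rate $B_n$, pick the $K$ channels with the largest expected immediate reward $B_n\omega_n$):

```python
import heapq

def myopic_selection(beliefs, rates, K):
    """Return the indices of the K channels with the largest expected
    immediate reward B_n * omega_n (ties broken arbitrarily)."""
    return heapq.nlargest(K, range(len(beliefs)), key=lambda n: rates[n] * beliefs[n])

# Example: 4 channels, choose 2.
print(myopic_selection(beliefs=[0.9, 0.4, 0.6, 0.2], rates=[1.0, 1.0, 2.0, 1.0], K=2))
```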
- Given subsidy $m$, the value function $V_{\beta,m}(\omega)$ represents the maximum expected total discounted reward that can be accrued from a single-armed bandit process with subsidy $m$ when the initial belief state is $\omega$.
- dynamic programming (Bellman) equation for this value function
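A sketch of that dynamic program in the standard form for a single Gilbert-Elliott arm with subsidy (my notation: unit rate, discount $\beta$, passive belief update $\tau(\omega)=\omega p_{11}+(1-\omega)p_{01}$):

$$
V_{\beta,m}(\omega)=\max\Big\{\;\underbrace{m+\beta\,V_{\beta,m}\big(\tau(\omega)\big)}_{\text{passive}},\;\underbrace{\omega\big(1+\beta\,V_{\beta,m}(p_{11})\big)+(1-\omega)\,\beta\,V_{\beta,m}(p_{01})}_{\text{active}}\;\Big\}.
$$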
- The optimal policy is essentially given by an optimal partition of the belief state space $[0,1]$ into a passive set $\mathcal{P}(m)$ and an active set $\mathcal{A}(m)$, according to the optimal action under each belief state $\omega$.
- The passive set under subsidy $m$: $\mathcal{P}(m) = \{\omega : \text{the optimal action at belief } \omega \text{ is passive}\}$.
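For reference, the standard definitions in this notation (not necessarily the paper's exact wording): the arm is indexable if the passive set $\mathcal{P}(m)$ grows monotonically from the empty set to the whole belief space $[0,1]$ as the subsidy $m$ increases from $-\infty$ to $+\infty$, and Whittle's index of a belief state is the smallest subsidy that makes the passive action optimal there:

$$
W(\omega)\;=\;\inf\{\,m:\ \omega\in\mathcal{P}(m)\,\}.
$$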
- Whittle’s index policy achieves a near-optimal performance while the myopic policy suffers from a significant performance loss.
WHITTLE’S INDEX UNDER DISCOUNTED REWARD CRITERION
- Properties of Belief State Transition: convergence of the belief state of an unobserved arm to the stationary distribution, for a positively correlated channel ($p_{11} \ge p_{01}$) and a negatively correlated channel ($p_{11} < p_{01}$)
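A small sketch of this property (parameter names are mine): for an unobserved arm the belief obeys the closed-form $k$-step update $\tau^k(\omega)=\omega_o+(p_{11}-p_{01})^k(\omega-\omega_o)$ with stationary belief $\omega_o=p_{01}/(1-p_{11}+p_{01})$, converging monotonically when $p_{11}\ge p_{01}$ and in an oscillating fashion when $p_{11}<p_{01}$:

```python
def tau(omega, p01, p11):
    """One-slot belief update for an unobserved Gilbert-Elliott arm."""
    return omega * p11 + (1 - omega) * p01

def tau_k(omega, p01, p11, k):
    """k-slot belief update in closed form."""
    omega_o = p01 / (1 - p11 + p01)      # stationary probability of the good state
    return omega_o + (p11 - p01) ** k * (omega - omega_o)

# Oscillating convergence for a negatively correlated channel (p11 < p01):
for k in range(6):
    print(k, round(tau_k(0.9, p01=0.8, p11=0.2, k=k), 4))
```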
- The optimal policy for the single-armed bandit process with subsidy $m$ is a threshold policy on the belief state, for both positively and negatively correlated channels.
- closed-form expression for the value function, where a key quantity is the minimum amount of time required for a passive arm's belief to transit across the threshold starting from a given belief state
- Whittle's Index
- Whittle’s index is the subsidy $m$ at which the passive and active actions are equally attractive at the given belief state; it is a monotonically increasing function of the belief $\omega$.
- For a positively correlated channel ($p_{11} \ge p_{01}$), the index is piecewise concave with countably many pieces.
- For a negatively correlated channel ($p_{11} < p_{01}$), the index is piecewise convex with finitely many pieces.
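The paper derives the index in closed form; as an independent sanity-check sketch (not the paper's expression), one can approximate Whittle's index numerically by solving the subsidy-$m$ single-arm dynamic program on a belief grid and bisecting on $m$ until the passive and active actions are indifferent at the target belief:

```python
import numpy as np

def whittle_index(omega, p01, p11, beta=0.8, grid=2001, iters=500, tol=1e-8):
    """Approximate Whittle's index at belief `omega` by bisection on the subsidy m."""
    w = np.linspace(0.0, 1.0, grid)            # discretized belief space
    tau = w * p11 + (1 - w) * p01              # passive (unobserved) belief update

    def values_at(m, x):
        """Solve the subsidy-m DP by value iteration; return (passive, active) values at x."""
        V = np.zeros(grid)
        for _ in range(iters):
            passive = m + beta * np.interp(tau, w, V)
            active = w * (1 + beta * np.interp(p11, w, V)) + (1 - w) * beta * np.interp(p01, w, V)
            V_new = np.maximum(passive, active)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        tx = x * p11 + (1 - x) * p01
        p = m + beta * np.interp(tx, w, V)
        a = x * (1 + beta * np.interp(p11, w, V)) + (1 - x) * beta * np.interp(p01, w, V)
        return p, a

    lo, hi = 0.0, 1.0 / (1.0 - beta)           # generous bracket for the subsidy
    for _ in range(40):
        m = 0.5 * (lo + hi)
        p, a = values_at(m, omega)
        # while the active action is still preferred, the subsidy is below the index
        lo, hi = (m, hi) if p < a else (lo, m)
    return 0.5 * (lo + hi)

print(whittle_index(0.6, p01=0.2, p11=0.8))
```

The bisection relies on the monotone growth of the passive set in $m$, which is exactly what indexability guarantees.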
- Performance
- optimal policy for RMBP under the relaxed constraint.
- upper bound on the optimal performance, given by the relaxed problem (Lagrangian relaxation)
- an efficient algorithm to compute this upper bound
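My reconstruction of the bound's generic form (standard Whittle/Lagrangian relaxation argument, not copied from the paper): for any subsidy $m \ge 0$, the decoupled single-arm values, corrected for the subsidy paid to the $N-K$ passive arms per slot, upper-bound the optimum of the strictly constrained problem, and the bound is tightened by minimizing over $m$:

$$
V^{*}_{\beta}\big(\Omega(1)\big)\;\le\;\min_{m\ge 0}\left[\;\sum_{n=1}^{N} V^{(n)}_{\beta,m}\big(\omega_n(1)\big)\;-\;\frac{m\,(N-K)}{1-\beta}\right].
$$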
WHITTLE’S INDEX POLICY FOR STOCHASTICALLY IDENTICAL CHANNELS
Based on the equivalency between Whittle’s index policy and the myopic policy for stochastically identical arms, we can analyze Whittle’s index policy by focusing on the myopic policy which has a much simpler index form.
- Simplicity of Whittle's index in this case
- channel selection is reduced to maintaining a simple queue structure that requires no computation and little memory.
- a semi-universal structure: the policy can be implemented without knowing the channel transition probabilities except the order of $p_{11}$ and $p_{01}$
- shows robustness against model mismatch and variations
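A sketch of the queue structure mentioned above as I understand it (stochastically identical channels; only whether $p_{11}\ge p_{01}$ is needed): keep the channels in a list ordered by decreasing belief, sense the first $K$, and reorder using only the observations.

```python
def select_and_reorder(queue, K, observe, positively_correlated):
    """One slot of the semi-universal policy for stochastically identical channels.
    `queue`: channel ids ordered by decreasing belief; `observe(ch)` returns 1/0."""
    sensed, rest = queue[:K], queue[K:]
    good = [ch for ch in sensed if observe(ch) == 1]
    bad = [ch for ch in sensed if ch not in good]
    if positively_correlated:
        # p11 >= p01: good -> head (belief p11), unobserved keep their order, bad -> tail (belief p01)
        return good + rest + bad
    # p11 < p01: bad -> head (belief p01), unobserved order reverses, good -> tail (belief p11)
    return bad + rest[::-1] + good

# One example round with 5 channels and K = 2, hypothetical observations:
obs = {0: 1, 1: 0}
print(select_and_reorder([0, 1, 2, 3, 4], K=2, observe=obs.get, positively_correlated=True))
```

The reordering follows directly from the belief updates: observed-good beliefs become $p_{11}$, observed-bad beliefs become $p_{01}$, and unobserved beliefs move toward $\omega_o$, preserving their order when $p_{11}\ge p_{01}$ and reversing it otherwise, which is why only the order of $p_{11}$ and $p_{01}$ matters.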
- Performance
- For negatively correlated channels, Whittle’s index policy achieves at least half the optimal performance.
- For positively correlated channels, the approximation factor can be further improved under certain conditions on the transition probabilities.
Notes
Comment: submitted to IEEE Transactions on Information Theory
2024-09-24 | Zotero
Memo
- 📗 RMBP
- 📗 dynamic multichannel access
- 📗 we establish the indexability and obtain Whittle’s index in closed-form for both discounted and average reward criteria
- 🏷 When these Markov chains are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions
- 📗 optimality and the semi-universal structure result
- 📗 For non-identical channels, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation
- 📘 Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed Bandit Processes (MBP)
- 📘 Closed-form Expression of