Indexability of Restless Bandit Problems and Optimality of Whittle's Index for Dynamic Multichannel Access
We consider a class of restless multi-armed bandit problems (RMBP) that arises in dynamic multichannel access, user/server scheduling, and optimal activation in multi-agent systems. For this class of RMBP, we establish the indexability and obtain Whittle’s index in closed-form for both discounted and average reward criteria. These results lead to a direct implementation of Whittle’s index policy with remarkably low complexity. When these Markov chains are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions. Furthermore, it has a semi-universal structure that obviates the need to know the Markov transition probabilities. The optimality and the semi-universal structure result from the equivalency between Whittle’s index policy and the myopic policy established in this work. For non-identical channels, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation. The tightness of the upper bound and the near-optimal performance of Whittle’s index policy are illustrated with simulation examples.
Main Contribution
- establish the indexability and obtain Whittle’s index in closed-form
- optimality of the Whittle index (WI) policy under certain conditions (in general, only under the relaxed constraint)
- equivalency between Whittle’s index policy and the myopic policy
- robustness under stochastically identical arms (no need for the exact transition probability values; only the order of $p_{11}$ and $p_{01}$ is required)
Details
Introduction
- Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed Bandit Processes (MBP)
- Whittle's index policy is the optimal solution to RMBP under a relaxed constraint: the number of activated arms can vary over time provided that its average over the infinite horizon equals $K$.
- Dynamic Multichannel Access: probing $N$ independent Markov chains; the objective is to design an optimal policy that governs the selection of $K$ chains at each time to maximize the long-run reward (a minimal simulation sketch follows below).
  - The above general problem arises in a wide range of communication systems, including cognitive radio networks, downlink scheduling in cellular systems, opportunistic transmission over fading channels, and resource-constrained jamming and anti-jamming.
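A minimal sketch of this setup (assuming unit transmission rates and identical channels; the parameter names `p01`, `p11` and the random-selection baseline are my own, not the paper's):

```python
import random

def step(state, p01, p11):
    """Advance one Gilbert-Elliott channel by one slot (1 = good, 0 = bad)."""
    return int(random.random() < (p11 if state == 1 else p01))

def simulate(N=5, K=2, T=10_000, p01=0.3, p11=0.7):
    """Select K of N channels uniformly at random each slot; the reward is the
    number of selected channels found in the good state."""
    states = [random.randint(0, 1) for _ in range(N)]
    total = 0
    for _ in range(T):
        chosen = random.sample(range(N), K)
        total += sum(states[n] for n in chosen)
        states = [step(s, p01, p11) for s in states]
    return total / T  # average reward per slot of the random baseline

print(simulate())
```

Any sensing policy discussed in these notes (myopic, Whittle's index) would replace the random `chosen` line.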
- when arms are stochastically identical, bounds on the approximation factor of Whittle's index policy: at least 1/2 for negatively correlated channels, with further improvement for positively correlated channels under certain conditions (see Performance below)
- an efficient algorithm to compute the performance upper bound
- when arms are stochastically identical, Whittle's index policy has a semi-universal structure that obviates the need to know the Markov transition probabilities; the only required knowledge about the Markovian model is the order of $p_{11}$ and $p_{01}$. This semi-universal structure reveals the robustness of Whittle's index policy against model mismatch and variations.
PROBLEM STATEMENT AND RESTLESS BANDIT FORMULATION
- $N$ independent Gilbert-Elliott channels, each with its own transmission rate; in each slot $t$, a set of $K$ channels is chosen for sensing and access.
- policy: a function that maps the belief vector to the action (the set of channels to access) in slot $t$.
- performance measure: the expected total discounted reward over the infinite horizon; the RMBP with the average reward criterion is also considered.
- Note: the formulation here maximizes reward rather than minimizes cost, so the subsidy introduced later is paid to the passive arm.
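A sketch of the discounted objective in my own notation ($\beta$ the discount factor, $A(t)$ the set of $K$ channels chosen in slot $t$, $B_n$ the rate of channel $n$, $S_n(t)\in\{0,1\}$ its state, $\Omega(1)$ the initial belief vector; the paper's symbols may differ):

$$
\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{\,t-1}\sum_{n\in A(t)} B_n\,S_n(t)\;\middle|\;\Omega(1)\right],\qquad |A(t)| = K,
$$

with the average-reward criterion obtained by replacing the discounted sum with the long-run time average.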
INDEXABILITY AND INDEX POLICIES
An index policy assigns an index to each state of each arm to measure how rewarding it is to activate the arm at that state.
A myopic policy is a simple example of strongly decomposable index policies. This policy ignores the impact of the current action on the future reward, focusing solely on maximizing the expected immediate reward.
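A minimal sketch of the myopic rule for this problem (my own illustration: with belief $\omega_n$ that channel $n$ is good and rate $B_n$, pick the $K$ channels with the largest expected immediate reward $B_n\omega_n$):

```python
import heapq

def myopic_selection(beliefs, rates, K):
    """Return the indices of the K channels with the largest expected
    immediate reward B_n * omega_n (ties broken arbitrarily)."""
    return heapq.nlargest(K, range(len(beliefs)), key=lambda n: rates[n] * beliefs[n])

# Example: 4 channels, choose 2.
print(myopic_selection(beliefs=[0.9, 0.4, 0.6, 0.2], rates=[1.0, 1.0, 2.0, 1.0], K=2))
```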
- Given subsidy $m$, the value function $V_{\beta,m}(\omega)$ represents the maximum expected total discounted reward that can be accrued from a single-armed bandit process with subsidy $m$ when the initial belief state is $\omega$.
- dynamic programming (Bellman) equation for this value function
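A sketch of that dynamic program in the standard form for a single Gilbert-Elliott arm with subsidy (my notation: unit rate, discount $\beta$, passive belief update $\tau(\omega)=\omega p_{11}+(1-\omega)p_{01}$):

$$
V_{\beta,m}(\omega)=\max\Big\{\;\underbrace{m+\beta\,V_{\beta,m}\big(\tau(\omega)\big)}_{\text{passive}},\;\underbrace{\omega\big(1+\beta\,V_{\beta,m}(p_{11})\big)+(1-\omega)\,\beta\,V_{\beta,m}(p_{01})}_{\text{active}}\;\Big\}.
$$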
- The optimal policy is essentially given by an optimal partition of the belief state space $[0,1]$ into a passive set $\mathcal{P}(m)$ and an active set $\mathcal{A}(m)$, according to the optimal action under each belief state $\omega$.
- The passive set under subsidy $m$: $\mathcal{P}(m) = \{\omega : \text{the optimal action at belief } \omega \text{ is passive}\}$.
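For reference, the standard definitions in this notation (not necessarily the paper's exact wording): the arm is indexable if the passive set $\mathcal{P}(m)$ grows monotonically from the empty set to the whole belief space $[0,1]$ as the subsidy $m$ increases from $-\infty$ to $+\infty$, and Whittle's index of a belief state is the smallest subsidy that makes the passive action optimal there:

$$
W(\omega)\;=\;\inf\{\,m:\ \omega\in\mathcal{P}(m)\,\}.
$$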
- Whittle’s index policy achieves a near-optimal performance while the myopic policy suffers from a significant performance loss.
WHITTLE’S INDEX UNDER DISCOUNTED REWARD CRITERION
- Properties of Belief State Transition: convergence of the belief state of an unobserved arm to the stationary distribution, for a positively correlated channel ($p_{11} \ge p_{01}$) and a negatively correlated channel ($p_{11} < p_{01}$)
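A small sketch of this property (parameter names are mine): for an unobserved arm the belief obeys the closed-form $k$-step update $\tau^k(\omega)=\omega_o+(p_{11}-p_{01})^k(\omega-\omega_o)$ with stationary belief $\omega_o=p_{01}/(1-p_{11}+p_{01})$, converging monotonically when $p_{11}\ge p_{01}$ and in an oscillating fashion when $p_{11}<p_{01}$:

```python
def tau(omega, p01, p11):
    """One-slot belief update for an unobserved Gilbert-Elliott arm."""
    return omega * p11 + (1 - omega) * p01

def tau_k(omega, p01, p11, k):
    """k-slot belief update in closed form."""
    omega_o = p01 / (1 - p11 + p01)      # stationary probability of the good state
    return omega_o + (p11 - p01) ** k * (omega - omega_o)

# Oscillating convergence for a negatively correlated channel (p11 < p01):
for k in range(6):
    print(k, round(tau_k(0.9, p01=0.8, p11=0.2, k=k), 4))
```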
- The optimal policy for the single-armed bandit process with subsidy $m$ is a threshold policy on the belief state, for both positively and negatively correlated channels.
- closed-form expression for the value function, where a key quantity is the minimum amount of time required for a passive arm's belief to transit across the threshold starting from a given belief state
- Whittle's Index
- Whittle’s index is the subsidy $m$ at which the passive and active actions are equally attractive at the given belief state; it is a monotonically increasing function of the belief $\omega$.
- For a positively correlated channel ($p_{11} \ge p_{01}$), the index is piecewise concave with countably many pieces.
- For a negatively correlated channel ($p_{11} < p_{01}$), the index is piecewise convex with finitely many pieces.
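The paper derives the index in closed form; as an independent sanity-check sketch (not the paper's expression), one can approximate Whittle's index numerically by solving the subsidy-$m$ single-arm dynamic program on a belief grid and bisecting on $m$ until the passive and active actions are indifferent at the target belief:

```python
import numpy as np

def whittle_index(omega, p01, p11, beta=0.8, grid=2001, iters=500, tol=1e-8):
    """Approximate Whittle's index at belief `omega` by bisection on the subsidy m."""
    w = np.linspace(0.0, 1.0, grid)            # discretized belief space
    tau = w * p11 + (1 - w) * p01              # passive (unobserved) belief update

    def values_at(m, x):
        """Solve the subsidy-m DP by value iteration; return (passive, active) values at x."""
        V = np.zeros(grid)
        for _ in range(iters):
            passive = m + beta * np.interp(tau, w, V)
            active = w * (1 + beta * np.interp(p11, w, V)) + (1 - w) * beta * np.interp(p01, w, V)
            V_new = np.maximum(passive, active)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        tx = x * p11 + (1 - x) * p01
        p = m + beta * np.interp(tx, w, V)
        a = x * (1 + beta * np.interp(p11, w, V)) + (1 - x) * beta * np.interp(p01, w, V)
        return p, a

    lo, hi = 0.0, 1.0 / (1.0 - beta)           # generous bracket for the subsidy
    for _ in range(40):
        m = 0.5 * (lo + hi)
        p, a = values_at(m, omega)
        # while the active action is still preferred, the subsidy is below the index
        lo, hi = (m, hi) if p < a else (lo, m)
    return 0.5 * (lo + hi)

print(whittle_index(0.6, p01=0.2, p11=0.8))
```

The bisection relies on the monotone growth of the passive set in $m$, which is exactly what indexability guarantees.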
- Performance
- optimal policy for RMBP under the relaxed constraint.
- upper bound on the optimal performance, given by the relaxed problem (Lagrangian relaxation)
- an efficient algorithm to compute this upper bound
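My reconstruction of the bound's generic form (standard Whittle/Lagrangian relaxation argument, not copied from the paper): for any subsidy $m \ge 0$, the decoupled single-arm values, corrected for the subsidy paid to the $N-K$ passive arms per slot, upper-bound the optimum of the strictly constrained problem, and the bound is tightened by minimizing over $m$:

$$
V^{*}_{\beta}\big(\Omega(1)\big)\;\le\;\min_{m\ge 0}\left[\;\sum_{n=1}^{N} V^{(n)}_{\beta,m}\big(\omega_n(1)\big)\;-\;\frac{m\,(N-K)}{1-\beta}\right].
$$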
WHITTLE’S INDEX POLICY FOR STOCHASTICALLY IDENTICAL CHANNELS
Based on the equivalency between Whittle’s index policy and the myopic policy for stochastically identical arms, we can analyze Whittle’s index policy by focusing on the myopic policy which has a much simpler index form.
- Simplicity of Whittle's index in this case
- channel selection is reduced to maintaining a simple queue structure that requires no computation and little memory.
- a semi-universal structure: the policy can be implemented without knowing the channel transition probabilities except the order of $p_{11}$ and $p_{01}$
- shows robustness against model mismatch and variations
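A sketch of the queue structure mentioned above as I understand it (stochastically identical channels; only whether $p_{11}\ge p_{01}$ is needed): keep the channels in a list ordered by decreasing belief, sense the first $K$, and reorder using only the observations.

```python
def select_and_reorder(queue, K, observe, positively_correlated):
    """One slot of the semi-universal policy for stochastically identical channels.
    `queue`: channel ids ordered by decreasing belief; `observe(ch)` returns 1/0."""
    sensed, rest = queue[:K], queue[K:]
    good = [ch for ch in sensed if observe(ch) == 1]
    bad = [ch for ch in sensed if ch not in good]
    if positively_correlated:
        # p11 >= p01: good -> head (belief p11), unobserved keep their order, bad -> tail (belief p01)
        return good + rest + bad
    # p11 < p01: bad -> head (belief p01), unobserved order reverses, good -> tail (belief p11)
    return bad + rest[::-1] + good

# One example round with 5 channels and K = 2, hypothetical observations:
obs = {0: 1, 1: 0}
print(select_and_reorder([0, 1, 2, 3, 4], K=2, observe=obs.get, positively_correlated=True))
```

The reordering follows directly from the belief updates: observed-good beliefs become $p_{11}$, observed-bad beliefs become $p_{01}$, and unobserved beliefs move toward $\omega_o$, preserving their order when $p_{11}\ge p_{01}$ and reversing it otherwise, which is why only the order of $p_{11}$ and $p_{01}$ matters.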
- Performance
- For negatively correlated channels, Whittle’s index policy achieves at least half the optimal performance.
- For positively correlated channels, the approximation factor can be further improved under certain conditions on the transition probabilities.
Notes
Comment: submitted to IEEE Transactions on Information Theory
2024-09-24 | Zotero
Memo
- 📗 RMBP
- 📗 dynamic multichannel access
- 📗 we establish the indexability and obtain Whittle’s index in closed-form for both discounted and average reward criteria
- 🏷 When these Markov chains are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions
- 📗 optimality and the semi-universal structure result
- 📗 For non-identical channels, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation
- 📘 Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed Bandit Processes (MBP)
- 📘 Closed-form Expression of