Talk Abstract: The Whittle index policy is a heuristic that has shown remarkably good performance, and guaranteed asymptotic optimality, when applied to the class of hard problems known as Restless Multi-Armed Bandit Problems (RMABPs). Applications of RMABPs include machine maintenance, wireless channel scheduling, A/B testing, and clinical trials, to name a few. The RMABP is a classical setting in which a decision-maker must balance exploration and exploitation. We present two approaches (tabular and neural-network-based) for learning the Whittle indices. The key feature of our approaches is the use of two time-scales: a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. The neural-network-based approach computes the Q-values on the faster time-scale and is able to extrapolate information from one state to another, which makes it naturally scalable to environments with large state spaces. We present both a theoretical convergence analysis and illustrations via numerical examples.
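To make the two-time-scale idea concrete, here is a minimal tabular sketch for a single restless arm. The environment (a hypothetical 2-state arm with illustrative transition probabilities) and the constant step sizes are assumptions for the sketch; the actual algorithm in the talk uses appropriately decaying step sizes to obtain convergence guarantees. The Q-values are updated with a fast step size, while the Whittle index estimates, which act as a subsidy for the passive action, are updated with a slower one.

```python
import random

# Hypothetical 2-state restless arm (numbers are illustrative only).
# P[a][s] = probability of moving to state 1 under action a (0=passive, 1=active).
P = {0: [0.1, 0.9], 1: [0.6, 0.95]}
R = [0.0, 1.0]  # reward depends only on the current state

def step(s, a):
    s_next = 1 if random.random() < P[a][s] else 0
    return R[s], s_next

random.seed(0)
n_states = 2
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[s][a]
lam = [0.0] * n_states                      # Whittle index estimates, one per state
alpha, beta = 0.1, 0.001                    # fast / slow step sizes (constant here for simplicity)
gamma = 0.9                                 # discount factor

s = 0
for t in range(200_000):
    a = random.randint(0, 1)  # uniform exploration, for the sketch only
    r, s_next = step(s, a)
    # Whittle's relaxation: the passive action earns an extra subsidy lam[s].
    subsidy = lam[s] if a == 0 else 0.0
    target = r + subsidy + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])   # fast time-scale: Q-learning update
    lam[s] += beta * (Q[s][1] - Q[s][0])    # slow time-scale: drive the index toward
                                            # the subsidy making both actions indifferent
    s = s_next

print([round(x, 3) for x in lam])
```

At convergence, `lam[s]` approximates the subsidy at which the active and passive actions are equally attractive in state `s`, i.e. the Whittle index. In this toy arm, activating helps much more in state 0 (reaching the rewarding state) than in state 1 (where the passive action already retains it with high probability), so the learned index of state 0 ends up larger.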
