The Gittins Index

A multi-arm bandit problem is a sequential decision-making problem in which, at each time step, one must select among $n$ choices (arms) that yield some unknown reward. Markovian multi-arm bandits are those processes in which the rewards and states of the arms evolve according to a Markov process. In the Markovian framework, the Gittins index theorem proves that the optimal policy, with respect to expected total discounted reward, is to pull the arm with the highest Gittins index at each time step. This remarkable result transformed a problem hitherto solvable only in exponential time into one that can be solved in polynomial time. In this paper, we further elucidate a proof of the Gittins index theorem and discuss its implications for Markovian multi-arm bandits.

Introduction

“The Multi-Arm Bandit problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.” - Peter Whittle

Problem Description

Consider a student who is faced with two assignments due the following morning. The first assignment is a topic paper worth 10\% of the final grade, and the second assignment is a problem set worth roughly 3\% of the final grade. The student is uncertain about how much time each assignment will take, and thus must decide how to properly allocate his time. The student could spend all of his time on the topic paper, but then he risks not completing the problem set. Alternatively, he could finish the problem set but not leave enough time to write an adequate paper. Perhaps he could work a little on each assignment and reallocate his time as he learns more about the difficulty of each. This is the essence of the multi-arm bandit problem.


Definition: A Markovian multi-armed bandit process is a tuple $(\mathcal{A}, \mathcal{S}, \mathcal{P}, R, \gamma)$, where:

- $\mathcal{A} = \{A_1, \ldots, A_n\}$ is a finite set of $n$ arms,
- $\mathcal{S} = I_1 \times \cdots \times I_n$ is the global state space, where $I_i$ is the state space of arm $A_i$,
- $\mathcal{P} = \{P_1, \ldots, P_n\}$, where $P_i(\cdot \mid \cdot)$ is the transition kernel of arm $A_i$ on $I_i$,
- $R = \{r_1, \ldots, r_n\}$, where $r_i : I_i \to \mathbb{R}$ is the reward function of arm $A_i$, and
- $\gamma \in (0, 1)$ is the discount factor.

Note that $s = (s^{(1)}, s^{(2)}, \ldots, s^{(n)}) \in \mathcal{S}$ is the global state of the bandit process. The process evolves as follows: at decision time $t$, we choose to pull arm $i$, which is in state $s_t^{(i)} \in I_i$, and we collect reward $r_i(s_t^{(i)})$. Next, $A_i$ transitions to state $s_{t+1}^{(i)}$ according to $P_i(s_{t+1}^{(i)} \mid s_t^{(i)})$. The other arms do not change state, that is, $s_{t+1}^{(j)} = s_{t}^{(j)}$ for all $j \neq i$.

Let $i_t$ denote the arm that is chosen at time $t$. In general, we aim to select a policy $\pi: \mathcal{S} \to \{1, \dots, n\}$ such that the value function

\[V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_{i_t}(s_t^{(i_t)}) \mid s_0 = s\right]\]

is maximized for all $s \in \mathcal{S}$. That is, $\pi$ chooses which arm $A_i$ to pull given the current global state $s_t$, and under $\pi$ we maximize the expected total discounted reward.
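To make the model concrete, here is a minimal simulation sketch, assuming each arm is given by a finite transition matrix `P` and reward vector `r` as above; the helper names (`pull`, `mc_value`) and the Monte Carlo truncation are our own illustrative choices, not part of the formal model.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(P, r, s):
    """One step of the dynamics above: collect reward r[s], then transition according to row P[s]."""
    return r[s], rng.choice(len(r), p=P[s])

def mc_value(arms, s0, policy, gamma=0.9, horizon=200, n_runs=1_000):
    """Monte Carlo estimate of V^pi(s0).  `policy` maps the global state (a list of
    per-arm states) to the index of the arm to pull; the other arms stay frozen.
    The infinite sum is truncated at `horizon`, beyond which gamma^t is negligible."""
    total = 0.0
    for _ in range(n_runs):
        s, ret = list(s0), 0.0
        for t in range(horizon):
            i = policy(s)
            reward, s[i] = pull(*arms[i], s[i])
            ret += gamma**t * reward
        total += ret
    return total / n_runs
```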

Importance of the Problem

The dilemma the student above faces notwithstanding, there are many real-world applications of the multi-arm bandit problem. Classic examples include clinical trials, in which patients must be allocated among treatments of unknown efficacy, and the scheduling of jobs of unknown duration on a single machine.

Other than the practical applications, the multi-arm bandit problem is also of theoretical interest. As mentioned, prior to the Gittins index theorem, the multi-arm bandit problem was considered computationally intractable. To see this, consider the dynamic programming solution

\[V^\star(s) = \max_{i} \left [r_i(s^{(i)}) + \gamma \sum_{x \in I_i} P_i(x \mid s^{(i)})\, V^\star\!\left(s^{(1)}, \ldots, x, \ldots, s^{(n)}\right)\right]\]

Suppose $k = \prod_{i=1}^n |I_i|$ is the number of global states of the Markovian multi-arm bandit. Since a deterministic stationary policy assigns one of the $n$ arms to each of these $k$ states, the number of possible policies is $n^k$, and $k$ itself grows exponentially in the number of arms. Thus, solving the problem by dynamic programming over the global state space is exponential in the number of arms. The Gittins index theorem reduces this burden by “solving” each arm independently and then combining the solutions, which can be done in time polynomial in the sizes of the individual state spaces $|I_i|$.
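To get a feel for the blow-up, here is a quick illustrative computation (the arm sizes are chosen arbitrarily):

```python
# n arms with m states each: the joint bandit has k = m**n states and n**k
# deterministic stationary policies, whereas the arms together have only n*m states.
for n, m in [(2, 5), (5, 5), (10, 5)]:
    k = m ** n
    print(f"n={n}, m={m}: joint states k={k:,}, policies n^k = {n}^{k:,}, per-arm states n*m = {n * m}")
```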

Gittins Index Theorem

Definition: For any discounted Markovian multi-arm bandit, the Gittins index of arm $A_j$ in state $s^{(j)}$ is defined by

\[G_j(s^{(j)}) = \sup\left\{ \alpha : \sup_{\tau} \mathbb{E}\left[\sum_{t=0}^{\tau - 1} \gamma^t \left[r_j\left(s_t^{(j)}\right) - \alpha \right] \mid s_0^{(j)} = s^{(j)}\right] \geq 0 \right\}\]

where $\tau \geq 1$ is a stopping time.
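To make this definition concrete, the following sketch computes the Gittins index of a single finite-state arm directly from the definition, assuming the arm is specified by a transition matrix `P` and reward vector `r` (NumPy arrays). The approach, binary search over the charge $\alpha$ with the inner optimal-stopping problem solved by value iteration, is one standard way to evaluate the index numerically; the function names are our own.

```python
import numpy as np

def gittins_index(P, r, gamma, s0, tol=1e-8):
    """Gittins index of state s0 for a single arm with m x m transition matrix P,
    length-m reward vector r, and discount factor gamma in (0, 1)."""
    m = len(r)

    def stopping_value(alpha):
        # Value of sup_{tau >= 1} E[ sum_{t < tau} gamma^t (r(s_t) - alpha) | s_0 = s0 ],
        # computed by value iteration on Q(s) = r(s) - alpha + gamma * sum_s' P[s, s'] max(0, Q(s')):
        # the first pull is mandatory (tau >= 1); afterwards one may stop (value 0) or continue.
        Q = np.zeros(m)
        for _ in range(10_000):
            Q_new = r - alpha + gamma * P @ np.maximum(Q, 0.0)
            if np.max(np.abs(Q_new - Q)) < tol:
                break
            Q = Q_new
        return Q_new[s0]

    # The index lies between the smallest and largest one-step reward; bisect on alpha,
    # using the fact that stopping_value is non-increasing in alpha.
    lo, hi = float(np.min(r)), float(np.max(r))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stopping_value(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: state 0 pays 0 and moves to state 1; state 1 pays 1 and stays put.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
r = np.array([0.0, 1.0])
print(gittins_index(P, r, gamma=0.9, s0=0))   # approximately 0.9 (= gamma * 1)
print(gittins_index(P, r, gamma=0.9, s0=1))   # approximately 1.0
```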


Let us provide some motivation for this definition. Suppose we have a simple bandit process $S(\alpha)$ that has one state and pays $\alpha$ each time we pull the arm. If we pull the arm at each time step, the value of the process is

\[\alpha\sum_{t=0}^\infty \gamma^t = \frac{\alpha}{1 - \gamma}\]

Now, let’s introduce a competing arm, $A_i$, and suppose $r_i \geq 0$ and $\sup r_i < \infty$. Suppose we start by pulling this arm at time $0$ and follow an optimal policy thereafter. The policy may decide to pull the arm $S(\alpha)$ at some future decision time $\tau$. If this occurs, the information about $A_i$ at time $\tau + 1$ is the same as at time $\tau$, since pulling $S(\alpha)$ does not change the state of $A_i$, and thus it must be optimal to continue pulling $S(\alpha)$ for all times $t \geq \tau$. Hence, the maximal payoff is

\[\sup_{\tau > 0} \mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t r_i(s_t^{(i)}) + \gamma^\tau \frac{\alpha}{1 - \gamma} \mid s_0^{(i)} = s^{(i)}\right ]\]

The key step is to find the $\alpha$ at which one is indifferent between pulling $S(\alpha)$ at time $0$ and pulling $A_i$ at time $0$. That is, we seek $\alpha$ such that

\[\begin{aligned} 0 &= \sup_{\tau > 0}\mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t r_i(s_t^{(i)}) + \gamma^\tau \frac{\alpha}{1 - \gamma} \mid s_0^{(i)} = s^{(i)}\right ] - \frac{\alpha}{1 - \gamma} \\ &=\sup_{\tau > 0} \mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t r_i(s_t^{(i)}) -(1- \gamma^\tau) \frac{\alpha}{1 - \gamma} \mid s_0^{(i)} = s^{(i)}\right ] \\ &=\sup_{\tau > 0} \mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t r_i(s_t^{(i)}) -\alpha\sum_{t = 0}^{\tau - 1} \gamma^t \mid s_0^{(i)} = s^{(i)}\right ] \quad \text{finite geometric series} \\ &=\sup_{\tau > 0} \mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t \left [r_i(s_t^{(i)}) -\alpha \right ] \mid s_0^{(i)} = s^{(i)}\right ] \end{aligned}\]

For fixed $\tau$, this is a linear function of $\alpha$ with slope at most $-1$; thus the supremum over $\tau$ is a convex function of $\alpha$ that decreases at rate at least $1$. Moreover, since $r_i$ is bounded, a root exists, and this strict monotonicity implies that the root is unique. Finally, we can express this root as

\[\sup \left \{ \alpha : \sup_{\tau} \mathbb{E} \left [ \sum_{t=0}^{\tau - 1} \gamma^t \left [r_i(s_t^{(i)}) -\alpha \right ] \mid s_0^{(i)} = s^{(i)}\right ] \geq 0 \right \}\]

Notice that this is exactly the Gittins index of arm $A_i$ in state $s^{(i)}$, namely $G_i(s^{(i)})$. Thus, the Gittins index has an economic interpretation: it is the constant per-pull payment at which one is indifferent between playing the arm optimally and collecting that payment (discounted) forever, starting at time $0$.
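As a concrete check of this interpretation, consider a simple deterministic arm that pays $0$ on its first pull and $r > 0$ on every pull thereafter. For any charge $\alpha < r$ the supremum over stopping times is attained by never stopping, so the indifference condition reads

\[-\alpha + \sum_{t=1}^{\infty}\gamma^t (r - \alpha) = -\alpha + \frac{\gamma(r - \alpha)}{1 - \gamma} = 0,\]

which gives a Gittins index of $\gamma r$: the unrewarded first pull discounts the entire reward stream by one period, so this arm is worth slightly less per pull than an arm that pays $r$ immediately and forever, whose index is exactly $r$.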


Gittins Index Theorem: For any discounted Markovian multi-arm bandit with finitely many arms and bounded rewards, a policy is optimal if and only if it selects

\[i_t = \arg\max_{j} G_j(s_t^{(j)})\]

at each decision time $t$. That is, the optimal policy selects the arm with the highest Gittins index at each decision time.
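Given the `gittins_index` sketch above, the resulting policy is a one-liner; the sketch below (the name `index_policy` is ours) simply evaluates the index of each arm's current state and pulls the argmax.

```python
import numpy as np

def index_policy(arms, states, gamma):
    """arms: list of (P, r) pairs, one per arm; states: current state of each arm.
    Returns the arm to pull at this decision time, per the Gittins index theorem."""
    indices = [gittins_index(P, r, gamma, s) for (P, r), s in zip(arms, states)]
    return int(np.argmax(indices))
```

Since the index of an arm depends only on that arm's own state, in practice one would precompute $G_j(x)$ for every $x \in I_j$ once and reduce the per-step work to comparing $n$ numbers.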


There are many proofs of this theorem, but we attempt to formalize the argument given in Weber 1992 by introducing mathematical notation that is absent in the original proof.

First, let us consider a single bandit with one arm, $A_i$, which may be pulled or stopped permanently at each decision time $t$. Second, let us introduce a cost for each pull, which we call the prevailing charge. Next, we define the fair charge, $\alpha_i(s)$, $s \in I_i$, as the value of the prevailing charge at which one is indifferent between pulling the arm (and playing optimally thereafter) and stopping permanently. Observe that the fair charge is exactly the Gittins index of arm $A_i$ in state $s$.

The argument proceeds as follows. Let a gambler begin playing by pulling the arm at time $0$, with the prevailing charge set equal to the fair charge. Eventually, it may become optimal to stop pulling the arm; that is, the prevailing charge comes to exceed the fair charge. At this point, we reduce the prevailing charge to the new fair charge, so that the gambler is again playing a fair game and is free to continue pulling the arm. Let

\[\{\alpha_i(s_t^{(i)})\}_{t\geq 0}\]

be the sequence of fair charges generated by the arm $A_i$. We continue this process of reducing the prevailing charge to the fair charge whenever the gambler would otherwise stop playing. Then, we can define the sequence of prevailing charges as

\[\alpha_i^\star(s_t^{(i)}) = \min_{u\leq t} \alpha_i(s_u^{(i)})\]
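As a small numeric illustration of this running-minimum construction (the sequence of fair charges below is made up purely for illustration):

```python
import numpy as np

# A made-up sequence of fair charges alpha_i(s_t) observed while repeatedly pulling one arm.
fair_charges = np.array([5.0, 6.2, 4.1, 4.8, 3.0, 3.5])

# The prevailing charge is the running minimum of the fair charges seen so far.
prevailing_charges = np.minimum.accumulate(fair_charges)
print(prevailing_charges)   # 5.0, 5.0, 4.1, 4.1, 3.0, 3.0 -- non-increasing by construction
```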

By construction, it is then optimal to play forever, and the expected total profit is $0$. However, at $t = 0$, and at all times $t$ such that $\alpha_i(s_t^{(i)}) = \alpha_i^\star(s_t^{(i)})$, it is also optimal to stop playing. Clearly, one should not stop playing while the fair charge exceeds the prevailing charge. Thus, we have the following inequality.


Lemma: For any stopping time $T$,

\[\mathbb{E} \left [ \sum_{t=0}^{T - 1} \gamma^t r_i(s_t^{(i)}) | s_0^{(i)} = s^{(i)}\right ] \leq \mathbb{E} \left [ \sum_{t=0}^{T - 1} \gamma^t \alpha_i^\star(s_t^{(i)}) | s_0^{(i)} = s^{(i)}\right ]\]

with equality if and only if $\alpha_i^\star(s_T^{(i)}) = \alpha_i(s_T^{(i)})$.


Proof: Since the prevailing charge is reset to the fair charge every time it would otherwise become suboptimal to play, it suffices to consider the last interval, starting at some $t < T$, on which the prevailing charge was last reduced and is therefore constant. For simplicity, we shift the time index so that this interval starts at $t = 0$ and relabel $T - t$ as $T$. The fair charge $\alpha_i(s_0^{(i)})$ determines an optimal stopping time $T^\star$, attaining the supremum in its definition, for which the gambler’s expected profit is $0$. Moreover, at time $0$, $\alpha_i^\star(s_0^{(i)}) = \alpha_i(s_0^{(i)})$. Then, since $T$ is itself a stopping time while $T^\star$ attains the supremum,

\[\begin{aligned} \mathbb{E} \left [ \sum_{t=0}^{T-1} \gamma^t \left [r_i(s_t^{(i)}) - \alpha_i^\star(s_0^{(i)})\right ] | s_0^{(i)} = s^{(i)}\right ] &\leq \mathbb{E} \left [ \sum_{t=0}^{T^\star-1} \gamma^t \left [r_i(s_t^{(i)}) - \alpha_i^\star(s_0^{(i)})\right ] | s_0^{(i)} = s^{(i)}\right ] \\ &= \mathbb{E} \left [ \sum_{t=0}^{T^\star-1} \gamma^t \left [r_i(s_t^{(i)}) - \alpha_i(s_0^{(i)})\right ] | s_0^{(i)} = s^{(i)}\right ] \\ &= 0 \end{aligned}\]

The final equality follows directly from the definition of the fair charge, and equality holds throughout when $T = T^\star$, that is, when the gambler stops at a time at which the prevailing charge equals the fair charge. The lemma is proved.

With this lemma in hand, we now consider $n$ arms $A_1, \dots, A_n$. Let us set the initial prevailing charge of each arm to its fair charge and consider the $n$ independent sequences $\alpha_1^\star(s_t^{(1)}), \dots, \alpha_n^\star(s_t^{(n)})$ of prevailing charges generated by the reduction process described above. Let $\pi$ be the proposed policy. That is, at each decision time $t$, the policy $\pi$ chooses

\[i_t = \arg\max_{j} \alpha_j^\star(s_t^{(j)})\]

Note that under this rule the arm being pulled always has the greatest fair charge (Gittins index) as well: each idle arm's prevailing charge equals its current fair charge, while the pulled arm's fair charge is at least its prevailing charge, so $\pi$ is precisely the Gittins index policy of the theorem. To show that $\pi$ is optimal, consider an alternative policy $\sigma$ that chooses arms $j_t$ according to some other rule. By the above lemma, we have

\[\begin{aligned} \mathbb{E}_{\sigma} \left [ \sum_{t=0}^{\infty} \gamma^t r_{j_t}(s_t^{(j_t)}) | s_0 = s \right ] &\leq \mathbb{E}_{\sigma} \left [ \sum_{t=0}^{\infty} \gamma^t \alpha_{j_t}^\star(s_t^{(j_t)}) | s_0 = s\right ] \\ &\leq \mathbb{E}_{\pi} \left [ \sum_{t=0}^{\infty} \gamma^t \alpha_{i_t}^\star(s_t^{(i_t)}) | s_0 = s\right ] \\ &= \mathbb{E}_{\pi} \left [ \sum_{t=0}^{\infty} \gamma^t r_{i_t}(s_t^{(i_t)}) | s_0 = s \right ] \end{aligned}\]

where the first inequality follows from applying the lemma to each arm, and the second from the following observation: the sequence of prevailing charges generated by an arm depends only on how many times that arm has been pulled, not on when those pulls occur, so $\pi$ and $\sigma$ interleave the same $n$ non-increasing sequences; by taking the greatest available charge at each decision time, $\pi$ interleaves them in decreasing order, which maximizes the discounted sum over all interleavings. The final equality holds because $\pi$ switches away from an arm only at times when that arm's prevailing charge equals its fair charge, which is precisely the equality condition of the lemma. Thus, the policy $\pi$ is optimal and the theorem is proved.
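The interleaving step can also be checked numerically on small cases: among all interleavings of fixed non-increasing sequences, taking the values in globally decreasing order maximizes the discounted sum. A brute-force sketch with made-up charge sequences (the values and the helper `discounted_sum` are purely illustrative):

```python
import itertools

def discounted_sum(seq, gamma):
    """Sum of gamma**t * seq[t] over a merged sequence of charges."""
    return sum(gamma**t * x for t, x in enumerate(seq))

# Two made-up non-increasing sequences of prevailing charges.
A = [5.0, 3.0, 1.0]
B = [4.0, 2.0]
gamma = 0.9

best_val, best_order = float("-inf"), None
# Enumerate every interleaving by choosing which slots of the merged sequence come from A.
for slots in itertools.combinations(range(len(A) + len(B)), len(A)):
    merged, ia, ib = [], 0, 0
    for t in range(len(A) + len(B)):
        if t in slots:
            merged.append(A[ia]); ia += 1
        else:
            merged.append(B[ib]); ib += 1
    val = discounted_sum(merged, gamma)
    if val > best_val:
        best_val, best_order = val, merged

print(best_order)   # [5.0, 4.0, 3.0, 2.0, 1.0] -- the charges taken in decreasing order
```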