A Brief Summary of Upper Bounds for Bandit Problems
This post summarizes the regret analysis for multi-armed bandit problems and linear bandit problems.
Specifically, the exploration-first and epsilon-greedy algorithms achieve \(\tilde{O}(K^{1/3}T^{2/3})\) regret and UCB obtains \(\tilde{O}(\sqrt{KT})\) regret, where \(K\) is the number of arms in the MAB. For linear bandits, LinUCB obtains \(\tilde{O}(\sigma d\sqrt{T})\) regret, where \(d\) is the feature dimension. Let us start with the multi-armed bandit (MAB) problems!
1. Multi-armed bandit (MAB) problems
In MAB, we have \(K\) actions (the “arms”), and when we play arm \(i \in \{1, 2,\ldots, K \}\) we receive a random reward \(r_i\) with mean
\[\mathbb{E}[r_i]=\mu_i, \quad |\mu_i|\leq 1.\]At every iteration \(t\), the learner picks an arm \(A_t \in \{1, 2, \ldots, K\}\). The regret is defined as:
\[R_T=T\cdot \max_i \mu_i-\sum_{t=1}^T \mu_{A_t}.\]The goal is to minimize the regret.
1.1 Exploration-first algorithm
Algorithm 1: Exploration-first
- Spend the first \(N\) steps exploring, playing each action \(N/K\) times. The corresponding estimate for each action \(a\) is: $$ \widehat{Q}_t(a)=\frac{\sum_{i=1}^{t}R_i\cdot \mathbf{1}[A_i=a]}{\sum_{i=1}^t \mathbf{1}[A_i=a]}, \quad a\in[K] $$
- For \(t = N+1,\ldots,T\), play the greedy action (a code sketch follows the algorithm): $$ A_t:=\text{argmax}_a \widehat{Q}_t(a) $$
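Below is a minimal NumPy sketch of this procedure, just to make the two phases concrete. The function name `explore_then_commit` and the Gaussian reward model are illustrative assumptions of mine; the post only requires \(|\mu_i|\leq 1\) and does not fix a reward distribution.

```python
import numpy as np

def explore_then_commit(means, T, N, rng=None):
    """Exploration-first sketch: explore each arm N // K times, then commit.

    Assumes N >= K; rewards are illustrative Gaussians with the given means.
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)

    counts = np.zeros(K)
    sums = np.zeros(K)

    # Exploration phase: play each arm N // K times (round-robin).
    for a in np.tile(np.arange(K), N // K):
        sums[a] += rng.normal(means[a], 1.0)   # reward with mean mu_a
        counts[a] += 1

    # Exploitation phase: commit to the arm with the best empirical mean Q_hat.
    best_emp = int(np.argmax(sums / counts))
    counts[best_emp] += T - counts.sum()

    # Regret R_T = T * max_a mu_a - sum_t mu_{A_t}.
    return T * means.max() - float(counts @ means)
```

For example, `explore_then_commit([0.2, 0.5, 0.8], T=10_000, N=600)` plays each of the three arms 200 times before committing.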
Analysis of the Exploration-first algorithm
\(\textbf{Step 1.}\) For any \(t\geq N\), each arm has been played \(N/K\) times, so by Hoeffding’s inequality and a union bound, w.p. \(1-\delta\),
\[\sup_{a\in[K]}|\widehat{Q}(a)-\mu_a|\leq \sqrt{\frac{K}{2N}\log(2K/\delta)}:=\epsilon\]\(\textbf{Step 2.}\) Regret for the exploration phase:
\[R_{1:N}\leq \frac{N}{K}\sum_{a\in[K]}(\max_{a'}\mu_{a'}-\mu_a)\leq 2N\]\(\textbf{Step 3.}\) Regret for the exploitation phase, where \(A_t\equiv \hat{a}^\star=\text{argmax}_a \widehat{Q}(a)\):
\[\begin{aligned} & R_{N+1:T}\leq (T-N)\cdot (\mu_{a^\star}-\mu_{\hat{a}^\star})\\ =&(T-N)[\mu_{a^\star}-\widehat{Q}(a^\star)+\widehat{Q}(a^\star)-\widehat{Q}(\hat{a}^\star)+\widehat{Q}(\hat{a}^\star)-\mu_{\hat{a}^\star}]\\ \leq &(T-N)[\epsilon+0+\epsilon] \end{aligned}\]where the middle term is non-positive because \(\hat{a}^\star\) maximizes \(\widehat{Q}\). \(\textbf{Step 4.}\) The total regret is
\[R_T\leq 2N+2T\sqrt{\frac{K}{2N}\log(2K/\delta)}=O(T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3})\]where the last step chooses \(N=T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}\).
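For intuition, this choice of \(N\) comes from balancing the two terms above, dropping constants: the exploration regret grows like \(N\) while the exploitation regret shrinks like \(T\sqrt{K\log(2K/\delta)/N}\), so
\[N \asymp T\sqrt{\frac{K\log(2K/\delta)}{N}} \;\Longleftrightarrow\; N^{3/2}\asymp T\sqrt{K\log(2K/\delta)} \;\Longleftrightarrow\; N\asymp T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}.\]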
1.2 Epsilon-greedy algorithm
Algorithm 2: Epsilon-greedy
Let the strategy at each iteration \(t\) be:
- With probability \(\epsilon\), choose the action uniformly at random;
- With probability \(1-\epsilon\), select \(A_t:=\text{argmax}_a \widehat{Q}_t(a)\), where \(\widehat{Q}_t\) is defined in the same way as in Algorithm 1.
It can be shown that the regret is bounded by
\[\underbrace{\epsilon T}_{\text{Exploration}}+\underbrace{\sum_{t=1}^T C\sqrt{\frac{K}{\epsilon t}}}_{\text{Exploitation}}\]Choosing \(\epsilon=T^{-1/3}K^{1/3}\) gives regret \(\tilde{O}(T^{2/3}K^{1/3})\).
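A minimal sketch of Algorithm 2 in the same style as before (again, the function name `epsilon_greedy` and the Gaussian rewards are illustrative choices, not part of the post):

```python
import numpy as np

def epsilon_greedy(means, T, epsilon, rng=None):
    """Epsilon-greedy sketch: explore w.p. epsilon, otherwise play argmax of Q_hat."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)

    counts = np.zeros(K)
    sums = np.zeros(K)
    mean_reward_collected = 0.0

    for t in range(T):
        if rng.random() < epsilon or counts.min() == 0:
            a = int(rng.integers(K))               # explore: uniformly random arm
        else:
            a = int(np.argmax(sums / counts))      # exploit: empirical best arm
        sums[a] += rng.normal(means[a], 1.0)       # reward with mean mu_a
        counts[a] += 1
        mean_reward_collected += means[a]

    return T * means.max() - mean_reward_collected  # regret R_T
```

The choice of \(\epsilon\) above corresponds to calling it with `epsilon = (K / T) ** (1 / 3)`.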
1.3 Upper Confidence Bound (UCB) Algorithm
UCB is built on the principle of \(\textbf{optimism in the face of uncertainty}\).
Algorithm 3: UCB
- Play each action \(a\in[K]\) once (in total \(K\) steps);
- For \(t=K+1,\ldots,T\), choose
\[A_t:=\text{argmax}_a\; \widehat{Q}_t(a)+\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}},\]
where \(\widehat{Q}_t(a)=\frac{1}{N_t(a)}\big(R_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot R_i\big)\), \(N_t(a)=\sum_{i=1}^{t-1}\mathbf{1}[A_i=a]\), and \(R_a\) is the reward from the initial play of arm \(a\). (A code sketch of this algorithm follows below.)
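Here is a minimal sketch of Algorithm 3, with the same illustrative Gaussian reward model as before and the bonus term \(\sqrt{2\log(2TK/\delta)/N_t(a)}\) from the algorithm (the function name `ucb` is my own):

```python
import numpy as np

def ucb(means, T, delta=0.05, rng=None):
    """UCB sketch: play each arm once, then maximize Q_hat(a) + exploration bonus."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)

    counts = np.zeros(K)
    sums = np.zeros(K)
    mean_reward_collected = 0.0

    for t in range(T):
        if t < K:
            a = t                                      # initialization: play each arm once
        else:
            bonus = np.sqrt(2 * np.log(2 * T * K / delta) / counts)
            a = int(np.argmax(sums / counts + bonus))  # optimistic (UCB) arm
        sums[a] += rng.normal(means[a], 1.0)           # reward with mean mu_a
        counts[a] += 1
        mean_reward_collected += means[a]

    return T * means.max() - mean_reward_collected     # regret R_T
```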
Regret analysis: non-adaptive bound
By the Azuma-Hoeffding inequality and a union bound, w.p. \(1-\delta\),
\[|R_a-\mu_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot(R_i-\mu_a)|\leq \sqrt{2 N_t(a)\log(2KT/\delta)},\quad \forall a\in[K],\ t\in[K+1,T]\]Note that the above is equivalent to
\[\begin{aligned} &\sup_{t,a}\frac{1}{\sqrt{N_t(a)}}|R_a-\mu_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot(R_i-\mu_a)|\leq \sqrt{2 \log(2KT/\delta)}\\ \Leftrightarrow & \sup_{t,a} \sqrt{N_t(a)}\cdot |\widehat{Q}_t(a)-\mu_a|\leq \sqrt{2\log(2KT/\delta)}\\ \Leftrightarrow & |\widehat{Q}_t(a)-\mu_a|\leq \sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}},\quad \forall a\in[K],t\\ \end{aligned}\]Recall \(\bar{Q}_t(a)=\widehat{Q}_t(a)+\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}}\); then at each time \(t\),
\[\begin{aligned} &\mu_{a^\star}-\mu_{A_t}\\ =&\mu_{a^\star}-\bar{Q}_t(a^\star)+\bar{Q}_t(a^\star)-\bar{Q}_t(A_t)+\bar{Q}_t(A_t)-\mu_{A_t}\\ \leq& 0+0+2 \sqrt{\frac{2\log(2KT/\delta)}{N_t(A_t)}} \end{aligned}\]which uses optimism (\(\mu_{a^\star}\leq\bar{Q}_t(a^\star)\)), the UCB rule (\(\bar{Q}_t(a^\star)\leq\bar{Q}_t(A_t)\)), and the concentration bound above. Then the total regret is bounded by
\[\begin{aligned} R_T=&R_{1:K}+R_{K+1:T}\\ \leq& 2K+\sum_{t=K+1}^T (\mu_{a^\star}-\mu_{A_t})\\ \leq & 2K+ \sum_{t=K+1}^T 2 \sqrt{\frac{2\log(2KT/\delta)}{N_t(A_t)}}\\ \leq & 2K+2 \sqrt{2\log(2KT/\delta)}\sum_{a=1}^K\sum_{i=1}^{N_T(a)}\frac{1}{\sqrt{i}}\\ \leq & 2K +4 \sqrt{2\log(2KT/\delta)}\sum_{a=1}^K \sqrt{N_T(a)} \quad (\star)\\ \leq & 2K +4 \sqrt{2\log(2KT/\delta)}\sqrt{K\cdot \sum_{a=1}^K N_T(a)}\\ \leq & 2K +4 \sqrt{2KT\log(2KT/\delta)}\\ \end{aligned}\]where \((\star)\) uses \(\sum_{i=1}^{n}1/\sqrt{i}\leq 2\sqrt{n}\) and the next line uses the Cauchy-Schwarz inequality.
Regret analysis: gap-dependent expression
Define the gap \(\Delta_a:= \mu_{a^\star}-\mu_a\). By the concentration result and the UCB rule, we know
\[\begin{aligned} \bar{Q}_t(a)\leq& \mu_a+2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}}\\ =&\mu_{a^\star}-\Delta_a+2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}}\\ \Leftrightarrow& \mu_{a^\star}-\bar{Q}_t(a)\geq \Delta_a-2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}} \end{aligned}\]Note that once \(\mu_{a^\star}-\bar{Q}_t(a)\geq 0\), arm \(a\) will never be played again (since by optimism \(\bar{Q}_t(a^\star)\geq\mu_{a^\star}\geq\bar{Q}_t(a)\), so the UCB rule always prefers \(a^\star\) over \(a\)) and \(N_t(a)\) no longer changes! Therefore from the above we always have
\[0\geq \Delta_a-2\sqrt{\frac{2\log(2TK/\delta)}{N_T(a)}}\Leftrightarrow N_T(a)\leq \frac{8\log(2TK/\delta)}{\Delta_a^2}\]Since the total regret decomposes as \(R_T=\sum_{a}\Delta_a\cdot(\text{number of pulls of }a)\), the former non-adaptive bound can be replaced by
\[R_T\leq\sum_{a=1}^K \Delta_a+{O}\Big(\sum_{a \neq a^\star} \frac{1}{\Delta_a}\log(2TK/\delta)\Big)\]\(\textbf{Short note:}\) for the non-stochastic bandit setting (e.g. the adversarial setting), there are algorithms (e.g. EXP3) that achieve the same regret. See Peter Auer et al. (2001).
2. Linear Bandits
The problem setup
- The action space is a compact set \(\mathcal{D}\subset \mathbb{R}^d\), and at each round the agent plays \(x_t\in \mathcal{D}\);
- The reward is linear with i.i.d. mean-zero \(\sigma^2\)-sub-Gaussian noise \(\eta_t\):
\(r_t=\mu^\star\cdot x_t+\eta_t,\quad \mathbb{E}[r_t|x_t=x]=\mu^\star\cdot x\in[-1,1];\)
- Agent chooses a sequence of actions \(x_1,\ldots,x_T\);
- Let \(x^\star\in \text{argmax}_{x\in\mathcal{D}}\mu^\star\cdot x\); then the regret is defined as \(R_T=\sum_{t=1}^{T}(\mu^\star\cdot x^\star-\mu^\star\cdot x_t)\).
2.1. LinUCB Algorithm
For each time \(t\), given the collected data \((x_i,r_i)\), \(i=1,\ldots,t-1\):
- Compute the estimate \(\widehat{\mu}_t\) through ridge regression: define \(\Sigma_t=\lambda I +\sum_{i=1}^{t-1}x_i x_i^\top\), then \(\widehat{\mu}_t:=\Sigma^{-1}_t\sum_{i=1}^{t-1}r_i x_i\);
- Construct a high-probability confidence ellipsoid for the parameter: \(\text{Ball}_t:=\{\mu: (\mu-\widehat{\mu}_t)^\top\Sigma_t(\mu-\widehat{\mu}_t)\leq\beta_t\}\) for a suitable radius \(\beta_t\);
- Choose the action that maximizes the UCB: \(x_t:=\text{argmax}_{x\in\mathcal{D}}\max_{\mu\in\text{Ball}_t}\mu\cdot x\).
\(\textbf{Note:}\) the action selection in LinUCB (maximizing the UCB over \(\mathcal{D}\)) could be NP-hard in general, though.
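When \(\mathcal{D}\) is a finite set, however, the inner maximization is explicit, since \(\max_{\mu\in\text{Ball}_t}\mu\cdot x=\widehat{\mu}_t\cdot x+\sqrt{\beta_t\, x^\top\Sigma^{-1}_t x}\). Below is a minimal sketch under that assumption; the function name `linucb`, the constant \(\beta\), and the Gaussian noise are illustrative choices, not prescribed by the post.

```python
import numpy as np

def linucb(mu_star, actions, T, beta=1.0, lam=1.0, sigma=1.0, rng=None):
    """LinUCB sketch for a finite action set (the rows of `actions`)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_star = np.asarray(mu_star, dtype=float)
    actions = np.asarray(actions, dtype=float)
    d = len(mu_star)

    Sigma = lam * np.eye(d)   # Sigma_t = lam * I + sum_i x_i x_i^T
    b = np.zeros(d)           # sum_i r_i x_i
    best_value = float((actions @ mu_star).max())
    regret = 0.0

    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b                          # ridge-regression estimate
        # UCB of each action x: mu_hat . x + sqrt(beta * x^T Sigma^{-1} x).
        widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, Sigma_inv, actions))
        x = actions[int(np.argmax(actions @ mu_hat + np.sqrt(beta) * widths))]

        r = float(mu_star @ x) + sigma * rng.standard_normal()  # noisy linear reward
        Sigma += np.outer(x, x)
        b += r * x
        regret += best_value - float(mu_star @ x)

    return regret
```

For instance, `linucb(mu_star=[0.5, -0.3], actions=np.eye(2), T=1000)` recovers a 2-armed bandit as a special case.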
Theorem (Upper bound of LinUCB) Choose \(\beta_t=\tilde{O}(\sigma^2 d)\) and suppose \(\lVert\mu^\star\rVert\leq W\) and \(\lVert x\rVert\leq B\) for all \(x\in\mathcal{D}\). Then, setting \(\lambda=\sigma^2/W^2\), w.p. \(1-\delta\),
\[R_T\leq C\sigma \sqrt{T}(d\log(1+\frac{TB^2W^2}{d\sigma^2})+\log(4/\delta)).\]2.2 Analysis of the LinUCB algorithm
The analysis is based on several lemmas.
- Lemma 1: “Width” of confidence ball. Let \(x\in\mathcal{D}\). If \(\mu\in\text{Ball}_t\), then \(|(\mu-\widehat{\mu}_t)\cdot x|\leq \sqrt{\beta_t\, x^\top \Sigma^{-1}_t x}\).
- Lemma 2: Instantaneous regret is bounded. Fix \(t\leq T\) and define \(w_t:=\sqrt{x_t^\top \Sigma^{-1}_t x_t}\). If \(\mu^\star\in\text{Ball}_t\), then \(\text{regret}_t:=\mu^\star\cdot x^\star-\mu^\star\cdot x_t\leq 2\min(\sqrt{\beta_t}\,w_t,\,1)\).
- Lemma 3: “Geometric potential” argument. We have \(\det\Sigma_{T}=\det(\lambda I)\prod_{t=0}^{T-1}(1+w_t^2)\), i.e. \(\log\frac{\det\Sigma_{T}}{\det(\lambda I)}=\sum_{t=0}^{T-1}\log(1+w_t^2)\).
\(\textbf{Note:}\) The proof of this lemma is interesting, see Lemma 5.9 of AJKS.
- Lemma 4: For any sequence \(x_0,\ldots,x_{T-1}\) such that \(\lVert x_t \rVert_2\leq B\) for \(t<T\), we have \(\log\frac{\det\Sigma_{T}}{\det(\lambda I)}=\log\det\Big(I+\frac{1}{\lambda}\sum_{t=0}^{T-1}x_t x_t^\top\Big)\leq d\log\Big(1+\frac{TB^2}{d\lambda}\Big)\).
Combining Lemmas 1-4, we have for LinUCB that if \(\mu^\star\in\text{Ball}_t\) for all \(t\), then
\begin{equation}\label{eqn:l2norm} \sum_{t=0}^{T-1} \text{regret}_{t}^{2} \leq 4 \beta_T d\log(1+\frac{TB^2}{d\lambda}) \end{equation}
Also, it can be shown that for any \(\delta>0\), \(\mathbb{P}(\forall t,\ \mu^\star\in\text{Ball}_t)\geq 1-\delta\).
Combining \eqref{eqn:l2norm} and the above, and applying the Cauchy-Schwarz inequality, we can show that w.p. \(1-\delta\),
\[R_{T}=\sum_{t=0}^{T-1} \text{regret}_{t} \leq \sqrt{T \sum_{t=0}^{T-1} \text{regret}_{t}^{2}} \leq \sqrt{4 T \beta_{T} d \log \left(1+\frac{T B^{2}}{d \lambda}\right)}\]which finishes the proof.
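Plugging in \(\lambda=\sigma^2/W^2\) and \(\beta_T=\tilde{O}(\sigma^2 d)\) (up to constants and the \(\log(4/\delta)\) term hidden in \(\beta_T\)), this recovers the rate in the theorem:
\[R_T\leq \sqrt{4T\beta_T d\log\Big(1+\frac{TB^2W^2}{d\sigma^2}\Big)}=\tilde{O}\big(\sigma d\sqrt{T}\big).\]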
The content of this post mainly comes from the Bandit Algorithms Book and AJKS.