This post summarizes the regret analysis of multi-armed bandit problems and linear bandit problems.

Specifically, the exploration-first and epsilon-greedy algorithms achieve \(\tilde{O}(K^{1/3}T^{2/3})\) regret and UCB obtains \(\tilde{O}(\sqrt{KT})\) regret, where \(K\) is the number of arms in MAB. For linear bandits, LinUCB obtains \(\tilde{O}(\sigma d\sqrt{T})\) regret, where \(d\) is the feature dimension. Let us start with the multi-armed bandit (MAB) problems!


1. Multi-armed bandit (MAB) problems

In MAB, we have \(K\) actions (the “arms”), and when we play arm \(i \in \{1, 2,\ldots, K \}\) we obtain a random reward \(r_i\) whose mean satisfies:

\[\mathbb{E}[r_i]=\mu_i, \quad |\mu_i|\leq 1.\]

At each iteration \(t\), the learner picks an arm \(A_t \in \{1, 2, \ldots, K\}\). The regret is defined as:

\[R_T=T\cdot \max_i \mu_i-\sum_{t=1}^T \mu_{A_t}.\]

The goal is to minimize the regret.


1.1 Exploration-first algorithm

Algorithm 1: Exploration-first

  • Spend the first \(N\) steps exploring, where each action is played \(N/K\) times. The corresponding estimate for each action \(a\) is: $$ \widehat{Q}_t(a)=\frac{\sum_{i=1}^{t}R_i\cdot \mathbf{1}[A_i=a]}{\sum_{i=1}^t \mathbf{1}[A_i=a]}, \quad a\in[K] $$
  • For \(t = N+1,\ldots,T\), commit to the empirically best arm: $$ A_t:=\text{argmax}_a \widehat{Q}_N(a) $$
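The procedure is simple enough to spell out in code. Below is a minimal Python sketch, not taken from any particular library: the environment is modeled by a hypothetical `pull(a)` function that returns a stochastic reward for arm `a`, and `N` is assumed to be a multiple of `K`.

```python
import numpy as np

def exploration_first(pull, K, T, N):
    """Exploration-first: pull every arm N/K times, then commit to the
    empirically best arm for the remaining T - N rounds."""
    counts = np.zeros(K)   # number of pulls per arm
    sums = np.zeros(K)     # cumulative reward per arm
    total_reward = 0.0

    # Exploration phase: play each arm N // K times.
    for a in range(K):
        for _ in range(N // K):
            r = pull(a)
            counts[a] += 1
            sums[a] += r
            total_reward += r

    # Empirical means Q_hat(a) after exploration, then commit.
    q_hat = sums / np.maximum(counts, 1)
    best_arm = int(np.argmax(q_hat))

    # Exploitation phase: play the empirically best arm for the rest.
    for _ in range(T - N):
        total_reward += pull(best_arm)

    return best_arm, total_reward
```

For instance, `pull = lambda a: rng.binomial(1, true_means[a])` with `rng = np.random.default_rng(0)` simulates Bernoulli arms.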

Analysis of the Exploration-first algorithm

\(\textbf{Step 1.}\) For any \(t\geq N\), by Hoeffding’s inequality and a union bound, w.p. at least \(1-\delta\)

\[\sup_{a\in[K]}|\widehat{Q}(a)-\mu_a|\leq \sqrt{\frac{K}{2N}\log(2K/\delta)}=:\epsilon\]
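To see where this rate comes from, assume additionally that each reward lies in an interval of length one (e.g. rewards in \([0,1]\); with rewards in \([-1,1]\) only the constants change). After exploration, each arm has \(n=N/K\) i.i.d. samples, so Hoeffding's inequality with a union bound over the \(K\) arms gives

\[\mathbb{P}\Big(\exists a\in[K]:\ |\widehat{Q}(a)-\mu_a|>\epsilon\Big)\leq 2K\exp\big(-2(N/K)\epsilon^2\big),\]

and setting the right-hand side equal to \(\delta\) and solving for \(\epsilon\) yields \(\epsilon=\sqrt{\frac{K}{2N}\log(2K/\delta)}\).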

\(\textbf{Step 2.}\) Regret for the exploration phase:

\[R_{1:N}\leq \frac{N}{K}\sum_{a\in[K]}(\max_{a'}\mu_{a'}-\mu_a)\leq N\]

(assuming each gap \(\max_{a'}\mu_{a'}-\mu_a\) is at most \(1\); under \(|\mu_a|\leq 1\) the gaps can be as large as \(2\), which only affects the constant).

\(\textbf{Step 3.}\) Regret for the exploitation phase, where \(A_t\equiv \hat{a}^\star=\text{argmax}_a \widehat{Q}(a)\):

\[\begin{aligned} & R_{N+1:T}\leq (T-N)\cdot (\mu_{a^\star}-\mu_{\hat{a}^\star})\\ =&(T-N)[\mu_{a^\star}-\widehat{Q}(a^\star)+\widehat{Q}(a^\star)-\widehat{Q}(\hat{a}^\star)+\widehat{Q}(\hat{a}^\star)-\mu_{\hat{a}^\star}]\\ \leq &(T-N)[\epsilon+0+\epsilon]=2(T-N)\epsilon, \end{aligned}\]

where the middle term is at most \(0\) because \(\hat{a}^\star\) maximizes \(\widehat{Q}\), and the other two terms are bounded by the concentration event of Step 1.

\(\textbf{Step 4.}\) The total regret is

\[R_T\leq N+2T\sqrt{\frac{K}{2N}\log(2K/\delta)}=O\big(T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}\big)\]

where the last step follows by choosing \(N=T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}\).
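The choice of \(N\) comes from balancing the two terms, which move in opposite directions as \(N\) grows:

\[N = T\sqrt{\frac{K}{2N}\log(2K/\delta)}\;\Longleftrightarrow\; N^{3}=\frac{T^{2}K\log(2K/\delta)}{2}\;\Longleftrightarrow\; N=\Big(\tfrac{1}{2}\Big)^{1/3}T^{2/3}K^{1/3}\big(\log(2K/\delta)\big)^{1/3},\]

so at this choice both the exploration and the exploitation terms are of order \(T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}\).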


1.2 Epsilon-greedy algorithm

Algorithm 2: Epsilon-greedy

Let the strategy be:

  • With probability \(\epsilon\), choose the action uniformly at random;

  • With probability \(1-\epsilon\), select

\[A_t:=\text{argmax}_{a\in[K]} \widehat{Q}_t(a),\]

where \(\widehat{Q}_t\) is defined the same as Algorithm 1.
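A minimal Python sketch of this strategy, in the same style as before (again `pull(a)` is a hypothetical reward oracle, and arms that have not yet been pulled are given an empirical mean of zero as an implementation choice):

```python
import numpy as np

def epsilon_greedy(pull, K, T, eps, seed=0):
    """Epsilon-greedy: with probability eps explore uniformly at random,
    otherwise play the current empirical maximizer of Q_hat."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)
    sums = np.zeros(K)
    total_reward = 0.0

    for t in range(T):
        if rng.random() < eps:
            a = int(rng.integers(K))                  # explore
        else:
            q_hat = sums / np.maximum(counts, 1)      # unplayed arms default to 0
            a = int(np.argmax(q_hat))                 # exploit
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total_reward += r

    return total_reward, counts
```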

It can be shown that the regret is bounded by

\[\underbrace{\epsilon T}_{\text{Exploration}}+\underbrace{\sum_{t=1}^T C\sqrt{\frac{K}{\epsilon t}}}_{\text{Exploitation}}\]

Choosing \(\epsilon=T^{-1/3}K^{1/3}\) gives regret \(\tilde{O}(T^{2/3}K^{1/3})\).
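The choice of \(\epsilon\) again balances the two terms: since \(\sum_{t=1}^T t^{-1/2}\leq 2\sqrt{T}\), the exploitation term is at most \(2C\sqrt{KT/\epsilon}\), and

\[\epsilon T \asymp \sqrt{\frac{KT}{\epsilon}}\;\Longleftrightarrow\;\epsilon^{3}\asymp\frac{K}{T}\;\Longleftrightarrow\;\epsilon\asymp K^{1/3}T^{-1/3},\]

which makes both terms of order \(K^{1/3}T^{2/3}\).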


1.3 Upper Confidence Bound (UCB) Algorithm

\(\textbf{Optimism in the face of uncertainty: UCB}\)

Algorithm 3: UCB

  • Play each action \(a\in[K]\) once (in total \(K\) steps);
  • For \(t=K+1,\ldots,T\)
    • Choose \(A_t:=\text{argmax}_a \widehat{Q}_t(a)+\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}};\)

    • where \(\widehat{Q}_t(a)=\frac{1}{N_t(a)}(R_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot R_i)\), \(N_t(a)=\sum_{i=1}^{t-1}\mathbf{1}[A_i=a]\), and \(R_a\) denotes the reward observed when arm \(a\) was played during the first \(K\) steps.
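A minimal Python sketch of UCB in the same style (`pull(a)` is a hypothetical reward oracle; the bonus follows the \(\sqrt{2\log(2TK/\delta)/N_t(a)}\) form used above):

```python
import numpy as np

def ucb(pull, K, T, delta=0.05):
    """UCB: play each arm once, then repeatedly play the arm with the
    largest empirical mean plus exploration bonus."""
    counts = np.zeros(K)
    sums = np.zeros(K)

    # Initialization: play each arm once.
    for a in range(K):
        sums[a] += pull(a)
        counts[a] += 1

    bonus_scale = 2.0 * np.log(2.0 * T * K / delta)
    for t in range(K, T):
        q_hat = sums / counts
        bonus = np.sqrt(bonus_scale / counts)
        a = int(np.argmax(q_hat + bonus))   # optimism in the face of uncertainty
        sums[a] += pull(a)
        counts[a] += 1

    return sums / counts, counts
```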

Regret analysis: non-adaptive bound

By the Azuma-Hoeffding inequality and a union bound, w.p. at least \(1-\delta\),

\[|R_a-\mu_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot(R_i-\mu_a)|\leq \sqrt{2 N_t(a)\log(KT/\delta)},\quad \forall a\in[K],\ t\in[K+1,T]\]

Note the above is equivalent to

\[\begin{aligned} &\sup_{t,a}\frac{1}{\sqrt{N_t(a)}}|R_a-\mu_a+\sum_{i=K+1}^{t-1} \mathbf{1}[A_i=a]\cdot(R_i-\mu_a)|\leq \sqrt{2 \log(KT/\delta)}\\ \Leftrightarrow & \sup_{t,a} \sqrt{N_t(a)}\cdot |\widehat{Q}_t(a)-\mu_a|\leq \sqrt{2\log(KT/\delta)}\\ \Leftrightarrow & |\widehat{Q}_t(a)-\mu_a|\leq \sqrt{\frac{2\log(KT/\delta)}{N_t(a)}},\quad \forall a\in[K],t\\ \end{aligned}\]

Let \(\bar{Q}_t(a)=\widehat{Q}_t(a)+\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}}\) denote the UCB index used by the algorithm; then at each time \(t\),

\[\begin{aligned} &\mu_{a^\star}-\mu_{A_t}\\ =&\mu_{a^\star}-\bar{Q}_t(a^\star)+\bar{Q}_t(a^\star)-\bar{Q}_t(A_t)+\bar{Q}_t(A_t)-\mu_{A_t}\\ \leq& 0+0+2 \sqrt{\frac{2\log(KT/\delta)}{N_t(A_t)}} \end{aligned}\]

which uses optimism and the UCB rule. Then the total regret is bounded by

\[\begin{aligned} R_T=&R_{1:K}+R_{K+1:T}\\ \leq& K+\sum_{t=K+1}^T (\mu_{a^\star}-\mu_{A_t})\\ \leq & K+ \sum_{t=K+1}^T 2 \sqrt{\frac{2\log(KT/\delta)}{N_t(A_t)}}\\ \leq & K+2 \sqrt{2\log(KT/\delta)}\sum_{a=1}^K\sum_{i=1}^{N_T(a)}\frac{1}{\sqrt{i}}\\ \leq & K +4 \sqrt{2\log(KT/\delta)}\sum_{a=1}^K \sqrt{N_T(a)} \quad \text{eqn} (\star)\\ \leq & K +4 \sqrt{2\log(KT/\delta)}\sqrt{K\cdot \sum_{a=1}^K N_T(a)}\\ \leq & K +4 \sqrt{2KT\log(KT/\delta)}\\ \end{aligned}\]
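Two steps above deserve a word. The step marked \((\star)\) uses the integral comparison

\[\sum_{i=1}^{n}\frac{1}{\sqrt{i}}\leq \int_0^{n}\frac{dx}{\sqrt{x}}=2\sqrt{n},\]

and the step after it is the Cauchy-Schwarz inequality \(\sum_{a=1}^K\sqrt{N_T(a)}\leq\sqrt{K\sum_{a=1}^K N_T(a)}\), combined with \(\sum_{a}N_T(a)\leq T\).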

Regret analysis: gap-dependent expression

Define the gap \(\Delta_a:= \mu_{a^\star}-\mu_a\). By the concentration result and the definition of \(\bar{Q}_t\), we have

\[\begin{aligned} \bar{Q}_t(a)\leq& \mu_a+2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}}\\ =&\mu_{a^\star}-\Delta_a+2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}}\\ \Leftrightarrow& \mu_{a^\star}-\bar{Q}_t(a)\geq \Delta_a-2\sqrt{\frac{2\log(2TK/\delta)}{N_t(a)}} \end{aligned}\]

Note that once \(\mu_{a^\star}-\bar{Q}_t(a)\geq 0\), arm \(a\) will never be played again (since \(\bar{Q}_t(a^\star)\geq \mu_{a^\star}\) always upper bounds \(\bar{Q}_t(a)\)) and \(N_t(a)\) will no longer change! Therefore, up to one extra pull of arm \(a\), we always have

\[0\geq \Delta_a-2\sqrt{\frac{2\log(2TK/\delta)}{N_T(a)}}\Leftrightarrow N_T(a)\leq \frac{8\log(2TK/\delta)}{\Delta_a^2}\]

The former non-adaptive bound can be replaced by

\[R_T=\sum_{a=1}^K \Delta_a+{O}(\sum_{a \neq a^\star} \frac{1}{\Delta_a}\log(KT/\delta))\]
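Concretely, each of the first \(K\) pulls contributes at most its gap, and every later pull of a suboptimal arm \(a\) costs \(\Delta_a\), so the bound on \(N_T(a)\) gives

\[R_T=R_{1:K}+R_{K+1:T}\leq\sum_{a}\Delta_a+\sum_{a\neq a^\star}\Delta_a\, N_T(a)\leq \sum_{a}\Delta_a+\sum_{a\neq a^\star}\frac{8\log(2TK/\delta)}{\Delta_a},\]

which is exactly the expression above with the constant hidden in the \(O(\cdot)\).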

\(\textbf{Short note:}\) For the non-stochastic (e.g. adversarial) bandit setting, there are algorithms (e.g. EXP3) that achieve the same \(\tilde{O}(\sqrt{KT})\) regret. See Peter Auer et al. (2001).


2. Linear Bandits

The problem setup

  • The action space is a compact set \(x_t\in \mathcal{D}\subset \mathbb{R}^d\);
  • Reward is linear with i.i.d. mean-zero \(\sigma^2\)-sub-Gaussian noise:

\(r_t=\mu^\star\cdot x_t+\eta_t,\quad \mathbb{E}[r_t|x_t=x]=\mu^\star\cdot x\in[-1,1];\)

  • Agent chooses a sequence of actions \(x_1,\ldots,x_T\);
  • Let \(x^\star\in \text{argmax}_{x\in\mathcal{D}}\mu^\star\cdot x\), then the regret is defined as
\[R_T=T\cdot \langle \mu^\star,x^\star \rangle-\sum_{t=1}^T \langle \mu^\star,x_t \rangle.\]

2.1 LinUCB Algorithm

For each time \(t\), with collected data \((x_i,r_i)\) for all \(i=1,\ldots,t-1\):

  • Compute the estimated \(\widehat{\mu}_t\) through ridge regression:
\[\widehat{\mu}_t:=\text{argmin}_\theta \sum_{i=1}^{t-1}(x_i^\top \theta-r_i)^2+\lambda ||\theta||^2_2\]

Define \(\Sigma_t=\lambda I +\sum_{i=1}^{t-1}x_i x_i^\top\); then \(\widehat{\mu}_t=\Sigma^{-1}_t\sum_{i=1}^{t-1}r_i x_i\).

  • Construct high probability confidence ellipsoid of the parameter
\[\text{Ball}_t=\{\mu| (\mu-\widehat{\mu}_t)^\top \Sigma_t(\mu-\widehat{\mu}_t)\leq \beta_t\}\]
  • Choose actions that maximize the UCB
\[x_t=\text{argmax}_{x\in\mathcal{D}}\max_{\mu\in\text{Ball}_t}\langle x,\mu\rangle\]

\(\textbf{Note:}\) in general the maximization step of LinUCB can be computationally intractable (NP-hard for some action sets \(\mathcal{D}\)).
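For a finite action set, however, the inner maximization has the closed form \(\max_{\mu\in\text{Ball}_t}\langle x,\mu\rangle=\langle x,\widehat{\mu}_t\rangle+\sqrt{\beta_t\, x^\top\Sigma_t^{-1}x}\) (maximizing a linear function over an ellipsoid), which makes each round easy to compute. Below is a minimal Python sketch under that assumption; `reward(x)` is a hypothetical noisy-reward oracle and `beta` is treated as a user-supplied constant rather than the theoretically prescribed \(\beta_t\).

```python
import numpy as np

def linucb(reward, actions, T, lam=1.0, beta=1.0):
    """LinUCB over a finite action set.

    actions: (n, d) array, one candidate action per row.
    reward:  callable x -> noisy scalar reward (hypothetical oracle).
    """
    n, d = actions.shape
    Sigma = lam * np.eye(d)   # regularized design matrix Sigma_t
    b = np.zeros(d)           # running sum of r_i * x_i
    history = []

    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b                     # ridge-regression estimate
        # UCB index <x, mu_hat> + sqrt(beta * x^T Sigma^{-1} x) for every action.
        widths = np.sqrt(beta * np.einsum("ij,jk,ik->i", actions, Sigma_inv, actions))
        x = actions[int(np.argmax(actions @ mu_hat + widths))]

        r = reward(x)
        Sigma += np.outer(x, x)                    # update the design matrix
        b += r * x
        history.append((x, r))

    mu_hat = np.linalg.inv(Sigma) @ b              # final estimate
    return mu_hat, history
```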


Theorem (Upper bound of LinUCB) Suppose \(\lVert\mu^\star\rVert\leq W\) and \(\lVert x\rVert\leq B\) for all \(x\in\mathcal{D}\). Choose \(\beta_t=\tilde{O}(\sigma^2 \cdot d)\) and set \(\lambda=\sigma^2/W^2\). Then w.p. at least \(1-\delta\),

\[R_T\leq C\sigma \sqrt{T}(d\log(1+\frac{TB^2W^2}{d\sigma^2})+\log(4/\delta)).\]

2.2 Analysis of the LinUCB algorithm

The analysis is based on several lemmas.

  • Lemma 1 ("Width" of confidence ball): Let \(x\in\mathcal{D}\). If \(\mu\in\text{Ball}_t\), then
\[|(\mu-\widehat{\mu}_t)^\top x|\leq\sqrt{\beta_tx^\top \Sigma^{-1}_t x}\]
  • Lemma 2 (Instantaneous regret is bounded): Fix \(t\leq T\) and define \(w_t:=\sqrt{x_t^\top \Sigma^{-1}_t x_t}\). If \(\mu^\star\in\text{Ball}_t\), then
\[\text{regret}_t\leq 2\min(\sqrt{\beta_t}w_t,1)\leq 2\sqrt{\beta_T}\min(w_t,1).\]
  • Lemma 3 ("Geometric potential" argument): We have:
\[\text{det}\Sigma_T=\text{det}\Sigma_0 \cdot \prod_{t=0}^{T-1}(1+w_t^2).\]

\(\textbf{Note:}\) The proof of this lemma is interesting; see Lemma 5.9 of AJKS.

  • Lemma 4: For any sequence \(x_0,\ldots,x_{T-1}\) with \(\lVert x_t \rVert_2\leq B\) for all \(t<T\), we have
\[\log(\text{det}\Sigma_{T}/\text{det}\Sigma_0)=\log \text{det}(I+\frac{1}{\lambda}\sum_{t=0}^{T-1}x_tx_t^\top)\leq d\log(1+\frac{TB^2}{d\lambda}).\]

Combining Lemmas 1-4, we have that for LinUCB, if \(\mu^\star\in\text{Ball}_t\) for all \(t\), then

\begin{equation}\label{eqn:l2norm} \sum_{t=0}^{T-1} \text{regret}_{t}^{2} \leq 4 \beta_T d\log(1+\frac{TB^2}{d\lambda}) \end{equation}
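The combination goes roughly as follows, up to absolute constants: Lemma 2 gives \(\text{regret}_t^2\leq 4\beta_T\min(w_t^2,1)\), and since \(\min(u,1)\leq 2\log(1+u)\) for \(u\geq 0\), Lemmas 3 and 4 yield

\[\sum_{t=0}^{T-1}\text{regret}_t^2\;\leq\;4\beta_T\sum_{t=0}^{T-1}\min(w_t^2,1)\;\lesssim\;\beta_T\sum_{t=0}^{T-1}\log(1+w_t^2)\;=\;\beta_T\log\frac{\det\Sigma_T}{\det\Sigma_0}\;\leq\;\beta_T\, d\log\Big(1+\frac{TB^2}{d\lambda}\Big).\]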

Also, it can be shown that for any \(\delta>0\), with the prescribed choice of \(\beta_t\), \(\mathbb{P}(\forall t,\ \mu^\star\in\text{Ball}_t)\geq 1-\delta\).

Combining \eqref{eqn:l2norm} with the above, and applying the Cauchy-Schwarz inequality, we can show that w.p. \(1-\delta\),

\[R_{T}=\sum_{t=0}^{T-1} \text{regret}_{t} \leq \sqrt{T \sum_{t=0}^{T-1} \text{regret}_{t}^{2}} \leq \sqrt{4 T \beta_{T} d \log \left(1+\frac{T B^{2}}{d \lambda}\right)}\]

which finishes the proof.


The content of this post mainly comes from the Bandit Algorithms book by Lattimore and Szepesvári and the AJKS monograph (Agarwal, Jiang, Kakade, and Sun, Reinforcement Learning: Theory and Algorithms).