This post summarizes the regret analysis for multi-armed bandit (MAB) problems and linear bandit problems.

Specifically, the exploration-first / epsilon-greedy algorithms achieve $\tilde{O}(K^{1/3}T^{2/3})$ regret and UCB obtains $\tilde{O}(\sqrt{KT})$ regret, where $K$ is the number of arms in MAB. For linear bandits, LinUCB obtains $\tilde{O}(\sigma d\sqrt{T})$ regret, where $d$ is the feature dimension. Let us start with the multi-armed bandit (MAB) problems!


1. Multi-armed bandit (MAB) problems

In MAB, we have $K$ actions (the "arms"), and when we play arm $i \in \{1, 2, \ldots, K\}$ we obtain a random reward $r_i$ with mean

$$\mathbb{E}[r_i] = \mu_i, \qquad |\mu_i| \le 1.$$

At every iteration $t$, the learner picks an arm $A_t \in \{1, 2, \ldots, K\}$. The regret is defined as

$$R_T = T\max_i \mu_i - \sum_{t=1}^T \mu_{A_t}.$$

The goal is to minimize the regret.


1.1 Exploration-first algorithm

Algorithm 1: Exploration-first

  • Spend the first $N$ steps exploring, playing each action $N/K$ times. The corresponding estimate for each action $a$ is $$\hat{Q}_t(a) = \frac{\sum_{i=1}^t R_i \mathbf{1}[A_i = a]}{\sum_{i=1}^t \mathbf{1}[A_i = a]}, \quad \forall a \in [K].$$
  • For $t = N+1, \ldots, T$: play $A_t := \arg\max_a \hat{Q}_t(a)$ (a short simulation sketch is given below).
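
To make the procedure concrete, here is a minimal simulation sketch in Python/NumPy. It is not from the original post: the Gaussian reward noise, the round-robin exploration order, and all variable names are illustrative assumptions.

```python
import numpy as np

def explore_first(means, T, N, rng=None):
    """Exploration-first on a K-armed bandit with Gaussian noise (illustrative sketch).

    means: true mean rewards mu_a (unknown to the learner, used only to simulate rewards)
    T: horizon; N: length of the exploration phase (assumed to be a multiple of K)
    Returns the realized pseudo-regret T * max(mu) - sum_t mu_{A_t}.
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)
    totals = np.zeros(K)   # running sum of observed rewards per arm
    pulls = np.zeros(K)    # N_t(a)
    chosen = []

    # Exploration phase: play each arm N/K times (round-robin order).
    for t in range(N):
        a = t % K
        totals[a] += means[a] + rng.normal()
        pulls[a] += 1
        chosen.append(a)

    # Exploitation phase: commit to the empirically best arm.
    a_hat = int(np.argmax(totals / pulls))
    chosen.extend([a_hat] * (T - N))

    return T * means.max() - means[np.array(chosen)].sum()

# The theory suggests N ~ T^{2/3} K^{1/3} (log factor dropped here).
mu, T = [0.1, 0.3, 0.5, 0.7], 10_000
N = max(len(mu), int(T ** (2 / 3) * len(mu) ** (1 / 3)) // len(mu) * len(mu))
print(explore_first(mu, T, N))
```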

Analysis of the Exploration-first algorithm

Step 1. After the exploration phase (i.e., for any $t \ge N$; write $\hat{Q} := \hat{Q}_N$ for the estimates at the end of exploration), by Hoeffding's inequality and a union bound, w.p. $1-\delta$,

$$\sup_{a \in [K]} |\hat{Q}(a) - \mu_a| \le \sqrt{\frac{K}{2N}\log(2K/\delta)} := \epsilon$$

Step 2. Regret for the exploration phase (each instantaneous regret is at most $2$ since $|\mu_a| \le 1$):

$$R_{1:N} = \frac{N}{K}\sum_{a\in[K]}\Bigl(\max_{a'}\mu_{a'} - \mu_a\Bigr) \le 2N$$

Step 3. Regret for the exploitation phase, where $A_t \equiv \hat{a} := \arg\max_a \hat{Q}(a)$ and $a^\star := \arg\max_a \mu_a$:

$$R_{N+1:T} \le (T-N)(\mu_{a^\star} - \mu_{\hat{a}}) = (T-N)\bigl[\mu_{a^\star} - \hat{Q}(a^\star) + \hat{Q}(a^\star) - \hat{Q}(\hat{a}) + \hat{Q}(\hat{a}) - \mu_{\hat{a}}\bigr] \le (T-N)[\epsilon + 0 + \epsilon]$$

Step 4. The total regret is

$$R_T \le 2N + 2T\sqrt{\frac{K}{2N}\log(2K/\delta)} = O\Bigl(T^{2/3}K^{1/3}\log^{1/3}(2K/\delta)\Bigr),$$

where the last step chooses $N = T^{2/3}K^{1/3}(\log(2K/\delta))^{1/3}$.
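
To see where this choice of $N$ comes from, one can balance the two terms in the bound above (a short worked step, writing $L := \log(2K/\delta)$):

```latex
\begin{align*}
R_T &\le 2N + 2T\sqrt{\tfrac{KL}{2N}},
  \qquad \text{choose } N = T^{2/3}K^{1/3}L^{1/3}: \\
2N &= 2\,T^{2/3}K^{1/3}L^{1/3}, \\
2T\sqrt{\tfrac{KL}{2N}}
  &= \sqrt{2}\,T\,(KL)^{1/2}\bigl(T^{2/3}K^{1/3}L^{1/3}\bigr)^{-1/2}
   = \sqrt{2}\,T^{2/3}K^{1/3}L^{1/3},
\end{align*}
```

so both terms are of order $T^{2/3}K^{1/3}L^{1/3}$, which gives the stated rate.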


1.2 Epsilon-greedy algorithm

Algorithm 2: Epsilon-greedy

The strategy at each round $t$ is:

  • With probability $\epsilon$, choose an action uniformly at random;

  • With probability $1-\epsilon$, select

$$A_t := \arg\max_{a\in[K]} \hat{Q}_t(a),$$

where $\hat{Q}_t$ is defined as in Algorithm 1.

It can be shown that the regret is bounded by

$$\underbrace{\epsilon T}_{\text{Exploration}} + \underbrace{\sum_{t=1}^T C\sqrt{\frac{K}{\epsilon t}}}_{\text{Exploitation}},$$

and choosing $\epsilon = K^{1/3}T^{-1/3}$ gives regret $\tilde{O}(T^{2/3}K^{1/3})$.
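
A minimal simulation sketch of epsilon-greedy, under the same illustrative assumptions as the sketch above (Gaussian noise; each arm is forced once at the start so that $\hat{Q}_t$ is well defined, an implementation detail not spelled out in the text):

```python
import numpy as np

def epsilon_greedy(means, T, eps, rng=None):
    """Epsilon-greedy on a K-armed bandit with Gaussian noise (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)
    totals = np.zeros(K)
    pulls = np.zeros(K)
    regret = 0.0

    for t in range(T):
        if t < K:                        # force one pull of each arm first
            a = t
        elif rng.random() < eps:         # explore: uniform random arm
            a = int(rng.integers(K))
        else:                            # exploit: empirically best arm
            a = int(np.argmax(totals / pulls))
        totals[a] += means[a] + rng.normal()
        pulls[a] += 1
        regret += means.max() - means[a]
    return regret

mu, T = [0.1, 0.3, 0.5, 0.7], 10_000
eps = (len(mu) / T) ** (1 / 3)           # eps = K^{1/3} T^{-1/3} from the analysis
print(epsilon_greedy(mu, T, eps))
```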


1.3 Upper Confidence Bound (UCB) Algorithm

Optimism in the face of uncertainty: UCB

Algorithm 3: UCB

  • Play each action $a \in [K]$ once (in total $K$ steps);
  • For $t = K+1, \ldots, T$:
    • Choose $A_t := \arg\max_a \bar{Q}_t(a)$, where $\bar{Q}_t(a) := \hat{Q}_t(a) + \sqrt{\dfrac{2\log(2KT/\delta)}{N_t(a)}}$;

    • with $\hat{Q}_t(a) = \dfrac{1}{N_t(a)}\Bigl(R_a + \sum_{i=K+1}^{t-1}\mathbf{1}[A_i=a]R_i\Bigr)$ and $N_t(a) = \sum_{i=1}^{t-1}\mathbf{1}[A_i=a]$, where $R_a$ denotes the reward from the initial pull of arm $a$. (A short simulation sketch is given below.)
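
A minimal simulation sketch of Algorithm 3, under the same illustrative assumptions as the earlier sketches (Gaussian noise; the true means are passed in only to simulate rewards and to measure regret):

```python
import numpy as np

def ucb(means, T, delta=0.05, rng=None):
    """UCB on a K-armed bandit with Gaussian noise (illustrative sketch of Algorithm 3)."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)
    totals = np.zeros(K)
    pulls = np.zeros(K)
    regret = 0.0
    log_term = 2.0 * np.log(2 * K * T / delta)   # the 2*log(2KT/delta) inside the bonus

    for t in range(T):
        if t < K:                                # play each arm once
            a = t
        else:                                    # maximize Q-hat(a) + bonus(a)
            index = totals / pulls + np.sqrt(log_term / pulls)
            a = int(np.argmax(index))
        totals[a] += means[a] + rng.normal()
        pulls[a] += 1
        regret += means.max() - means[a]
    return regret

print(ucb([0.1, 0.3, 0.5, 0.7], T=10_000))
```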

Regret analysis: non-adaptive bound

By the Azuma-Hoeffding inequality and a union bound, w.p. $1-\delta$,

$$\left|R_a - \mu_a + \sum_{i=K+1}^{t-1}\mathbf{1}[A_i=a](R_i - \mu_a)\right| \le \sqrt{2N_t(a)\log(2KT/\delta)}, \quad \forall a\in[K],\ \forall t\in[K+1,T]$$

Note that the above is equivalent to

$$\sup_{t,a}\frac{1}{\sqrt{N_t(a)}}\left|R_a-\mu_a+\sum_{i=K+1}^{t-1}\mathbf{1}[A_i=a](R_i-\mu_a)\right|\le\sqrt{2\log(2KT/\delta)},$$

i.e., $\sup_{t,a}\sqrt{N_t(a)}\,\bigl|\hat{Q}_t(a)-\mu_a\bigr|\le\sqrt{2\log(2KT/\delta)}$, i.e.,

$$\bigl|\hat{Q}_t(a)-\mu_a\bigr|\le\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}},\quad\forall a\in[K],\ \forall t.$$

Recall the UCB index $\bar{Q}_t(a) = \hat{Q}_t(a) + \sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}}$. Then at each time $t$,

$$\mu_{a^\star} - \mu_{A_t} = \mu_{a^\star} - \bar{Q}_t(a^\star) + \bar{Q}_t(a^\star) - \bar{Q}_t(A_t) + \bar{Q}_t(A_t) - \mu_{A_t} \le 0 + 0 + 2\sqrt{\frac{2\log(2KT/\delta)}{N_t(A_t)}},$$

where the first term is $\le 0$ by optimism ($\bar{Q}_t(a^\star) \ge \mu_{a^\star}$ on the concentration event), the second term is $\le 0$ by the UCB selection rule, and the third term is at most twice the bonus by the concentration bound. Then the total regret is bounded by

$$\begin{aligned}
R_T &= R_{1:K} + R_{K+1:T} \le 2K + \sum_{t=K+1}^T(\mu_{a^\star}-\mu_{A_t}) \le 2K + \sum_{t=K+1}^T 2\sqrt{\frac{2\log(2KT/\delta)}{N_t(A_t)}}\\
&\le 2K + 2\sqrt{2\log(2KT/\delta)}\sum_{a=1}^K\sum_{i=1}^{N_T(a)}\frac{1}{\sqrt i} \le 2K + 4\sqrt{2\log(2KT/\delta)}\sum_{a=1}^K\sqrt{N_T(a)}\\
&\overset{(*)}{\le} 2K + 4\sqrt{2\log(2KT/\delta)}\sqrt{K\sum_{a=1}^K N_T(a)} \le 2K + 4\sqrt{2KT\log(2KT/\delta)},
\end{aligned}$$

where $(*)$ is by Cauchy-Schwarz.
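
The regrouping-by-arm step and the $\sum_i i^{-1/2}$ bound used above, spelled out for completeness (a standard calculation, not from the original text):

```latex
\begin{align*}
\sum_{t=K+1}^{T}\frac{1}{\sqrt{N_t(A_t)}}
  &\le \sum_{a=1}^{K}\sum_{i=1}^{N_T(a)}\frac{1}{\sqrt{i}}
     && \text{(group rounds by the arm pulled; the $j$-th pull of $a$ after initialization has $N_t(a)=j$)}\\
  &\le \sum_{a=1}^{K} 2\sqrt{N_T(a)},
     && \text{since } \sum_{i=1}^{n}\frac{1}{\sqrt{i}} \le \int_{0}^{n}\frac{dx}{\sqrt{x}} = 2\sqrt{n}.
\end{align*}
```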

Regret analysis: gap-dependent expression

Define the gap $\Delta_a := \mu_{a^\star} - \mu_a$. By the concentration result, on the same event we have

$$\bar{Q}_t(a) \le \mu_a + 2\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}} = \mu_{a^\star} - \Delta_a + 2\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}} \;\Longrightarrow\; \mu_{a^\star} - \bar{Q}_t(a) \ge \Delta_a - 2\sqrt{\frac{2\log(2KT/\delta)}{N_t(a)}}.$$

Note that once $\mu_{a^\star} - \bar{Q}_t(a) \ge 0$, arm $a$ will never be played again (since $\bar{Q}_t(a^\star) \ge \mu_{a^\star} \ge \bar{Q}_t(a)$ by optimism, and $\bar{Q}_t(a)$ stays fixed while $a$ is not played), so $N_t(a)$ no longer changes. In particular, once $N_t(a) > \frac{8\log(2KT/\delta)}{\Delta_a^2}$ the right-hand side above becomes positive and arm $a$ is never pulled again. Therefore, writing $N_{T+1}(a)$ for the total number of pulls of a suboptimal arm $a$,

$$N_{T+1}(a) \le \frac{8\log(2KT/\delta)}{\Delta_a^2} + 1.$$

The former non-adaptive bound can then be replaced by the gap-dependent bound

$$R_T \le \sum_{a=1}^K \Delta_a + O\!\left(\sum_{a\ne a^\star}\frac{\log(KT/\delta)}{\Delta_a}\right).$$
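
Spelling out the substitution behind this bound (using the per-arm pull counts $N_{T+1}(a)$ from above; the $a = a^\star$ term vanishes since $\Delta_{a^\star} = 0$):

```latex
\begin{align*}
R_T \;=\; \sum_{a\ne a^\star}\Delta_a\,N_{T+1}(a)
    \;\le\; \sum_{a\ne a^\star}\Delta_a\Bigl(1+\frac{8\log(2KT/\delta)}{\Delta_a^{2}}\Bigr)
    \;=\; \sum_{a\ne a^\star}\Delta_a \;+\; \sum_{a\ne a^\star}\frac{8\log(2KT/\delta)}{\Delta_a}.
\end{align*}
```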

Short note: in the non-stochastic bandit setting (e.g., the adversarial setting), there are algorithms (e.g., EXP3) that achieve the same $\tilde{O}(\sqrt{KT})$ regret. See Peter Auer et al. (2001).


2. Linear Bandits

The problem setup

  • The action set is a compact set $\mathcal{D} \subseteq \mathbb{R}^d$, and at each round the agent plays $x_t \in \mathcal{D}$;
  • The reward is linear with i.i.d. mean-zero $\sigma^2$-subgaussian noise:

$$r_t = \mu^\top x_t + \eta_t, \qquad \mathbb{E}[r_t \mid x_t = x] = \mu^\top x \in [-1,1];$$

  • The agent chooses a sequence of actions $x_1, \ldots, x_T$;
  • Let $x^\star \in \arg\max_{x\in\mathcal{D}} \mu^\top x$; the regret is defined as
$$R_T = T\langle\mu, x^\star\rangle - \sum_{t=1}^T \langle\mu, x_t\rangle.$$
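
As a quick sanity check on this setup (an added remark, not from the original post): the K-armed bandit of Section 1 is the special case where the action set is the standard basis of $\mathbb{R}^K$:

```latex
\mathcal{D} = \{e_1,\dots,e_K\}\subset\mathbb{R}^K,\qquad
\langle \mu, e_a\rangle = \mu_a
\;\;\Longrightarrow\;\;
R_T = T\max_a \mu_a - \sum_{t=1}^{T}\mu_{A_t},
```

which matches the MAB regret definition.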

2.1 LinUCB Algorithm

For each time $t$, with the collected data $(x_i, r_i)$ for all $i=1,\ldots,t-1$:

  • Compute the estimate $\hat{\mu}_t$ through ridge regression:
$$\hat{\mu}_t := \arg\min_{\theta}\ \sum_{i=1}^{t-1}(x_i^\top\theta - r_i)^2 + \lambda\|\theta\|_2^2$$

Define $\Sigma_t = \lambda I + \sum_{i=1}^{t-1}x_i x_i^\top$; then $\hat{\mu}_t = \Sigma_t^{-1}\sum_{i=1}^{t-1}r_i x_i$.

  • Construct a high-probability confidence ellipsoid for the parameter:
$$\text{Ball}_t = \bigl\{\mu : (\mu-\hat{\mu}_t)^\top\Sigma_t(\mu-\hat{\mu}_t) \le \beta_t\bigr\}$$
  • Choose the action that maximizes the UCB:
$$x_t = \arg\max_{x\in\mathcal{D}}\ \max_{\mu\in\text{Ball}_t}\ \langle x, \mu\rangle$$

Note: the arg-max computation in LinUCB can be NP-hard for a general action set $\mathcal{D}$, though.
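
A minimal simulation sketch of LinUCB over a finite action set, which sidesteps the NP-hardness issue above. Everything here (Gaussian noise, the fixed confidence radius `beta`, the random action set) is an illustrative assumption; the closed form $\max_{\mu\in\text{Ball}_t}\langle x,\mu\rangle = \langle x,\hat{\mu}_t\rangle + \sqrt{\beta_t\, x^\top\Sigma_t^{-1}x}$ is used for the inner maximization.

```python
import numpy as np

def lin_ucb(actions, mu_star, T, sigma=0.1, lam=1.0, beta=4.0, rng=None):
    """LinUCB over a finite action set (illustrative sketch).

    actions: (n, d) array of candidate actions (a finite D keeps the argmax tractable)
    mu_star: true parameter, used only to simulate rewards and measure regret
    beta: confidence-ellipsoid radius, kept constant here (the theory uses beta_t = O~(sigma^2 d))
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = actions.shape
    Sigma = lam * np.eye(d)              # Sigma_t = lam * I + sum_i x_i x_i^T
    b = np.zeros(d)                      # sum_i r_i x_i
    best = (actions @ mu_star).max()
    regret = 0.0

    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b           # ridge-regression estimate
        # UCB of each action: <x, mu_hat> + sqrt(beta * x^T Sigma^{-1} x)
        widths = np.sqrt(beta * np.einsum("ij,jk,ik->i", actions, Sigma_inv, actions))
        x = actions[int(np.argmax(actions @ mu_hat + widths))]

        r = x @ mu_star + sigma * rng.normal()
        Sigma += np.outer(x, x)
        b += r * x
        regret += best - x @ mu_star
    return regret

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # 50 unit-norm actions in R^5
mu = rng.normal(size=5)
mu /= np.linalg.norm(mu)
print(lin_ucb(A, mu, T=2_000, rng=rng))
```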


Theorem (Upper bound of LinUCB). Choose $\beta_t = \tilde{O}(\sigma^2 d)$, and suppose $\|\mu\| \le W$ and $\|x\| \le B$ for all $x \in \mathcal{D}$. Then, setting $\lambda = \sigma^2/W^2$, w.p. $1-\delta$,

$$R_T \le C\sigma\sqrt{T}\left(d\log\Bigl(1 + \frac{TB^2W^2}{d\sigma^2}\Bigr) + \log(4/\delta)\right).$$

2.2 Analysis of the LinUCB algorithm

The analysis is based on several lemmas. (In this subsection, following AJKS, rounds are indexed $t = 0, 1, \ldots, T-1$, $\Sigma_0 = \lambda I$, and $\text{regret}_t := \langle\mu, x^\star - x_t\rangle$ denotes the instantaneous regret.)

  • Lemma 1 ("Width" of the confidence ball): Let $x\in\mathcal{D}$. If $\mu\in\text{Ball}_t$, then
$$|(\mu - \hat{\mu}_t)^\top x| \le \sqrt{\beta_t\, x^\top\Sigma_t^{-1}x}.$$
  • Lemma 2 (Instantaneous regret is bounded): Fix $t\le T$ and define $w_t := \sqrt{x_t^\top\Sigma_t^{-1}x_t}$. If $\mu\in\text{Ball}_t$, then
$$\text{regret}_t \le 2\min\bigl(\sqrt{\beta_t}\,w_t,\ 1\bigr) \le 2\sqrt{\beta_T}\min(w_t, 1).$$
  • Lemma 3 ("Geometric potential" argument): We have
$$\det\Sigma_T = \det\Sigma_0\prod_{t=0}^{T-1}(1+w_t^2).$$

Note: the proof of this lemma is interesting; see Lemma 5.9 of AJKS.

  • Lemma 4: For any sequence $x_0,\ldots,x_{T-1}$ such that $\|x_t\|_2 \le B$ for all $t < T$,
$$\log(\det\Sigma_T/\det\Sigma_0) = \log\det\Bigl(I + \frac{1}{\lambda}\sum_{t=0}^{T-1}x_tx_t^\top\Bigr) \le d\log\Bigl(1+\frac{TB^2}{d\lambda}\Bigr).$$

Combining Lemmas 1-4: if $\mu\in\text{Ball}_t$ for all $t$, then for LinUCB

$$\sum_{t=0}^{T-1}\text{regret}_t^2 \le 4\beta_T\, d\log\Bigl(1+\frac{TB^2}{d\lambda}\Bigr). \tag{1}$$
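
One way to chain Lemmas 2-4 into (1), tracking only the structure ($\lesssim$ hides the absolute constant, which is what (1) absorbs):

```latex
\begin{align*}
\sum_{t=0}^{T-1}\text{regret}_t^2
  &\le 4\beta_T \sum_{t=0}^{T-1}\min(w_t^2,\,1)
      && \text{(Lemma 2, squared)}\\
  &\lesssim \beta_T \sum_{t=0}^{T-1}\log(1+w_t^2)
      && (\text{since } \min(x,1)\le 2\log(1+x)\ \text{for } x\ge 0)\\
  &= \beta_T\,\log\bigl(\det\Sigma_T/\det\Sigma_0\bigr)
      && \text{(Lemma 3)}\\
  &\le \beta_T\, d\log\Bigl(1+\tfrac{TB^2}{d\lambda}\Bigr)
      && \text{(Lemma 4),}
\end{align*}
```

up to an absolute constant, which is in any case absorbed into $C$ in the theorem.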

Also, it can be shown that for any $\delta > 0$, with the choice of $\beta_t$ in the theorem, $\Pr(\forall t,\ \mu\in\text{Ball}_t) \ge 1-\delta$.

Combining (1) with the above, we can show that w.p. $1-\delta$,

$$R_T = \sum_{t=0}^{T-1}\text{regret}_t \le \sqrt{T\sum_{t=0}^{T-1}\text{regret}_t^2} \le \sqrt{4T\beta_T\, d\log\Bigl(1+\frac{TB^2}{d\lambda}\Bigr)},$$

which finishes the proof (the first inequality is by Cauchy-Schwarz).


The content of this post mainly comes from the Bandit Algorithms book (Lattimore and Szepesvári) and the AJKS monograph (Agarwal, Jiang, Kakade, and Sun, Reinforcement Learning: Theory and Algorithms).