A Brief Summary of Upper Bounds for Bandit Problems
This post summarizes the regret analysis for multi-armed bandit problems and linear bandit problems. Specifically, the exploration-first and epsilon-greedy algorithms achieve $\tilde{O}(T^{2/3})$ regret, while UCB achieves $\tilde{O}(\sqrt{KT})$ regret and LinUCB achieves $\tilde{O}(d\sqrt{T})$ regret.
1. Multi-armed bandit (MAB) problems
In MAB, we have $K$ actions and a horizon of $T$ rounds, where each action $a \in [K]$ has an unknown mean reward $\mu(a)$.
Every iteration $t = 1, \dots, T$, the agent picks an action $a_t \in [K]$ and observes a reward $r_t$ with $\mathbb{E}[r_t \mid a_t] = \mu(a_t)$.
The goal is to minimize the regret
$$R(T) = T\,\mu(a^\star) - \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big], \qquad a^\star = \arg\max_{a \in [K]} \mu(a).$$
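To make the setup concrete, here is a minimal Python sketch of a Bernoulli bandit environment. The class name and `pull` interface are my own illustrative choices (not from the cited sources), and they are reused by the sketches in the following sections.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit where arm a yields a 0/1 reward with mean mu[a]."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu)                 # true (unknown) mean rewards
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Return a Bernoulli reward with expectation mu[a].
        return float(self.rng.random() < self.mu[a])
```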
1.1 Exploration-first algorithm
Algorithm 1: Exploration-first
- Spend the first $N$ steps exploring, where each action is played $N/K$ times. The corresponding estimate for each action $a$ is:
$$\hat{\mu}(a) = \frac{K}{N} \sum_{t \le N:\, a_t = a} r_t.$$
- For $t = N+1, \dots, T$: play $a_t = \arg\max_{a \in [K]} \hat{\mu}(a)$.
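A minimal runnable sketch of Algorithm 1, assuming the toy `BernoulliBandit` environment above and that $N$ is a multiple of $K$ with $N \ge K$:

```python
def explore_first(env, K, T, N):
    """Exploration-first: round-robin for N steps, then commit to the best estimate."""
    counts, sums = np.zeros(K), np.zeros(K)
    total_reward = 0.0
    for t in range(T):
        if t < N:
            a = t % K                              # uniform round-robin exploration
        else:
            a = int(np.argmax(sums / counts))      # greedy exploitation thereafter
        r = env.pull(a)
        counts[a] += 1
        sums[a] += r
        total_reward += r
    return total_reward
```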
Analysis of the Exploration-first algorithm
By Hoeffding's inequality and a union bound, with probability at least $1-\delta$, every estimate satisfies $|\hat{\mu}(a) - \mu(a)| \le \sqrt{\frac{K \log(2K/\delta)}{2N}}$ (each arm is played $N/K$ times). Each exploitation step then incurs regret at most twice this width, so
$$R(T) \le N + 2T\sqrt{\frac{K\log(2K/\delta)}{2N}} = O\Big(T^{2/3}\big(K\log(K/\delta)\big)^{1/3}\Big),$$
where the last equality chooses $N = T^{2/3}\big(K\log(2K/\delta)\big)^{1/3}$.
1.2 Epsilon-greedy algorithm
Algorithm 2: Epsilon-greedy
Let the strategy at time $t$ be:
- With probability $\epsilon_t$, choose the action uniformly at random;
- With probability $1 - \epsilon_t$, select $a_t = \arg\max_{a \in [K]} \hat{\mu}_t(a)$,

where $\hat{\mu}_t(a)$ is the empirical mean of the rewards observed from action $a$ so far:
$$\hat{\mu}_t(a) = \frac{1}{n_t(a)} \sum_{s < t:\, a_s = a} r_s, \qquad n_t(a) = \sum_{s < t} \mathbf{1}\{a_s = a\}.$$
It can be shown the regret bound is $R(t) \le O\big(t^{2/3} (K \log t)^{1/3}\big)$ if we choose $\epsilon_t = t^{-1/3} (K \log t)^{1/3}$.
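A sketch of Algorithm 2 with the decaying schedule above (capped at 1), again using the toy environment from Section 1; the `seed` argument is an illustrative addition:

```python
def epsilon_greedy(env, K, T, seed=1):
    """Epsilon-greedy with decaying exploration rate eps_t = (K log t / t)^(1/3)."""
    counts, sums = np.zeros(K), np.zeros(K)
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        # log(t + 1) avoids log(1) = 0 at the first round.
        eps = min(1.0, (K * np.log(t + 1) / t) ** (1.0 / 3.0))
        if rng.random() < eps:
            a = int(rng.integers(K))                            # explore uniformly
        else:
            a = int(np.argmax(sums / np.maximum(counts, 1)))    # exploit empirical mean
        r = env.pull(a)
        counts[a] += 1
        sums[a] += r
```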
1.3 Upper Confidence Bound (UCB) Algorithm
Algorithm 3: UCB
- Play each action once (in total $K$ steps);
- For $t = K+1, \dots, T$:
  - Choose
$$a_t = \arg\max_{a \in [K]} \mathrm{UCB}_t(a),$$
  - where
$$\mathrm{UCB}_t(a) = \hat{\mu}_t(a) + \sqrt{\frac{2\log(2TK/\delta)}{n_t(a)}}, \qquad n_t(a) = \sum_{s < t} \mathbf{1}\{a_s = a\}.$$
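A sketch of Algorithm 3 under the same toy environment; `delta` is the failure probability in the confidence bonus:

```python
def ucb(env, K, T, delta=0.05):
    """UCB: initialize by playing each arm once, then maximize mean + bonus."""
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(T):
        if t < K:
            a = t                                              # play each arm once
        else:
            bonus = np.sqrt(2.0 * np.log(2.0 * T * K / delta) / counts)
            a = int(np.argmax(sums / counts + bonus))          # optimistic choice
        r = env.pull(a)
        counts[a] += 1
        sums[a] += r
```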
Regret analysis: non-adaptive bound
By Azuma-Hoeffding's inequality and a union bound, with probability at least $1-\delta$, for all $t$ and all $a$:
$$|\hat{\mu}_t(a) - \mu(a)| \le \sqrt{\frac{2\log(2TK/\delta)}{n_t(a)}}.$$
Note the above is equivalent to
$$\mu(a) \le \mathrm{UCB}_t(a) \le \mu(a) + 2\sqrt{\frac{2\log(2TK/\delta)}{n_t(a)}}.$$
Recall the per-step regret decomposition
$$\mu(a^\star) - \mu(a_t) \le \mathrm{UCB}_t(a^\star) - \mu(a_t) \le \mathrm{UCB}_t(a_t) - \mu(a_t) \le 2\sqrt{\frac{2\log(2TK/\delta)}{n_t(a_t)}},$$
which uses optimism ($\mu(a^\star) \le \mathrm{UCB}_t(a^\star)$) and the UCB rule ($\mathrm{UCB}_t(a^\star) \le \mathrm{UCB}_t(a_t)$). Then the total regret is bounded by
$$R(T) \le K + \sum_{t=K+1}^{T} 2\sqrt{\frac{2\log(2TK/\delta)}{n_t(a_t)}} \le K + O\big(\sqrt{KT\log(TK/\delta)}\big).$$
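The last inequality hides a counting step worth making explicit: grouping the rounds by which arm was pulled and applying Cauchy-Schwarz,
$$\sum_{t=K+1}^{T} \frac{1}{\sqrt{n_t(a_t)}} = \sum_{a=1}^{K} \sum_{j=1}^{n_T(a)} \frac{1}{\sqrt{j}} \le \sum_{a=1}^{K} 2\sqrt{n_T(a)} \le 2\sqrt{K \sum_{a=1}^{K} n_T(a)} = 2\sqrt{KT}.$$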
Regret analysis: gap-dependent expression
Define the gap $\Delta_a := \mu(a^\star) - \mu(a)$ for each action $a$.
Note when a suboptimal action $a$ is chosen at time $t$, the UCB rule forces $\Delta_a \le 2\sqrt{\frac{2\log(2TK/\delta)}{n_t(a)}}$, i.e., $n_t(a) \le \frac{8\log(2TK/\delta)}{\Delta_a^2}$: once an arm has been pulled more than this many times, its confidence interval is too narrow for it to be selected again.
The former non-adaptive bound can be replaced by
$$R(T) \le \sum_{a:\, \Delta_a > 0} \Big( \Delta_a + \frac{8\log(2TK/\delta)}{\Delta_a} \Big) = O\Big( \sum_{a:\, \Delta_a > 0} \frac{\log(TK/\delta)}{\Delta_a} \Big).$$
2. Linear Bandits
The problem setup:
- Action space is a compact set $\mathcal{A} \subset \mathbb{R}^d$;
- Reward is linear with i.i.d. mean-zero $\sigma$-subgaussian noise: $r_t = \langle \theta^\star, a_t \rangle + \eta_t$;
- Agent chooses a sequence of actions $a_1, \dots, a_T$;
- Let $a^\star = \arg\max_{a \in \mathcal{A}} \langle \theta^\star, a \rangle$, then the regret is defined as
$$R(T) = \sum_{t=1}^{T} \langle \theta^\star, a^\star - a_t \rangle.$$
2.1 LinUCB Algorithm
For each time $t = 1, \dots, T$:
- Compute the estimated $\hat{\theta}_t$ through ridge regression:
$$\hat{\theta}_t = \arg\min_{\theta} \sum_{s<t} \big( \langle \theta, a_s \rangle - r_s \big)^2 + \lambda \|\theta\|_2^2 = \Sigma_t^{-1} \sum_{s<t} r_s a_s.$$
Define $\Sigma_t = \lambda I + \sum_{s<t} a_s a_s^\top$.
- Construct a high-probability confidence ellipsoid for the parameter:
$$\mathrm{Ball}_t = \big\{ \theta : \|\theta - \hat{\theta}_t\|_{\Sigma_t}^2 \le \beta_t \big\}.$$
- Choose actions that maximize the UCB:
$$a_t = \arg\max_{a \in \mathcal{A}} \max_{\theta \in \mathrm{Ball}_t} \langle \theta, a \rangle.$$
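A sketch of LinUCB over a finite action set, reusing the NumPy import from Section 1. It assumes a fixed exploration parameter `beta` (rather than the growing schedule $\beta_t$ from the theorem below) and an `env_pull` callable returning the noisy linear reward; both are illustrative choices. The inner maximization has the closed form $\max_{\theta \in \mathrm{Ball}_t} \langle \theta, a \rangle = \langle \hat{\theta}_t, a \rangle + \sqrt{\beta_t}\,\|a\|_{\Sigma_t^{-1}}$, which the code uses directly.

```python
def linucb(env_pull, actions, T, lam=1.0, beta=2.0):
    """LinUCB on a finite action set (the rows of `actions`)."""
    d = actions.shape[1]
    Sigma = lam * np.eye(d)                        # regularized Gram matrix
    b = np.zeros(d)                                # running sum of r_s * a_s
    for t in range(T):
        theta_hat = np.linalg.solve(Sigma, b)      # ridge-regression estimate
        Sigma_inv = np.linalg.inv(Sigma)
        # UCB value per action: <theta_hat, a> + sqrt(beta) * ||a||_{Sigma^{-1}}
        widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, Sigma_inv, actions))
        a = actions[int(np.argmax(actions @ theta_hat + np.sqrt(beta) * widths))]
        r = env_pull(a)                            # observe noisy linear reward
        Sigma += np.outer(a, a)                    # rank-one Gram matrix update
        b += r * a
```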
Theorem (Upper bound of LinUCB) Choose $\beta_t = O\big( \sigma^2 \big( d \log(1 + t/\lambda) + \log(1/\delta) \big) + \lambda \|\theta^\star\|_2^2 \big)$. Then with probability at least $1-\delta$, LinUCB achieves $R(T) \le \tilde{O}\big( d\sqrt{T} \big)$.
2.2 Analysis of the LinUCB algorithm
The analysis is based on several lemmas.
- Lemma 1: "Width" of confidence ball. Let $w_t := \|a_t\|_{\Sigma_t^{-1}}$. If $\theta^\star \in \mathrm{Ball}_t$, then for all $\theta \in \mathrm{Ball}_t$,
$$|\langle \theta - \theta^\star, a_t \rangle| \le 2\sqrt{\beta_t}\, w_t.$$
- Lemma 2: Instantaneous regret is bounded. Fix $t$ and define $\mathrm{regret}_t := \langle \theta^\star, a^\star - a_t \rangle$. If $\theta^\star \in \mathrm{Ball}_t$, then
$$\mathrm{regret}_t \le 2\sqrt{\beta_t} \min(w_t, 1),$$
where the clamp at $1$ uses boundedness of the mean rewards and $\beta_t \ge 1$.
- Lemma 3: "Geometric potential" argument. We have:
$$\det \Sigma_{T+1} = \det(\lambda I) \prod_{t=1}^{T} \big(1 + w_t^2\big).$$
- Lemma 4: For any sequence $x_1, \dots, x_T$ such that $0 \le x_t \le 1$ for all $t$, then
$$\sum_{t=1}^{T} x_t \le 2 \sum_{t=1}^{T} \log(1 + x_t).$$
Combining Lemmas 1-4, we have for LinUCB: if $\theta^\star \in \mathrm{Ball}_t$ for all $t \le T$, then by Cauchy-Schwarz,
$$R(T) = \sum_{t=1}^{T} \mathrm{regret}_t \le \sqrt{T \sum_{t=1}^{T} \mathrm{regret}_t^2} \le \sqrt{4 T \beta_T \sum_{t=1}^{T} \min(w_t^2, 1)} \le \sqrt{8 T \beta_T \log \frac{\det \Sigma_{T+1}}{\det(\lambda I)}} \le \sqrt{8 T \beta_T\, d \log\Big(1 + \frac{T B^2}{d\lambda}\Big)},$$
where $B$ bounds $\|a_t\|_2$ and the last step bounds the determinant via the AM-GM inequality on the eigenvalues of $\Sigma_{T+1}$.
Also, it can be shown (via a self-normalized martingale concentration argument) that for the choice of $\beta_t$ in the theorem, $\theta^\star \in \mathrm{Ball}_t$ holds for all $t$ simultaneously with probability at least $1-\delta$.
Combining the two facts above yields $R(T) \le \tilde{O}(d\sqrt{T})$ with probability at least $1-\delta$, which finishes the proof.
The content of this post mainly comes from the Bandit Algorithms book (Lattimore and Szepesvári) and the AJKS monograph (Agarwal, Jiang, Kakade, and Sun, Reinforcement Learning: Theory and Algorithms).