Variance Reduction Technique for Optimal Offline RL

An algorithm that achieves the minimax rate for tabular offline RL

The discussion is based on our paper.

Brief background on offline learning

Historical data $\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n]}^{t\in[H]}$ was collected by a logging (behavior) policy $\mu$, and we can only use $\mathcal{D}$ to estimate the value of a target policy $\pi$, i.e. $v^\pi$. Suppose we only assume knowledge of $\pi$ and that the rewards satisfy $r_t^{(i)} = r_t(s_t^{(i)},a_t^{(i)})$. The goal of the offline learning task is to find an $\epsilon$-optimal policy $\pi_\text{out}$ such that

\[\left\lVert V_1^{\pi^\star}-V_1^{\pi_\text{out}}\right\rVert_\infty<\epsilon.\]
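To make the setup concrete, here is a minimal sketch of the naive model-based (plug-in) approach in the tabular setting: build an empirical model $\hat{P}_t,\hat{r}_t$ from $\mathcal{D}$ and run backward induction under $\pi$. The helper names (`build_empirical_model`, `evaluate_policy`) and the dataset layout are illustrative assumptions, and this baseline is only a starting point, not the algorithm discussed below.

```python
import numpy as np

def build_empirical_model(dataset, S, A, H):
    """Empirical transitions hat{P}_t(s'|s,a) and rewards hat{r}_t(s,a) from
    logged trajectories. `dataset` is a list of trajectories, each a length-H
    list of (s, a, r) tuples with integer states and actions."""
    trans = np.zeros((H, S, A, S))   # transition counts n_t(s, a, s')
    visits = np.zeros((H, S, A))     # visit counts n_t(s, a)
    r_sum = np.zeros((H, S, A))      # accumulated rewards
    for traj in dataset:
        for t, (s, a, r) in enumerate(traj):
            visits[t, s, a] += 1
            r_sum[t, s, a] += r
            if t + 1 < H:
                trans[t, s, a, traj[t + 1][0]] += 1
    P_hat = trans / np.maximum(visits[..., None], 1)   # rows stay zero where (s,a) is unvisited
    r_hat = r_sum / np.maximum(visits, 1)
    return P_hat, r_hat, visits

def evaluate_policy(P_hat, r_hat, pi, H, S):
    """Plug-in estimate of V_t^pi by backward induction; pi is an (H, S) array
    of actions, so pi[t, s] is the action the target policy takes at step t."""
    V = np.zeros((H + 1, S))
    for t in reversed(range(H)):
        for s in range(S):
            a = pi[t, s]
            V[t, s] = r_hat[t, s, a] + P_hat[t, s, a] @ V[t + 1]
    return V[:H]
```

The accuracy of such plug-in estimates is limited by how well the logging policy $\mu$ covers the state-action pairs that $\pi$ visits, which is why the coverage parameter $d_m$ appears in the rates below.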

In particular, our work obtains $\tilde{O}(H^3/d_m\epsilon^2)$ sample complexity, where $d_m$ is the minimal non-zero marginal state-action visitation probability under $\mu$, and further tightens the result to $\tilde{O}(H^2/d_m\epsilon^2)$ via a variance-reduction-based algorithm.

A brief review of Variance Reduction for RL

In the case of policy optimization, VR is an algorithm that approximately iterates the Bellman optimality equation: an inner loop performs an approximate value (or Q-value) iteration using fresh interactive data to estimate $V^\star$, and an outer loop performs multiple rounds of such iterations to refine the estimates. Concretely, to obtain a reliable $Q_t(s,a)$ for some step $t\in[H]$, by the Bellman equation $Q_t(s,a)=r(s,a)+P_t^\top(\cdot \mid s,a)V_{t+1}$, we need to estimate $P_t^\top(\cdot\mid s,a)V_{t+1}$ with sufficient accuracy. VR handles this by decomposing:

\begin{equation}\label{eq:VR_decomposition} P_t^\top(\cdot\mid s,a)V_{t+1} = P_t^\top(\cdot\mid s,a)\left(V_{t+1}-V_{t+1}^{\text{in}}\right) + P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}, \end{equation}

where $V_{t+1}^{\text{in}}$ is a reference value function obtained from an earlier stage of the computation, and the two terms $P_t^\top(\cdot\mid s,a)(V_{t+1}-V_{t+1}^{\text{in}})$ and $P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}$ are estimated separately at different stages. The reference term is estimated once to high accuracy with a large batch of data, while the correction term involves the difference $V_{t+1}-V_{t+1}^{\text{in}}$, which is small in magnitude (and hence in variance) once $V_{t+1}^{\text{in}}$ is reasonably accurate, so it can be estimated with far fewer samples. This is what reduces the effective variance along the learning process.
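As a minimal illustration (not the actual OPDVR procedure), the sketch below shows how the two terms of the decomposition above could be estimated from sampled next states $s'\sim P_t(\cdot\mid s,a)$ for a single $(t,s,a)$: the reference term from a large batch, and the low-magnitude correction term from a small fresh batch. The function name and batch arguments are illustrative assumptions.

```python
import numpy as np

def vr_backup(r, next_states_ref, next_states_fresh, V_in, V_cur):
    """Variance-reduced estimate of r_t(s,a) + P_t^T(.|s,a) V_{t+1} for one (t,s,a).

    next_states_ref   : large batch of sampled next states, for the reference term
    next_states_fresh : small fresh batch, for the correction term
    V_in              : reference value function V_{t+1}^in (array over states)
    V_cur             : current iterate V_{t+1} (array over states)
    """
    # Reference term P^T V_in: estimated once, to high accuracy, with the large batch.
    ref_term = np.mean(V_in[next_states_ref])
    # Correction term P^T (V_{t+1} - V_in): the difference is small in magnitude
    # once V_in is reasonably accurate, so its empirical mean has low variance
    # even with a small batch.
    correction = np.mean(V_cur[next_states_fresh] - V_in[next_states_fresh])
    return r + ref_term + correction
```

In the offline setting there is no fresh interaction, so roughly speaking the "fresh" samples are obtained by splitting the logged dataset across stages; the sketch above only illustrates why the decomposition reduces variance.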


Highlights of our results

In particular, we design the Off-Policy Double Variance Reduction (OPDVR) algorithm, which achieves the rates above in the non-stationary and stationary settings.

All of the above are minimax optimal in their respective settings! If you are interested, please check our paper for a reference.


Miscellaneous

My nice collaborator also shared this work on Twitter.