Variance Reduction Technique for Optimal Offline RL

An algorithm that achieves the minimax rate for tabular offline RL

The discussion is based on our paper.

Brief background on offline learning

Historical data $\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n]}^{t\in[H]}$ was collected by a logging (behavior) policy $\mu$, and we can only use $\mathcal{D}$ to estimate the value of a target policy $\pi$, i.e. $v^\pi$. Suppose we only assume knowledge of $\pi$ and that the rewards satisfy $r_t^{(i)} = r_t(s_t^{(i)},a_t^{(i)})$. The goal of the offline learning task is to find an $\epsilon$-optimal policy $\pi_\text{out}$ such that

\[\left\lVert V_1^{\pi^\star}-V_1^{\pi_\text{out}}\right\rVert_\infty<\epsilon.\]
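To make the setup concrete, here is a minimal sketch of the naive model-based (plug-in) approach in the tabular setting: build an empirical model $\hat{P}_t,\hat{r}_t$ from $\mathcal{D}$ and run backward induction under $\pi$. The helper names (`build_empirical_model`, `evaluate_policy`) and the dataset layout are illustrative assumptions, and this baseline is only a starting point, not the algorithm discussed below.

```python
import numpy as np

def build_empirical_model(dataset, S, A, H):
    """Empirical transitions hat{P}_t(s'|s,a) and rewards hat{r}_t(s,a) from
    logged trajectories. `dataset` is a list of trajectories, each a length-H
    list of (s, a, r) tuples with integer states and actions."""
    trans = np.zeros((H, S, A, S))   # transition counts n_t(s, a, s')
    visits = np.zeros((H, S, A))     # visit counts n_t(s, a)
    r_sum = np.zeros((H, S, A))      # accumulated rewards
    for traj in dataset:
        for t, (s, a, r) in enumerate(traj):
            visits[t, s, a] += 1
            r_sum[t, s, a] += r
            if t + 1 < H:
                trans[t, s, a, traj[t + 1][0]] += 1
    P_hat = trans / np.maximum(visits[..., None], 1)   # rows stay zero where (s,a) is unvisited
    r_hat = r_sum / np.maximum(visits, 1)
    return P_hat, r_hat, visits

def evaluate_policy(P_hat, r_hat, pi, H, S):
    """Plug-in estimate of V_t^pi by backward induction; pi is an (H, S) array
    of actions, so pi[t, s] is the action the target policy takes at step t."""
    V = np.zeros((H + 1, S))
    for t in reversed(range(H)):
        for s in range(S):
            a = pi[t, s]
            V[t, s] = r_hat[t, s, a] + P_hat[t, s, a] @ V[t + 1]
    return V[:H]
```

The accuracy of such plug-in estimates is limited by how well the logging policy $\mu$ covers the state-action pairs that $\pi$ visits, which is why the coverage parameter $d_m$ appears in the rates below.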

In particular, our work obtains $\tilde{O}(H^3/d_m\epsilon^2)$ sample complexity, where $d_m$ is the minimal non-zero marginal state-action visitation probability under $\mu$, and further tightens the result to $\tilde{O}(H^2/d_m\epsilon^2)$ via a variance-reduction-based algorithm.

A brief review of Variance Reduction for RL

In the case of policy optimization, VR is an algorithm that approximately iterates the Bellman optimality equation: an inner loop performs an approximate value (or Q-value) iteration using fresh interactive data to estimate $V^\star$, and an outer loop performs multiple rounds of such iterations to refine the estimates. Concretely, to obtain a reliable $Q_t(s,a)$ for some step $t\in[H]$, by the Bellman equation $Q_t(s,a)=r(s,a)+P_t^\top(\cdot \mid s,a)V_{t+1}$, we need to estimate $P_t^\top(\cdot\mid s,a)V_{t+1}$ with sufficient accuracy. VR handles this by decomposing:

\begin{equation}\label{eq:VR_decomposition} P_t^\top(\cdot\mid s,a)V_{t+1} = P_t^\top(\cdot\mid s,a)\left(V_{t+1}-V_{t+1}^{\text{in}}\right) + P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}, \end{equation}

where $V_{t+1}^{\text{in}}$ is a reference value function obtained from an earlier stage of the computation, and the two terms $P_t^\top(\cdot\mid s,a)(V_{t+1}-V_{t+1}^{\text{in}})$ and $P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}$ are estimated separately at different stages. The reference term is estimated once to high accuracy with a large batch of data, while the correction term involves the difference $V_{t+1}-V_{t+1}^{\text{in}}$, which is small in magnitude (and hence in variance) once $V_{t+1}^{\text{in}}$ is reasonably accurate, so it can be estimated with far fewer samples. This is what reduces the effective variance along the learning process.
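As a minimal illustration (not the actual OPDVR procedure), the sketch below shows how the two terms of the decomposition above could be estimated from sampled next states $s'\sim P_t(\cdot\mid s,a)$ for a single $(t,s,a)$: the reference term from a large batch, and the low-magnitude correction term from a small fresh batch. The function name and batch arguments are illustrative assumptions.

```python
import numpy as np

def vr_backup(r, next_states_ref, next_states_fresh, V_in, V_cur):
    """Variance-reduced estimate of r_t(s,a) + P_t^T(.|s,a) V_{t+1} for one (t,s,a).

    next_states_ref   : large batch of sampled next states, for the reference term
    next_states_fresh : small fresh batch, for the correction term
    V_in              : reference value function V_{t+1}^in (array over states)
    V_cur             : current iterate V_{t+1} (array over states)
    """
    # Reference term P^T V_in: estimated once, to high accuracy, with the large batch.
    ref_term = np.mean(V_in[next_states_ref])
    # Correction term P^T (V_{t+1} - V_in): the difference is small in magnitude
    # once V_in is reasonably accurate, so its empirical mean has low variance
    # even with a small batch.
    correction = np.mean(V_cur[next_states_fresh] - V_in[next_states_fresh])
    return r + ref_term + correction
```

In the offline setting there is no fresh interaction, so roughly speaking the "fresh" samples are obtained by splitting the logged dataset across stages; the sketch above only illustrates why the decomposition reduces variance.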


Highlights of our results

In particular, we design the Off-Policy Double Variance Reduction (OPDVR) algorithm, which achieves the rates above in the non-stationary and stationary settings.

All of the above are minimax optimal in their respective settings! If you are interested, please check our paper for a reference.


Miscellaneous

My nice collaborator also shared this work on Twitter.