An algorithm that achieves the minimax rate for tabular RL
The discussion is based on our recent preprint on offline RL (linked in the tweet at the end of this post).
Historical data $\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n],\, t\in[H]}$ is obtained by a logging policy $\mu$, and we can only use $\mathcal{D}$ to estimate the value of a target policy $\pi$, i.e. $v^\pi$. Suppose we only assume knowledge of $\pi$ and that $r_t^{(i)} = r_t(s_t^{(i)},a_t^{(i)})$. The goal of the offline learning task is to find an $\epsilon$-optimal policy $\pi_\text{out}$, such that
\[\left\lVert V_1^{\pi^\star}-V_1^{\pi_\text{out}}\right\rVert_\infty<\epsilon.\]
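To make the setup concrete, here is a minimal plug-in (model-based) sketch in Python: it logs episodes with $\mu$, estimates transitions and rewards from counts, and evaluates a target policy $\pi$ by backward induction on the Bellman equation. The toy MDP, the uniform logging policy, and all variable names are illustrative assumptions, not the paper's construction.

```python
# A minimal sketch of the offline setting, assuming a small tabular MDP.
# Everything here (sizes, uniform logging policy, names) is illustrative.
import numpy as np

S, A, H, n = 5, 3, 10, 2000
rng = np.random.default_rng(0)

# Ground-truth MDP (unknown to the learner; used only to generate logged data).
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[t, s, a] = P_t(. | s, a)
r = rng.uniform(size=(H, S, A))                 # r[t, s, a] = r_t(s, a)
mu = np.full((H, S, A), 1.0 / A)                # uniform logging policy mu
pi = rng.dirichlet(np.ones(A), size=(H, S))     # target policy pi_t(. | s)

# Log n episodes with mu: D = {(s_t^(i), a_t^(i), r_t^(i))}.
episodes = []
for _ in range(n):
    traj, s = [], rng.integers(S)
    for t in range(H):
        a = rng.choice(A, p=mu[t, s])
        traj.append((s, a, r[t, s, a]))
        s = rng.choice(S, p=P[t, s, a])
    episodes.append(traj)

# Plug-in evaluation of pi: estimate P_t and r_t from counts, then run the
# Bellman equation backward: Q_t = r_t + P_t^T V_{t+1}.
P_hat = np.zeros((H, S, A, S))
r_hat = np.zeros((H, S, A))
count = np.zeros((H, S, A))
for traj in episodes:
    for t, (s, a, rew) in enumerate(traj):
        count[t, s, a] += 1
        r_hat[t, s, a] += rew
        if t + 1 < H:
            P_hat[t, s, a, traj[t + 1][0]] += 1
nz = count > 0                        # unvisited (t, s, a) pairs default to zero
r_hat[nz] /= count[nz]
P_hat[nz] /= count[nz][:, None]

V = np.zeros(S)
for t in reversed(range(H)):
    Q = r_hat[t] + P_hat[t] @ V       # Q_t(s, a)
    V = (pi[t] * Q).sum(axis=1)       # V_t^pi(s)
print("plug-in estimate of V_1^pi:", V)
```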
In the case of policy optimization, variance reduction (VR) is an algorithm that approximately iterates the Bellman optimality equation: an inner loop performs an approximate value (or Q-value) iteration using fresh data to estimate $V^\star$, and an outer loop performs multiple rounds of such iterations to refine the estimates. Concretely, to obtain a reliable $Q_t(s,a)$ for some step $t\in[H]$, by the Bellman equation $Q_t(s,a)=r_t(s,a)+P_t^\top(\cdot \mid s,a)V_{t+1}$, we need to estimate $P_t^\top(\cdot\mid s,a)V_{t+1}$ with sufficient accuracy. VR handles this by decomposing:
\begin{equation}\label{eq:VR_decomposition} P_t^\top(\cdot\mid s,a)V_{t+1} = P_t^\top(\cdot\mid s,a)\left(V_{t+1}-V_{t+1}^{\text{in}}\right) + P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}, \end{equation}
where $V_{t+1}^{\text{in}}$ is a reference value function obtained from an earlier stage of the computation, and the two terms $P_t^\top(\cdot\mid s,a)(V_{t+1}-V_{t+1}^{\text{in}})$ and $P_t^\top(\cdot\mid s,a)V_{t+1}^{\text{in}}$ are estimated separately at different stages. Because $V_{t+1}$ remains close to the reference $V_{t+1}^{\text{in}}$, the difference term is small in magnitude and can be estimated with far fewer samples, which reduces the effective variance along the learning process.
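The variance-reduction effect is easy to see in a toy simulation. Both estimators below are unbiased for $P_t^\top(\cdot\mid s,a)V_{t+1}$, but the VR estimator's variance scales with the variance of $V_{t+1}-V_{t+1}^{\text{in}}$, which is small when the reference is accurate. The numbers, and the assumption that the reference term is known exactly, are purely illustrative.

```python
# Toy illustration of the decomposition: estimate P_t^T(.|s,a) V_{t+1} by
# Monte Carlo, with and without the reference function V^in. This is a
# sketch of the variance-reduction idea, not the paper's OPDVR algorithm.
import numpy as np

rng = np.random.default_rng(1)
S = 50
p = rng.dirichlet(np.ones(S))            # P_t(. | s, a) for one fixed (s, a)
V = rng.uniform(0, 10, size=S)           # current value estimate V_{t+1}
V_in = V + rng.normal(0, 0.1, size=S)    # reference V^in, close to V

ref_term = p @ V_in   # P^T V^in: assumed estimated accurately in an earlier
                      # stage (taken to be exact here, for illustration)

m, trials = 20, 10000
naive, vr = np.empty(trials), np.empty(trials)
for k in range(trials):
    s_next = rng.choice(S, size=m, p=p)           # m fresh samples s' ~ P(.|s,a)
    naive[k] = V[s_next].mean()                   # direct estimate of P^T V
    vr[k] = (V - V_in)[s_next].mean() + ref_term  # P^T(V - V^in) + P^T V^in

print("true value      :", p @ V)
print("naive  mean/std :", naive.mean(), naive.std())
print("VR     mean/std :", vr.mean(), vr.std())   # much smaller std
```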
In particular, we design the Off-Policy Double Variance Reduction (OPDVR) algorithm to achieve the following:

* Optimal horizon dependence: an $O(H^2/d_m)$ sample complexity on time-homogeneous MDPs.

All of the rates in the paper are minimax optimal in their respective settings! If you are interested, please check out the paper for the full set of results.
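To tie the pieces together, here is a schematic, runnable sketch of the outer/inner loop structure described above. It is not the paper's OPDVR pseudocode: for clarity it samples next states from the true model (a generative-model stand-in for splitting the offline dataset across stages), and the batch sizes and epoch count are arbitrary assumptions.

```python
# Schematic sketch of variance-reduced value iteration with an outer loop.
# Each outer epoch takes the current estimate as the new reference V^in; the
# inner backward pass re-estimates Q_t via P^T(V - V^in) + P^T V^in, where
# the reference term uses a large batch and the difference term a small one.
import numpy as np

rng = np.random.default_rng(2)
S, A, H = 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # true transitions (toy model)
r = rng.uniform(size=(H, S, A))                 # true rewards

def sample_mean_PV(t, f, m):
    """Monte Carlo estimate of P_t^T(.|s,a) f using m next-state draws per (s,a)."""
    out = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            s_next = rng.choice(S, size=m, p=P[t, s, a])
            out[s, a] = f[s_next].mean()
    return out

V_in = [np.zeros(S) for _ in range(H + 1)]       # reference values; V_{H+1} = 0
for epoch in range(4):                           # outer loop: refine the reference
    # Stage 1: estimate the reference term P^T V^in with a large batch.
    ref = [sample_mean_PV(t, V_in[t + 1], m=400) for t in range(H)]
    V = [np.zeros(S) for _ in range(H + 1)]
    for t in reversed(range(H)):                 # inner loop: backward induction
        # Stage 2: the difference term needs only a small fresh batch,
        # because |V - V^in| shrinks as the epochs proceed.
        diff = sample_mean_PV(t, V[t + 1] - V_in[t + 1], m=40)
        Q = r[t] + diff + ref[t]
        V[t] = Q.max(axis=1)                     # greedy step (policy optimization)
    V_in = V                                     # refined reference for next epoch

print("estimated optimal values V_1:", V_in[0])
```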
My wonderful collaborator Yu Bai also shared this on Twitter:
> New preprint on offline RL: https://t.co/2vv2KLA1TF
>
> * A variance reduction algorithm for offline RL
> * Optimal horizon dependence: O(H^2/d_m) sample complexity on time-homogeneous MDPs
>
> Joint w/ Ming Yin (@MingYin_0312) and Yu-Xiang Wang
>
> — Yu Bai (@yubai01) February 8, 2021