A model-based framework plus the singleton absorbing MDP technique achieves the optimal rates for several challenging offline RL tasks.
This post explains the main idea of the paper
Historical data $\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n],\,t\in[H]}$ were obtained by a logging policy $\mu$, and we can only use $\mathcal{D}$ to estimate the value of a target policy $\pi$, i.e. $v^\pi$. Suppose we only assume knowledge of $\pi$ and that $r_t^{(i)} = r_t(s_t^{(i)},a_t^{(i)})$. The goal of the offline learning task is to find an $\epsilon$-optimal policy $\pi_\text{out}$ such that
\[\left\lVert V_1^{\pi^\star}-V_1^{\pi_\text{out}}\right\rVert_\infty<\epsilon.\]In particular, the uniform off-policy evaluation (OPE) task asks for estimates that are accurate for all target policies simultaneously, i.e. $\sup_{\pi}\left|\widehat{v}^\pi-v^\pi\right|<\epsilon$.
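To make the setup concrete, here is a minimal sketch of how such a logged dataset could be produced by rolling out the logging policy $\mu$ in a tabular, time-inhomogeneous simulator. The arrays `P`, `R`, the uniform initial distribution, and the function name are hypothetical placeholders for illustration, not objects from the paper.

```python
import numpy as np

def collect_offline_data(P, R, mu, H, n, seed=0):
    """Roll out the logging policy mu for n episodes of length H.

    P  : (H, S, A, S) transition probabilities,  R : (H, S, A) rewards,
    mu : (H, S, A) logging policy (mu[h, s] is a distribution over actions).
    Returns D as a list of n episodes, each a list of (s_t, a_t, r_t) tuples.
    """
    rng = np.random.default_rng(seed)
    _, S, A, _ = P.shape
    D = []
    for _ in range(n):
        s, episode = rng.integers(S), []        # uniform initial state, for simplicity
        for h in range(H):
            a = rng.choice(A, p=mu[h, s])       # action drawn from the logging policy
            episode.append((s, a, R[h, s, a]))
            s = rng.choice(S, p=P[h, s, a])     # environment transition
        D.append(episode)
    return D
```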
For analyzing either the learning or the uniform evaluation task, at some point one needs to apply concentration inequalities (usually Bernstein's inequality, for a sharp result) to the term $(\widehat{P}-P)\widehat{V}^\pi_{t+1}$ for the current estimates. In particular, in offline RL these $\widehat{P}$ are usually constructed via model-based (empirical frequency) estimates, e.g.
\[\widehat{P}_t\left(s^{\prime} \mid s, a\right)=\frac{\sum_{i=1}^{n} \mathbf{1}\left[\left(s_{t+1}^{(i)}, s_{t}^{(i)}, a_{t}^{(i)}\right)=\left(s^{\prime}, s, a\right)\right]}{n_{s_t, a_t}}\]with $n_{s_t,a_t}=\sum_{i=1}^n \mathbf{1}\left[\left(s_t^{(i)},a_t^{(i)}\right)=\left(s,a\right)\right]$. To further improve the dependence on $H$, the time-homogeneous case uses the pooled count $n_{s, a}:=\sum_{i=1}^{n} \sum_{h=1}^{H} \mathbf{1}\left[\left(s_{h}^{(i)}, a_{h}^{(i)}\right)=\left(s, a\right)\right]$ and the estimate
\[\widehat{P}\left(s^{\prime} \mid s, a\right)=\frac{\sum_{i=1}^{n} \sum_{h=1}^{H} \mathbf{1}\left[\left(s_{h+1}^{(i)}, s_{h}^{(i)}, a_{h}^{(i)}\right)=\left(s^{\prime}, s, a\right)\right]}{n_{s, a}}\]In this scenario, $\widehat{P}$ and $\widehat{V}^\pi_{t+1}$ are no longer independent, since $\widehat{P}$ uses samples across different time steps. How can we deal with this?
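As a (hypothetical) illustration, the sketch below computes both the per-step estimate $\widehat{P}_t$ and the pooled time-homogeneous estimate $\widehat{P}$ from a dataset `D` in the episode format above; it also makes visible why the pooled $\widehat{P}$ touches samples from every time step. Variable and function names are my own, not the paper's.

```python
import numpy as np

def empirical_models(D, S, A, H):
    """Model-based (empirical frequency) transition estimates from logged episodes.

    Returns
      P_hat_t : (H, S, A, S) per-step estimates, built from the counts n_{s_t, a_t}
      P_hat   : (S, A, S)    pooled estimate for the time-homogeneous case,
                             built from n_{s, a} summed over episodes and steps
    """
    counts_t = np.zeros((H, S, A, S))
    for episode in D:                       # episode = [(s_1, a_1, r_1), ...]
        for h in range(H - 1):              # the last step has no observed next state
            s, a, _ = episode[h]
            s_next = episode[h + 1][0]
            counts_t[h, s, a, s_next] += 1
    counts = counts_t.sum(axis=0)           # pooling across h couples all time steps
    with np.errstate(invalid="ignore"):     # unvisited (s, a) rows are left as zeros
        P_hat_t = np.nan_to_num(counts_t / counts_t.sum(-1, keepdims=True))
        P_hat = np.nan_to_num(counts / counts.sum(-1, keepdims=True))
    return P_hat_t, P_hat
```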
The visualization can help in understanding the singleton absorbing MDP technique. The upper part illustrates the infinite-horizon case and the bottom part illustrates the finite-horizon case. Strictly speaking, the picture should be an $H$-dimensional hypercube $[0,H]^H$ (which contains $\widehat{V}^\star_1,\ldots,\widehat{V}^\star_H$) instead of only the square $[0,H]\times[0,H]$ (containing $\widehat{V}^\star_1,\widehat{V}^\star_2$); the square is used only for ease of visualization.
The standard absorbing MDP technique resolves this dependence by constructing, for each state $s$ and each value $u$ from a finite cover of $[0,H]$, an auxiliary MDP in which $s$ is absorbing with its value pinned to $u$; the optimal value of this auxiliary MDP no longer depends on the data collected at $s$, at the price of a union bound over the cover of $u$'s. The singleton absorbing MDP technique shows that a single, carefully chosen $u^\star$ suffices.
Now, for each state $s$, we can design a single $\widehat{V}^\star_{h,u^\star}$ such that $\widehat{P}_s$ is independent of $\widehat{V}^\star_{h,u^\star}$, and concentration inequalities apply! (check
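To see the mechanism in code, here is a minimal sketch of the absorbing MDP construction in the discounted (infinite-horizon) setting, where it is simplest to state: the chosen state is rewired to be absorbing with per-step reward $(1-\gamma)u$, so its value is pinned to $u$ and the resulting optimal value function never reads the estimated transition row $\widehat{P}_s$. The discounted formulation and all names below are my own simplification, not the paper's exact finite-horizon construction.

```python
import numpy as np

def absorbing_optimal_values(P_hat, R, s_abs, u, gamma=0.99, tol=1e-8):
    """Optimal values of the absorbing MDP M_{s,u}.

    P_hat : (S, A, S) empirical transitions,  R : (S, A) rewards.
    State s_abs is made absorbing with per-step reward (1 - gamma) * u,
    so V(s_abs) = u exactly and the output is independent of P_hat[s_abs].
    """
    S, A, _ = P_hat.shape
    P = P_hat.copy()
    Rw = R.astype(float).copy()
    P[s_abs] = 0.0
    P[s_abs, :, s_abs] = 1.0               # self-loop: s_abs absorbs
    Rw[s_abs] = (1.0 - gamma) * u          # geometric series from s_abs sums to u
    V = np.zeros(S)
    while True:                            # plain value iteration on the modified model
        V_new = (Rw + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the iteration never touches `P_hat[s_abs]`, the randomness of $\widehat{P}_s$ is decoupled from the resulting value estimate, which is exactly what lets Bernstein-type concentration go through.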
The model-based design, together with the singleton absorbing MDP analysis, achieves the following:
All of the above achieve the minimax rate in their respective settings! If you are interested, please check
One by-product is that the singleton absorbing MDP technique works in the infinite-horizon setting as well! This means the singleton absorbing MDP is not only optimal but also agnostic to the setting.