Surprisingly, the Monte Carlo on-policy estimator is actually statistically inefficient.
Historical data \(\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n]}^{t\in[H]}\) was obtained by the logging policy \(\mu\), and we can only use \(\mathcal{D}\) to estimate the value of the target policy \(\pi\), i.e. \(v^\pi\). Suppose we only assume knowledge of \(\pi\) and never observe the reward function \(r_t(s_t,a_t)\) itself; all we see is the noisy immediate reward \(r_t^{(i)}\) received after observing \(s_t^{(i)},a_t^{(i)}\). The goal is to find an estimator \(\hat{v}^\pi\) that minimizes the mean-squared error (MSE):
\[MSE(\pi,\mu,M)=\mathbb{E}_\mu\left[(\hat{v}^\pi-v^\pi)^2\right].\]
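To make this concrete, here is a minimal Python sketch of the setup (the toy MDP, the sizes, and the helper names `log_data` / `true_value` are illustrative assumptions, not anything from the source): it logs a dataset under the behavior policy and computes the true \(v^\pi\) by backward induction, so the MSE of any estimator can be approximated by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, n = 5, 3, 10, 2000            # toy sizes: states, actions, horizon, episodes

# Hypothetical tabular MDP with time-homogeneous transitions P[s, a] (a distribution
# over next states) and mean rewards R[s, a]; mu is the logging policy, pi the target.
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))
mu = rng.dirichlet(np.ones(A), size=S)
pi = rng.dirichlet(np.ones(A), size=S)
s0 = 0                                  # fixed initial state

def log_data(policy, n_episodes):
    """Roll out `policy` for n_episodes and record (s_t, a_t, r_t) per step."""
    data = []
    for _ in range(n_episodes):
        traj, s = [], s0
        for _ in range(H):
            a = rng.choice(A, p=policy[s])
            r = R[s, a] + rng.normal(0.0, 0.1)   # noisy immediate reward
            traj.append((s, a, r))
            s = rng.choice(S, p=P[s, a])
        data.append(traj)
    return data

def true_value(policy):
    """Exact v^pi at s0 via backward induction (used only to measure MSE)."""
    V = np.zeros(S)
    for _ in range(H):
        Q = R + P @ V                    # Q[s, a] = R[s, a] + E[V(s') | s, a]
        V = (policy * Q).sum(axis=1)
    return V[s0]

D = log_data(mu, n)      # the historical dataset, logged by mu
v_pi = true_value(pi)    # ground truth for the target policy
```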
It is known from Theorem 3 of (Jiang & Li, 2016) that the Cramér-Rao lower bound for this problem is
\[\frac{1}{n}\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right].\]Following (Yin & Wang, 2020), we can use the Tabular Marginalized Importance Sampling (TMIS) estimator \(\hat{v}_{\mathrm{TMIS}}^\pi\), which estimates the marginal state-action distributions \(d^\pi_t\) and the mean rewards from \(\mathcal{D}\) and plugs them into \(v^\pi=\sum_{t}\mathbb{E}_{(s_t,a_t)\sim d^\pi_t}[r_t(s_t,a_t)]\). This results in (Theorem 3.1 in Yin & Wang, 2020)
\[\mathbb{E}\left[ (\hat{v}_{\mathrm{TMIS}}^\pi - v^\pi)^2\right]=\frac{1}{n}\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right]+O(\cdot),\]
where $O(\cdot)$ is a higher order term. This further implies
\[\lim_{n\rightarrow\infty} n\cdot \mathbb{E}[ (\hat{v}_{\mathrm{TMIS}}^\pi - v^\pi)^2]=\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right].\]Therefore TMIS is statistically efficient, as it asymptotically matches the Cramér-Rao lower bound above.
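As a rough illustration (a sketch under the hypothetical setup above, not the authors' implementation), the following builds a tabular marginalized importance sampling style estimate: it fits per-step empirical transitions and mean rewards from the logged trajectories, propagates the marginal state-action distribution of \(\pi\) through that empirical model, and accumulates \(\sum_t\sum_{s,a}\hat{d}^\pi_t(s,a)\,\hat{r}_t(s,a)\). The function name and the trajectory format are the hypothetical ones from the earlier sketch.

```python
import numpy as np

def tmis_estimate(data, pi, S, A, H, s0=0):
    """Sketch of a tabular marginalized-IS (plug-in) estimate of v^pi.

    `data` is a list of trajectories, each a list of H tuples (s, a, r), and
    `pi` is the target policy as an (S, A) array of action probabilities.
    """
    # Per-time-step empirical counts, transitions and reward sums.
    counts = np.zeros((H, S, A))
    trans = np.zeros((H, S, A, S))
    rew_sum = np.zeros((H, S, A))
    for traj in data:
        for t, (s, a, r) in enumerate(traj):
            counts[t, s, a] += 1
            rew_sum[t, s, a] += r
            if t + 1 < H:
                s_next = traj[t + 1][0]          # state observed at the next step
                trans[t, s, a, s_next] += 1

    seen = counts > 0
    r_hat = np.where(seen, rew_sum / np.maximum(counts, 1), 0.0)
    P_hat = np.zeros_like(trans)
    P_hat[seen] = trans[seen] / counts[seen][:, None]   # unseen (s, a): zero mass

    # Propagate the marginal state-action distribution of pi through the
    # empirical model and accumulate sum_t sum_{s,a} d_hat^pi_t(s,a) * r_hat_t(s,a).
    d_state = np.zeros(S)
    d_state[s0] = 1.0
    v_hat = 0.0
    for t in range(H):
        d_sa = d_state[:, None] * pi
        v_hat += (d_sa * r_hat[t]).sum()
        d_state = np.einsum('sa,sap->p', d_sa, P_hat[t])
    return v_hat
```

With the toy setup, `tmis_estimate(D, pi, S, A, H)` can be compared against `true_value(pi)` over repeated replications to approximate its MSE.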
To understand this, note that the on-policy estimator (computed from trajectories collected by running \(\pi\) itself) uses
\[\hat{v}^\pi_\textbf{on}=\frac{1}{n}\sum_{t=0}^H\sum_{i=1}^n r_t^{(i)},\]and the variance of $\hat{v}^\pi_\textbf{on}$ is $\frac{1}{n}\mathrm{Var}_\pi\big[\sum_{t=0}^H r_t\big]$, which decomposes across time steps by the law of total variance (Lemma 3.4 of Yin & Wang, 2020).
Hence the asymptotic variance of $\hat{v}^\pi_\textbf{on}$ satisfies
\[\lim_{n\rightarrow\infty} n\cdot \mathrm{Var}(\hat{v}^\pi_\textbf{on})= \sum_{t=1}^H \Big(\mathbb{E}_\pi\left[ \mathrm{Var}\left[r_t+V^\pi_{t+1}(s_{t+1}) \middle|s_t,a_t\right] \right] + \mathbb{E}_\pi\left[ \mathrm{Var}\left[ \mathbb{E}_\pi[r_t+V^\pi_{t+1}(s_{t+1}) | s_t, a_t] \middle|s_t\right] \right]\Big),\]which is greater than the on-policy ($\pi=\mu$) lower bound
\[\sum_{t=0}^{H}\mathbb{E}_\pi\left[ \mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right]!\]The gap is the second set of terms above, i.e. the variance of $\mathbb{E}_\pi[r_t+V^\pi_{t+1}(s_{t+1})\mid s_t,a_t]$ over the actions sampled at each state, which an estimator that knows $\pi$ exactly does not need to pay.
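For comparison, the Monte Carlo on-policy estimator is simply the average return over trajectories generated by running \(\pi\) itself. A minimal sketch, again reusing the hypothetical helpers from above, with a commented-out rough check of the gap in the on-policy case \(\mu=\pi\):

```python
import numpy as np

def on_policy_mc(data_pi):
    """Monte Carlo on-policy estimate: the average empirical return sum_t r_t."""
    returns = [sum(r for (_, _, r) in traj) for traj in data_pi]
    return float(np.mean(returns))

# Rough sanity check (reuses the hypothetical log_data / tmis_estimate / true_value
# sketches above): with mu = pi both estimators see on-policy data, and n * MSE over
# replications can be compared against the two asymptotic expressions in the text.
# errs_mc, errs_tmis = [], []
# for _ in range(200):
#     data = log_data(pi, n)
#     errs_mc.append(on_policy_mc(data) - v_pi)
#     errs_tmis.append(tmis_estimate(data, pi, S, A, H) - v_pi)
# print(n * np.mean(np.square(errs_mc)), n * np.mean(np.square(errs_tmis)))
```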