Surprisingly, the Monte Carlo on-policy estimator is actually statistically inefficient.
Historical data \(\mathcal{D}=\left\lbrace (s_t^{(i)},a_t^{(i)},r_t^{(i)})\right\rbrace_{i\in[n]}^{t\in[H]}\) was obtained by the logging policy \(\mu\), and we can only use \(\mathcal{D}\) to estimate the value of the target policy \(\pi\), i.e. \(v^\pi\). Suppose we only assume knowledge of \(\pi\) and never observe the reward function \(r_t(s_t,a_t)\) itself; all we see is the noisy immediate reward \(r_t^{(i)}\) received after observing \(s_t^{(i)},a_t^{(i)}\). The goal is to find an estimator \(\hat{v}^\pi\) that minimizes the mean-squared error (MSE):
\[MSE(\pi,\mu,M)=\mathbb{E}_\mu\left[(\hat{v}^\pi-v^\pi)^2\right].\]
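To make this concrete, here is a minimal Python sketch of the setup (the toy MDP, the sizes, and the helper names `log_data` / `true_value` are illustrative assumptions, not anything from the source): it logs a dataset under the behavior policy and computes the true \(v^\pi\) by backward induction, so the MSE of any estimator can be approximated by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, n = 5, 3, 10, 2000            # toy sizes: states, actions, horizon, episodes

# Hypothetical tabular MDP with time-homogeneous transitions P[s, a] (a distribution
# over next states) and mean rewards R[s, a]; mu is the logging policy, pi the target.
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))
mu = rng.dirichlet(np.ones(A), size=S)
pi = rng.dirichlet(np.ones(A), size=S)
s0 = 0                                  # fixed initial state

def log_data(policy, n_episodes):
    """Roll out `policy` for n_episodes and record (s_t, a_t, r_t) per step."""
    data = []
    for _ in range(n_episodes):
        traj, s = [], s0
        for _ in range(H):
            a = rng.choice(A, p=policy[s])
            r = R[s, a] + rng.normal(0.0, 0.1)   # noisy immediate reward
            traj.append((s, a, r))
            s = rng.choice(S, p=P[s, a])
        data.append(traj)
    return data

def true_value(policy):
    """Exact v^pi at s0 via backward induction (used only to measure MSE)."""
    V = np.zeros(S)
    for _ in range(H):
        Q = R + P @ V                    # Q[s, a] = R[s, a] + E[V(s') | s, a]
        V = (policy * Q).sum(axis=1)
    return V[s0]

D = log_data(mu, n)      # the historical dataset, logged by mu
v_pi = true_value(pi)    # ground truth for the target policy
```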
It is known from Theorem 3 of (Jiang & Li, 2016) that the Cramér-Rao lower bound for this problem is
\[\frac{1}{n}\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right].\]Following (Yin & Wang, 2020), we can use the Tabular Marginalized Importance Sampling (TMIS) estimator \(\hat{v}_{\mathrm{TMIS}}^\pi\), which estimates the marginal state-action distributions \(d^\pi_t\) and the mean rewards from \(\mathcal{D}\) and plugs them into \(v^\pi=\sum_{t}\mathbb{E}_{(s_t,a_t)\sim d^\pi_t}[r_t(s_t,a_t)]\). This results in (Theorem 3.1 in Yin & Wang, 2020)
\[\mathbb{E}\left[ (\hat{v}_{\mathrm{TMIS}}^\pi - v^\pi)^2\right]=\frac{1}{n}\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right]+O(\cdot),\]
where $O(\cdot)$ is a higher order term. This further implies
\[\lim_{n\rightarrow\infty} n\cdot \mathbb{E}[ (\hat{v}_{\mathrm{TMIS}}^\pi - v^\pi)^2]=\sum_{t=0}^{H}\mathbb{E}_\mu\left[ \frac{d^\pi(s_t,a_t)^2}{d^\mu(s_t,a_t)^2}\mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right].\]Therefore TMIS is statistically efficient, as it asymptotically matches the Cramér-Rao lower bound above.
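As a rough illustration (a sketch under the hypothetical setup above, not the authors' implementation), the following builds a tabular marginalized importance sampling style estimate: it fits per-step empirical transitions and mean rewards from the logged trajectories, propagates the marginal state-action distribution of \(\pi\) through that empirical model, and accumulates \(\sum_t\sum_{s,a}\hat{d}^\pi_t(s,a)\,\hat{r}_t(s,a)\). The function name and the trajectory format are the hypothetical ones from the earlier sketch.

```python
import numpy as np

def tmis_estimate(data, pi, S, A, H, s0=0):
    """Sketch of a tabular marginalized-IS (plug-in) estimate of v^pi.

    `data` is a list of trajectories, each a list of H tuples (s, a, r), and
    `pi` is the target policy as an (S, A) array of action probabilities.
    """
    # Per-time-step empirical counts, transitions and reward sums.
    counts = np.zeros((H, S, A))
    trans = np.zeros((H, S, A, S))
    rew_sum = np.zeros((H, S, A))
    for traj in data:
        for t, (s, a, r) in enumerate(traj):
            counts[t, s, a] += 1
            rew_sum[t, s, a] += r
            if t + 1 < H:
                s_next = traj[t + 1][0]          # state observed at the next step
                trans[t, s, a, s_next] += 1

    seen = counts > 0
    r_hat = np.where(seen, rew_sum / np.maximum(counts, 1), 0.0)
    P_hat = np.zeros_like(trans)
    P_hat[seen] = trans[seen] / counts[seen][:, None]   # unseen (s, a): zero mass

    # Propagate the marginal state-action distribution of pi through the
    # empirical model and accumulate sum_t sum_{s,a} d_hat^pi_t(s,a) * r_hat_t(s,a).
    d_state = np.zeros(S)
    d_state[s0] = 1.0
    v_hat = 0.0
    for t in range(H):
        d_sa = d_state[:, None] * pi
        v_hat += (d_sa * r_hat[t]).sum()
        d_state = np.einsum('sa,sap->p', d_sa, P_hat[t])
    return v_hat
```

With the toy setup, `tmis_estimate(D, pi, S, A, H)` can be compared against `true_value(pi)` over repeated replications to approximate its MSE.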
To understand this, note that the on-policy estimator (computed from trajectories collected by running \(\pi\) itself) uses
\[\hat{v}^\pi_\textbf{on}=\frac{1}{n}\sum_{t=0}^H\sum_{i=1}^n r_t^{(i)},\]and the variance of $\hat{v}^\pi_\textbf{on}$ is $\frac{1}{n}\mathrm{Var}_\pi\big[\sum_{t=0}^H r_t\big]$, which decomposes across time steps by the law of total variance (Lemma 3.4 of Yin & Wang, 2020).
Hence the asymptotic variance of $\hat{v}^\pi_\textbf{on}$ satisfies
\[\lim_{n\rightarrow\infty} n\cdot \mathrm{Var}(\hat{v}^\pi_\textbf{on})= \sum_{t=1}^H \Big(\mathbb{E}_\pi\left[ \mathrm{Var}\left[r_t+V^\pi_{t+1}(s_{t+1}) \middle|s_t,a_t\right] \right] + \mathbb{E}_\pi\left[ \mathrm{Var}\left[ \mathbb{E}_\pi[r_t+V^\pi_{t+1}(s_{t+1}) | s_t, a_t] \middle|s_t\right] \right]\Big),\]which is greater than the on-policy ($\pi=\mu$) lower bound
\[\sum_{t=0}^{H}\mathbb{E}_\pi\left[ \mathrm{Var}\Big[V_{t+1}^\pi(s_{t+1})+r_t\Big| s_{t}, a_t\Big]\right]!\]The gap is the second set of terms above, i.e. the variance of $\mathbb{E}_\pi[r_t+V^\pi_{t+1}(s_{t+1})\mid s_t,a_t]$ over the actions sampled at each state, which an estimator that knows $\pi$ exactly does not need to pay.
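For comparison, the Monte Carlo on-policy estimator is simply the average return over trajectories generated by running \(\pi\) itself. A minimal sketch, again reusing the hypothetical helpers from above, with a commented-out rough check of the gap in the on-policy case \(\mu=\pi\):

```python
import numpy as np

def on_policy_mc(data_pi):
    """Monte Carlo on-policy estimate: the average empirical return sum_t r_t."""
    returns = [sum(r for (_, _, r) in traj) for traj in data_pi]
    return float(np.mean(returns))

# Rough sanity check (reuses the hypothetical log_data / tmis_estimate / true_value
# sketches above): with mu = pi both estimators see on-policy data, and n * MSE over
# replications can be compared against the two asymptotic expressions in the text.
# errs_mc, errs_tmis = [], []
# for _ in range(200):
#     data = log_data(pi, n)
#     errs_mc.append(on_policy_mc(data) - v_pi)
#     errs_tmis.append(tmis_estimate(data, pi, S, A, H) - v_pi)
# print(n * np.mean(np.square(errs_mc)), n * np.mean(np.square(errs_tmis)))
```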