Hello
Ep: 0
- Recap and Notation
- Return (also discounting)
- Value functions
- Bellman Equations
- Relation between Q and V (the same up to a change of variables)
Ep: 1
- Goal of episodic tasks: expected return. The sum of rewards.
- Introducing visitations and on-policy distributions
- (Deriving the Bellman Eq for Visitations and On-Policy Distributions)?
- Expressing the expected return in terms of the on-policy distribution
Ep: 2 Riddle
Ep: 3
- What about continuing tasks? We need to make the sum of rewards finite!
- Discounting: the standard approach
- Can we still express this in terms of an on-policy distribution?
Ep: 4
- Convex Optimization
- LP Formulation
Ep 0: Recap
Notation
Expected return
$$
G_t = \sum_{k=t+1}^{T} R_k
$$
Value functions
$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=t+1}^{T} R_k \,\middle|\, S_t = s\right]
$$
Bellman Equation for $v_\pi$
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + G_{t+1} \mid S_t = s\right] \\
&= r(s) + \sum_{\bar{s}} p_\pi(\bar{s} \mid s)\, v_\pi(\bar{s})
\end{aligned}
$$
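As a quick numerical companion to the recap, here is a minimal sketch (the 3-state chain, the names `p_pi`, `r`, `v`, and the convention $r(s) = \mathbb{E}[R_{t+1} \mid S_t = s]$ are illustrative assumptions, not from the notes): the Bellman equation is a linear system, so on a small episodic MDP we can solve it directly.

```python
import numpy as np

# Illustrative 3-state episodic MDP (an assumption, not from the notes):
# state 2 is terminal, p_pi[s_next, s] is the state-to-state transition
# probability under a fixed policy, r(s) = E[R_{t+1} | S_t = s].
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])   # each column sums to 1
r = np.array([1.0, 2.0, 0.0])

# Bellman equation v(s) = r(s) + sum_sbar p_pi(sbar | s) v(sbar),
# with v(terminal) = 0, solved as a linear system over states 0 and 1.
M = p_pi[:2, :2].T                   # M[s, sbar] = p_pi(sbar | s)
v = np.zeros(3)
v[:2] = np.linalg.solve(np.eye(2) - M, r[:2])
print(v)                             # [2.8, 2.0, 0.0]
```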
Ep 1: Bounded Tasks
Bellman Equation for visitations
What is the exact range for the sum? From $0$ to $T-1$, I would say. If you consider $S_T$ to be a special terminal state, you don't need to compute its counts.
$$
\begin{aligned}
\eta(s) &\doteq \mathbb{E}\left[\sum_{t=0}^{T} \mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{T} \mathbb{E}\left[\mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{T} \sum_{\bar{s}} P\{S_t = \bar{s}\}\, \mathbb{1}\{\bar{s} = s\} \\
&= \sum_{t=0}^{T} P\{S_t = s\} = P\{S_0 = s\} + \sum_{t=1}^{T} P\{S_t = s\} \\
&= h(s) + \sum_{t=1}^{T} \sum_{\bar{s}} P\{S_{t-1} = \bar{s}\}\, p_\pi(s \mid \bar{s}) \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{t=1}^{T} P\{S_{t-1} = \bar{s}\} \qquad \text{change variables with } k = t-1 \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{k=0}^{T-1} P\{S_k = \bar{s}\} \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s})\, \eta(\bar{s})
\end{aligned}
$$

(The last step identifies $\sum_{k=0}^{T-1} P\{S_k = \bar{s}\}$ with $\eta(\bar{s})$; this is where the exact range and the terminal state $S_T$, discussed in the note above, matter.)
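A sketch of the same recursion on a tiny episodic chain (the chain and all variable names are assumptions for illustration; as in the note above, the terminal state's count is not tracked): solve for $\eta$ over the non-terminal states and compare with a Monte Carlo visit count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative episodic chain (an assumption): states 0 and 1 are
# non-terminal, state 2 is terminal; p_pi[s_next, s] is the transition
# probability under the policy, h is the start distribution.
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])
h = np.array([1.0, 0.0, 0.0])

# Solve eta(s) = h(s) + sum_sbar p_pi(s | sbar) eta(sbar) over the
# non-terminal states: eta = (I - P)^(-1) h.
P = p_pi[:2, :2]                     # P[s, sbar] = p_pi(s | sbar)
eta = np.linalg.solve(np.eye(2) - P, h[:2])
print(eta)                           # [1.0, 0.9]

# Monte Carlo check: average number of visits to each non-terminal state.
counts, n_episodes = np.zeros(2), 20000
for _ in range(n_episodes):
    s = int(rng.choice(3, p=h))
    while s != 2:                    # stop counting at the terminal state
        counts[s] += 1
        s = int(rng.choice(3, p=p_pi[:, s]))
print(counts / n_episodes)           # approx [1.0, 0.9]
```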
Sum of visitations over all states
This only works if the sum runs up to $T-1$, i.e., if the terminal state $S_T$ is not counted.
$$
\sum_s \eta(s) = \sum_s \sum_{t=0}^{T-1} P\{S_t = s\} = \sum_{t=0}^{T-1} \sum_s P\{S_t = s\} = \sum_{t=0}^{T-1} 1 = T
$$
On-policy Distribution
$$
\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})} = \frac{\eta(s)}{T} = \frac{h(s)}{T} + \sum_{\bar{s}} p_\pi(s \mid \bar{s})\, \mu(\bar{s})
$$
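A short check of this recursion, with the toy chain restated so the snippet runs on its own (again an assumed example, not from the notes):

```python
import numpy as np

# Toy chain restated (non-terminal states only): P[s, sbar] = p_pi(s | sbar).
P = np.array([[0.0, 0.0],
              [0.9, 0.0]])
h = np.array([1.0, 0.0])

eta = np.linalg.solve(np.eye(2) - P, h)
T = eta.sum()                        # expected number of non-terminal steps
mu = eta / T
print(mu.sum())                                  # 1.0
print(np.allclose(mu, h / T + P @ mu))           # the recursion for mu holds
```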
Expressing the expected return in terms of $\mu$
- Again, be careful with the summation index and its range when identifying the inner sum with $\eta(s)$ (third row).
- This is only valid for $G_0$.
$$
\begin{aligned}
\mathbb{E}[G_t] &= \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \sum_{t=1}^{T} \mathbb{E}[R_t] \\
&= \sum_{t=1}^{T} \sum_s P\{S_t = s\}\, r(s) = \sum_s r(s) \sum_{t=1}^{T} P\{S_t = s\} \\
&= \sum_s r(s)\, \eta(s) = T \sum_s r(s)\, \mu(s) = T\, \mathbb{E}\left[r(S) \mid S \sim \mu\right]
\end{aligned}
$$
This says that the value of a policy is the average reward over states, weighted by $\mu$, times the number of time steps.
(If we assume that $T$ is a random variable that changes between episodes, I don't think the formula above is written correctly.)
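To make the identity concrete, a sketch on the same assumed episodic chain (restated here), comparing a Monte Carlo estimate of $\mathbb{E}[G_0]$ with $\sum_s r(s)\,\eta(s) = T\,\mathbb{E}[r(S) \mid S \sim \mu]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative episodic chain, restated; r(s) is the expected reward
# collected while in state s.
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])
h = np.array([1.0, 0.0, 0.0])
r = np.array([1.0, 2.0, 0.0])

# Visitations over the non-terminal states, T = sum_s eta(s), mu = eta / T.
eta = np.linalg.solve(np.eye(2) - p_pi[:2, :2], h[:2])
T = eta.sum()
mu = eta / T

# Monte Carlo estimate of E[G_0] vs. sum_s r(s) eta(s) = T * E[r(S) | S ~ mu].
returns = []
for _ in range(20000):
    s, g = int(rng.choice(3, p=h)), 0.0
    while s != 2:
        g += r[s]
        s = int(rng.choice(3, p=p_pi[:, s]))
    returns.append(g)
print(np.mean(returns), (r[:2] * eta).sum(), T * (r[:2] * mu).sum())  # all approx 2.8
```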
Ep 3: Discounting
$$
\begin{aligned}
\mathbb{E}[G_t] &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}[R_t] \\
&= \sum_{t=0}^{\infty} \gamma^t \sum_s P\{S_t = s\}\, r(s) = \sum_s r(s) \sum_{t=0}^{\infty} \gamma^t P\{S_t = s\} \\
&= \sum_s r(s)\, \eta(s) = \frac{1}{1-\gamma} \sum_s r(s)\, \mu(s) = \frac{1}{1-\gamma}\, \mathbb{E}\left[r(S) \mid S \sim \mu\right]
\end{aligned}
$$
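A numerical sketch of this identity on a small continuing MDP (the 3-state chain, $\gamma = 0.9$, and all variable names are illustrative assumptions): the start-weighted value $h^\top v$ and $\frac{1}{1-\gamma}\,\mathbb{E}[r(S) \mid S \sim \mu]$ agree.

```python
import numpy as np

# Illustrative 3-state continuing MDP (an assumption, not from the notes):
# p_pi[s_next, s] under the policy, r the expected state reward,
# h the start distribution, gamma the discount factor.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
r = np.array([1.0, 0.0, 2.0])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

# Discounted visitations eta = sum_t gamma^t P{S_t = .} = (I - gamma P)^(-1) h,
# and on-policy distribution mu = (1 - gamma) eta.
eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
mu = (1 - gamma) * eta

# E[G_0] two ways: start-weighted values h^T v, and E[r(S) | S ~ mu] / (1 - gamma).
v = np.linalg.solve(np.eye(3) - gamma * p_pi.T, r)
print(h @ v, (r @ mu) / (1 - gamma))   # equal up to floating-point error
```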
Bellman Equation for visitations
$$
\begin{aligned}
\eta(s) &\doteq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\left[\mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{\infty} \gamma^t \sum_{\bar{s}} P\{S_t = \bar{s}\}\, \mathbb{1}\{\bar{s} = s\} \\
&= \sum_{t=0}^{\infty} \gamma^t\, P\{S_t = s\} = P\{S_0 = s\} + \sum_{t=1}^{\infty} \gamma^t\, P\{S_t = s\} \\
&= h(s) + \sum_{t=1}^{\infty} \gamma^t \sum_{\bar{s}} P\{S_{t-1} = \bar{s}\}\, p_\pi(s \mid \bar{s}) \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{t=1}^{\infty} \gamma^t\, P\{S_{t-1} = \bar{s}\} \qquad \text{change variables with } k = t-1 \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{k=0}^{\infty} \gamma^{k+1}\, P\{S_k = \bar{s}\} \\
&= h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s})
\end{aligned}
$$
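A quick check of this recursion on the same assumed continuing chain (restated so the snippet is self-contained): compute $\eta$ directly from its definition by truncating the infinite sum, then verify the Bellman equation for it.

```python
import numpy as np

# Same illustrative continuing chain, restated so this runs on its own.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

# eta from its definition, truncating the infinite sum at a large horizon.
eta = np.zeros(3)
d = h.copy()                           # d[s] = P{S_t = s}
for t in range(2000):                  # gamma^2000 is numerically negligible
    eta += gamma**t * d
    d = p_pi @ d                       # propagate the state distribution
print(np.allclose(eta, h + gamma * p_pi @ eta))   # Bellman equation for eta
```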
Sum of visitations over all states
$$
\begin{aligned}
\eta(s) &= h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s}) \\
\implies \sum_s \eta(s) &= \sum_s h(s) + \gamma \sum_s \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s}) \\
\implies \sum_s \eta(s) - \gamma \sum_{\bar{s}} \eta(\bar{s}) \underbrace{\sum_s p_\pi(s \mid \bar{s})}_{=\,1} &= 1 \\
\implies \sum_s \eta(s) - \gamma \sum_{\bar{s}} \eta(\bar{s}) &= 1 \\
\implies (1-\gamma) \sum_s \eta(s) &= 1 \\
\implies \sum_s \eta(s) &= \frac{1}{1-\gamma}
\end{aligned}
$$
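The normalization constant, checked numerically on the same assumed chain:

```python
import numpy as np

# Same illustrative continuing chain, restated.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
print(eta.sum(), 1 / (1 - gamma))      # both 10.0 (up to float error)
```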
On-policy Distribution
$$
\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})} = (1-\gamma)\,\eta(s) = (1-\gamma)\, h(s) + \gamma \sum_{\bar{s}} \mu(\bar{s})\, p_\pi(s \mid \bar{s})
$$
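Finally, a check that $\mu = (1-\gamma)\,\eta$ is a proper distribution and satisfies this recursion (same assumed chain, restated):

```python
import numpy as np

# Same illustrative continuing chain, restated.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
mu = (1 - gamma) * eta
print(mu.sum())                                              # 1.0
print(np.allclose(mu, (1 - gamma) * h + gamma * p_pi @ mu))  # True
```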