# Visitations and On-Policy Distributions

Ep: 0

• Recap and Notation
• Return (also discounting)
• Value functions
• Bellman Equations
• Relation between Q and V (the same up to a change of variables)

Ep: 1

• Goal of episodic tasks: maximizing the expected return, i.e. the expected sum of rewards.
• Introducing visitations and on-policy distributions
• (Deriving the Bellman Eq for Visitations and On-Policy Distributions)?
• Expressing the expected return in terms of the on-policy distribution

Ep: 2 Riddle

Ep: 3

• What about continuing tasks? We need to make the sum of rewards finite!
• Discounting: the standard approach
• Can we still express this in terms of an on-policy distribution?

Ep: 4

• Convex Optimization
• LP Formulation

# Ep0: Recap

## Notation

### Expected return

$G_t = \sum_{k=t+1}^T R_k$

### Value functions

$v_\pi(s) \doteq \E_\pi \left[ G_t \mid S_t = s \right] = \E_\pi \left[\sum_{k=t+1}^T R_k \middle| S_t = s \right]$

### Bellman Equation for $v_\pi$

\begin{aligned} v_\pi(s) &\doteq \E_\pi \left[ G_t \mid S_t = s \right] \\ &= \E_\pi \left[ R_{t+1} + G_{t+1} \mid S_t = s \right] \\ &= r(s) + \sum_{\bar s} p^\pi(\bar s|s)v_\pi(\bar s) \end{aligned}
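As a quick numerical sanity check, the Bellman equation can be solved by fixed-point iteration. The tiny episodic chain below (states, rewards, and transitions) is made up for illustration:

```python
# Sanity check of v_pi(s) = r(s) + sum_{s'} p_pi(s'|s) v_pi(s')
# on a made-up episodic chain: 0 -> 1 -> terminal, deterministic.
r = {0: 1.0, 1: 2.0}        # r(s): expected reward for the step out of s
p = {0: {1: 1.0}, 1: {}}    # p_pi(s'|s); an empty dict means "next state is terminal"

def bellman_v(r, p, iters=50):
    """Iterate v <- r + P v; converges because the chain is episodic."""
    v = {s: 0.0 for s in r}
    for _ in range(iters):
        v = {s: r[s] + sum(q * v[s2] for s2, q in p[s].items()) for s in r}
    return v

v = bellman_v(r, p)
print(v)  # {0: 3.0, 1: 2.0}: v(1) = r(1) = 2, v(0) = r(0) + v(1) = 3
```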

### Bellman Equation for visitations

The sum runs from $0$ to $T-1$: if $S_T$ is a special terminal state, its visits need not be counted, and $\P\{S_T=s\} = 0$ for every non-terminal $s$.

\begin{aligned} \eta(s) & \doteq \E \left[ \sum_{t=0}^{T-1} \1\{S_t=s\} \right] \\ &= \sum_{t=0}^{T-1} \E \left[ \1\{S_t=s\} \right] \\ &= \sum_{t=0}^{T-1} \sum_{\bar s} \P\{S_t=\bar s\} \1\{\bar s=s\} \\ &= \sum_{t=0}^{T-1} \P\{S_t=s\} \\ &= \P\{S_0=s\} + \sum_{t=1}^{T-1} \P\{S_t=s\} \\ \downarrow& \textit{ Since } s \textit{ is non-terminal, } \P\{S_T=s\}=0 \\ &= h(s) + \sum_{t=1}^{T} \P\{S_t=s\} \\ &= h(s) + \sum_{t=1}^{T} \sum_{\bar s} \P\{S_{t-1}=\bar s\} p^\pi(s|\bar s) \\ &= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{t=1}^{T} \P\{S_{t-1}=\bar s\} \\ \downarrow& \textit{ Change variables with: } k=t-1\\ &= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{k=0}^{T-1} \P\{S_k=\bar s\} \\ &= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \eta(\bar s) \end{aligned}
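To double-check the recursion, the sketch below computes $\eta$ both from the definition (summing $\P\{S_t=s\}$ over time) and as the fixed point of the Bellman equation. The layered MDP (start in 0, branch to 1 or 2, then terminate, so every episode has $T=2$) is a made-up example:

```python
# eta(s) from the definition vs. the recursion eta(s) = h(s) + sum_sb p(s|sb) eta(sb)
# on a made-up layered episodic MDP: every episode lasts exactly T = 2 steps.
h = {0: 1.0, 1: 0.0, 2: 0.0}             # start distribution h(s)
p = {0: {1: 0.3, 2: 0.7}, 1: {}, 2: {}}  # p_pi(s'|s); empty = terminal next
T = 2

# Definition: eta(s) = sum_{t=0}^{T-1} P{S_t = s}, via forward propagation
eta_def = {s: 0.0 for s in h}
dist = dict(h)
for t in range(T):
    for s in h:
        eta_def[s] += dist[s]
    dist = {s: sum(dist[sb] * p[sb].get(s, 0.0) for sb in h) for s in h}

# Recursion: iterate eta <- h + P^T eta (converges for episodic chains)
eta_rec = {s: 0.0 for s in h}
for _ in range(10):
    eta_rec = {s: h[s] + sum(p[sb].get(s, 0.0) * eta_rec[sb] for sb in h) for s in h}

print(eta_def, eta_rec)  # both: eta(0)=1.0, eta(1)=0.3, eta(2)=0.7
```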

### Sum of visitations over all states

With the sum running from $0$ to $T-1$, each timestep contributes total probability $1$:

\begin{aligned} \sum_s \eta(s) &= \sum_s \sum_{t=0}^{T-1} \P\{S_t=s\} \\ &= \sum_{t=0}^{T-1} \sum_s \P\{S_t=s\} \\ &= \sum_{t=0}^{T-1} 1 = T \end{aligned}
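A short numerical check of this identity, reusing the same made-up layered MDP (every episode lasts exactly $T=2$ steps):

```python
# Check sum_s eta(s) = T with eta(s) = sum_{t=0}^{T-1} P{S_t = s}
# on a made-up layered MDP where every episode lasts exactly T = 2 steps.
h = {0: 1.0, 1: 0.0, 2: 0.0}
p = {0: {1: 0.3, 2: 0.7}, 1: {}, 2: {}}
T = 2

eta = {s: 0.0 for s in h}
dist = dict(h)
for t in range(T):
    for s in h:
        eta[s] += dist[s]
    dist = {s: sum(dist[sb] * p[sb].get(s, 0.0) for sb in h) for s in h}

print(sum(eta.values()))  # 2.0, i.e. T
```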

### On-policy Distribution

$\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)} = \frac{\eta(s)}T = \frac{h(s)}T + \sum_{\bar s} p^\pi(s|\bar s) \mu(\bar s)$
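The normalization and the recursion for $\mu$ can also be verified numerically on the same made-up layered MDP ($T=2$):

```python
# Check that mu(s) = eta(s)/T sums to 1 and satisfies
# mu(s) = h(s)/T + sum_sb p(s|sb) mu(sb), on a made-up layered MDP with T = 2.
h = {0: 1.0, 1: 0.0, 2: 0.0}
p = {0: {1: 0.3, 2: 0.7}, 1: {}, 2: {}}
T = 2

eta = {s: 0.0 for s in h}
dist = dict(h)
for t in range(T):
    for s in h:
        eta[s] += dist[s]
    dist = {s: sum(dist[sb] * p[sb].get(s, 0.0) for sb in h) for s in h}

mu = {s: eta[s] / T for s in h}
rhs = {s: h[s] / T + sum(p[sb].get(s, 0.0) * mu[sb] for sb in h) for s in h}
print(mu)  # {0: 0.5, 1: 0.15, 2: 0.35}; rhs matches mu
```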

### Expressing the expected return in terms of $\mu$

• With $r(s) \doteq \E[R_{t+1} \mid S_t=s]$, be careful with the index variable in the second and third rows: it has to be shifted to match the range of $\eta$.
• This is only valid for $G_0$, so we write $\E[G_0]$ explicitly.
\begin{aligned} \E [ G_0 ] &= \E \left[ \sum_{t=1}^T R_t \right] = \sum_{t=1}^T \E \left[ R_t \right] \\ &= \sum_{t=1}^T \sum_s \P\{S_{t-1}=s\}r(s) \\ &= \sum_s r(s) \sum_{t=1}^T \P\{S_{t-1}=s\} \\ \downarrow& \textit{ Change variables with: } k=t-1\\ &= \sum_s r(s) \sum_{k=0}^{T-1} \P\{S_k=s\} \\ &= \sum_s r(s)\eta(s) \\ &= T \sum_s r(s)\mu(s) \\ &= T \E[ r(S) | S \sim \mu ] \end{aligned}

This says that the value of a policy is the average reward under the on-policy distribution times the episode length.

(If $T$ is a random variable that changes between episodes, I don't think the formula above is written correctly.)
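To see the identity $\E[G_0] = T \sum_s r(s)\mu(s)$ in action, the sketch below enumerates the two possible episodes of the same made-up layered MDP ($T=2$, rewards invented for illustration) and compares the expected return against the on-policy-distribution formula:

```python
# Check E[G_0] = T * sum_s r(s) mu(s) on a made-up layered MDP:
# start in 0, go to 1 (prob 0.3) or 2 (prob 0.7), then terminate; T = 2.
h = {0: 1.0, 1: 0.0, 2: 0.0}
p = {0: {1: 0.3, 2: 0.7}, 1: {}, 2: {}}
r = {0: 1.0, 1: 2.0, 2: 4.0}   # r(s) = E[R_{t+1} | S_t = s], invented numbers
T = 2

# Expected return by enumerating the two possible episodes
g0 = 0.3 * (r[0] + r[1]) + 0.7 * (r[0] + r[2])

# Same quantity via the on-policy distribution mu = eta / T
eta = {s: 0.0 for s in h}
dist = dict(h)
for t in range(T):
    for s in h:
        eta[s] += dist[s]
    dist = {s: sum(dist[sb] * p[sb].get(s, 0.0) for sb in h) for s in h}
mu = {s: eta[s] / T for s in h}

print(g0, T * sum(r[s] * mu[s] for s in h))  # both approximately 4.4
```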

# Ep 3: Discounting

### Expressing the expected return in terms of $\mu$

\begin{aligned} \E [ G_0 ] &= \E \left[ \sum_{t=0}^\infty \gamma^t R_{t+1} \right] = \sum_{t=0}^\infty \gamma^t \E \left[ R_{t+1} \right] \\ &= \sum_{t=0}^\infty \gamma^t \sum_s \P\{S_t=s\}r(s) \\ &= \sum_s r(s) \sum_{t=0}^\infty \gamma^t \P\{S_t=s\} \\ &= \sum_s r(s)\eta(s) \\ &= \frac1{1-\gamma} \sum_s r(s)\mu(s) \\ &= \frac1{1-\gamma} \E[ r(S) | S \sim \mu ] \end{aligned}
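The same check works in the discounted continuing case. The two-state chain and its numbers below are invented, and the infinite sums are truncated once $\gamma^t$ is negligible:

```python
# Check E[G_0] = (1/(1-gamma)) * sum_s r(s) mu(s) on a made-up two-state
# continuing chain; the infinite sums are truncated at t = 400 (gamma^400 ~ 5e-19).
p = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}  # p_pi(s'|s)
h = {0: 1.0, 1: 0.0}
r = {0: 1.0, 1: 3.0}   # r(s) = E[R_{t+1} | S_t = s], invented numbers
gamma = 0.9

# E[G_0] = sum_t gamma^t sum_s P{S_t=s} r(s), truncated
g0, dist = 0.0, dict(h)
for t in range(400):
    g0 += gamma ** t * sum(dist[s] * r[s] for s in h)
    dist = {s: sum(dist[sb] * p[sb][s] for sb in h) for s in h}

# Discounted eta as a fixed point, then mu = (1 - gamma) * eta
eta = {s: 0.0 for s in h}
for _ in range(400):
    eta = {s: h[s] + gamma * sum(p[sb][s] * eta[sb] for sb in h) for s in h}
mu = {s: (1 - gamma) * eta[s] for s in h}

print(g0, sum(r[s] * mu[s] for s in h) / (1 - gamma))  # both approximately 22.33
```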

### Bellman Equation for visitations

\begin{aligned} \eta(s) & \doteq \E \left[ \sum_{t=0}^\infty \gamma^t \1\{S_t=s\} \right] \\ &= \sum_{t=0}^\infty \gamma^t \E \left[ \1\{S_t=s\} \right] \\ &= \sum_{t=0}^\infty \gamma^t \sum_{\bar s} \P\{S_t=\bar s\} \1\{\bar s=s\} \\ &= \sum_{t=0}^\infty \gamma^t \P\{S_t=s\} \\ &= \P\{S_0=s\} + \sum_{t=1}^\infty \gamma^t \P\{S_t=s\} \\ &= h(s) + \sum_{t=1}^\infty \gamma^t \sum_{\bar s} \P\{S_{t-1}=\bar s\} p^\pi(s|\bar s) \\ &= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{t=1}^\infty \gamma^t \P\{S_{t-1}=\bar s\} \\ \downarrow& \textit{ Change variables with: } k=t-1\\ &= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{k=0}^\infty \gamma^{k+1} \P\{S_k=\bar s\} \\ &= h(s) + \gamma \sum_{\bar s} \eta(\bar s) p^\pi(s|\bar s) \end{aligned}
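As before, the discounted recursion can be checked against the definition on a made-up two-state continuing chain, truncating the infinite sum once $\gamma^t$ is negligible:

```python
# Check eta(s) = h(s) + gamma * sum_sb eta(sb) p(s|sb) on a made-up two-state
# continuing chain, against the definition eta(s) = sum_t gamma^t P{S_t=s}.
p = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
h = {0: 1.0, 1: 0.0}
gamma = 0.9

# Definition, truncated at t = 400 (gamma^400 ~ 5e-19)
eta_def, dist = {s: 0.0 for s in h}, dict(h)
for t in range(400):
    for s in h:
        eta_def[s] += gamma ** t * dist[s]
    dist = {s: sum(dist[sb] * p[sb][s] for sb in h) for s in h}

# Fixed point of the recursion
eta_rec = {s: 0.0 for s in h}
for _ in range(400):
    eta_rec = {s: h[s] + gamma * sum(p[sb][s] * eta_rec[sb] for sb in h) for s in h}

print(eta_def, eta_rec)  # both approximately {0: 3.8356, 1: 6.1644}
```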

### Sum of visitations over all states

\begin{aligned} &\eta(s) = h(s) + \gamma \sum_{\bar s} \eta(\bar s) p^\pi(s|\bar s) \\ \iff& \sum_s \eta(s) = \sum_s h(s) + \sum_s \gamma \sum_{\bar s} \eta(\bar s) p^\pi(s|\bar s) \\ \iff& \sum_s \eta(s) - \gamma \sum_{\bar s} \eta(\bar s) \sum_s p^\pi(s|\bar s) = 1 \\ \iff& \sum_s \eta(s) - \gamma \sum_{\bar s} \eta(\bar s) = 1 \\ \iff& (1-\gamma)\sum_s \eta(s) = 1 \\ \iff& \sum_s \eta(s) = \frac1{1-\gamma} \end{aligned}
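A quick numerical confirmation of $\sum_s \eta(s) = \frac{1}{1-\gamma}$ on the same kind of made-up two-state continuing chain:

```python
# Check sum_s eta(s) = 1/(1-gamma) on a made-up two-state continuing chain.
p = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
h = {0: 1.0, 1: 0.0}
gamma = 0.9

eta = {s: 0.0 for s in h}
for _ in range(400):
    eta = {s: h[s] + gamma * sum(p[sb][s] * eta[sb] for sb in h) for s in h}

print(sum(eta.values()))  # approximately 10.0 = 1/(1-gamma)
```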

### On-policy Distribution

\begin{aligned} \mu(s) &\doteq \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)} \\ &= (1-\gamma)h(s) + \gamma \sum_{\bar s} \mu(\bar s) p^\pi(s|\bar s) \end{aligned}
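Finally, the discounted on-policy distribution can be checked the same way: it should sum to $1$ and satisfy its recursion on the made-up two-state continuing chain:

```python
# Check mu(s) = (1-gamma) h(s) + gamma * sum_sb mu(sb) p(s|sb), sum_s mu(s) = 1,
# on a made-up two-state continuing chain.
p = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
h = {0: 1.0, 1: 0.0}
gamma = 0.9

eta = {s: 0.0 for s in h}
for _ in range(400):
    eta = {s: h[s] + gamma * sum(p[sb][s] * eta[sb] for sb in h) for s in h}
mu = {s: (1 - gamma) * eta[s] for s in h}
rhs = {s: (1 - gamma) * h[s] + gamma * sum(p[sb][s] * mu[sb] for sb in h) for s in h}

print(mu)  # sums to 1; rhs matches mu
```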