Visitations and On-Policy Distributions


Ep: 0

  • Recap and Notation
  • Return (also discounting)
  • Value functions
  • Bellman Equations
  • Relation between Q and V (the same up to a change of variables)

Ep: 1

  • Goal of episodic tasks: expected return. The sum of rewards.
  • Introducing visitations and on-policy distributions
  • (Deriving the Bellman Eq for Visitations and On-Policy Distributions)?
  • Expressing the expected return in terms of the on-policy distribution

Ep: 2 Riddle

Ep: 3

  • What about continuing tasks? We need to make the sum of rewards finite!
  • Discounting: the standard approach
  • Can we still express this in terms of an on-policy distribution?

Ep: 4

  • Convex Optimization
  • LP Formulation

Ep 0: Recap

Notation

Expected return

$$G_t = \sum_{k=t+1}^T R_k$$

Value functions

$$v_\pi(s) \doteq \E_\pi \left[ G_t \mid S_t = s \right] = \E_\pi \left[ \sum_{k=t+1}^T R_k \,\middle|\, S_t = s \right]$$

Bellman Equation for $v_\pi$

$$\begin{aligned}
v_\pi(s) &\doteq \E_\pi \left[ G_t \mid S_t = s \right] \\
&= \E_\pi \left[ R_{t+1} + G_{t+1} \mid S_t = s \right] \\
&= r(s) + \sum_{\bar s} p^\pi(\bar s|s)\, v_\pi(\bar s)
\end{aligned}$$
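To sanity-check this, here is a minimal Python/NumPy sketch that solves the Bellman equation as a linear system on a small made-up episodic chain. The transition matrix is the policy-induced $p^\pi(\bar s|s)$ specified directly (no explicit actions), and $r(s) = \E[R_{t+1} \mid S_t = s]$; the chain and rewards are invented for illustration only.

```python
import numpy as np

# Small made-up episodic chain: states 0, 1, 2 are non-terminal, state 3 is terminal.
# P[s, s_bar] = p^pi(s_bar | s) is the policy-induced transition matrix,
# r[s] = E[R_{t+1} | S_t = s] is the expected reward for leaving state s.
P = np.array([
    [0.0, 0.5, 0.5, 0.0],   # from state 0
    [0.0, 0.0, 0.7, 0.3],   # from state 1
    [0.4, 0.0, 0.0, 0.6],   # from state 2
    [0.0, 0.0, 0.0, 1.0],   # terminal state 3 is absorbing
])
r = np.array([1.0, 0.5, 2.0])
P_nn = P[:3, :3]            # transitions among non-terminal states only

# Bellman equation v = r + P_nn v with v(terminal) = 0, solved as a linear system.
v = np.linalg.solve(np.eye(3) - P_nn, r)
print("v_pi =", v)
```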

Ep 1: Bounded Tasks

Bellman Equation for visitations

What is the exact range for the sum? From $0$ to $T-1$, I would say. If you consider $S_T$ to be a special terminal state whose visits we don't count, it doesn't actually matter: the $t=T$ term contributes nothing, so the sum can be written up to $T$.

$$\begin{aligned}
\eta(s) &\doteq \E \left[ \sum_{t=0}^T \1\{S_t=s\} \right] \\
&= \sum_{t=0}^T \E \left[ \1\{S_t=s\} \right] \\
&= \sum_{t=0}^T \sum_{\bar s} \P\{S_t=\bar s\} \1\{\bar s=s\} \\
&= \sum_{t=0}^T \P\{S_t=s\} \\
&= \P\{S_0=s\} + \sum_{t=1}^T \P\{S_t=s\} \\
&= h(s) + \sum_{t=1}^T \sum_{\bar s} \P\{S_{t-1}=\bar s\}\, p^\pi(s|\bar s) \\
&= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{t=1}^T \P\{S_{t-1}=\bar s\} \\
\downarrow&\ \textit{Change variables with } k=t-1 \\
&= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{k=0}^{T-1} \P\{S_k=\bar s\} \\
\downarrow&\ \P\{S_T=\bar s\}=0 \textit{ for non-terminal } \bar s\textit{, so the inner sum equals } \eta(\bar s) \\
&= h(s) + \sum_{\bar s} p^\pi(s|\bar s)\, \eta(\bar s)
\end{aligned}$$
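A quick numerical check of the visitation Bellman equation, on the same made-up chain as above: solve $\eta = h + (P^\pi)^\top \eta$ over the non-terminal states and compare with Monte Carlo visit counts. The chain and the start distribution $h$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up chain as above: states 0, 1, 2 non-terminal, state 3 terminal.
P = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.4, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 1.0],
])
h = np.array([0.6, 0.3, 0.1])   # start distribution over the non-terminal states
P_nn = P[:3, :3]

# Bellman equation for visitations: eta = h + P_nn^T eta.
eta = np.linalg.solve(np.eye(3) - P_nn.T, h)
print("eta (linear solve):", eta)

# Monte Carlo: average, over episodes, of the number of visits to each state.
counts = np.zeros(3)
n_episodes = 100_000
for _ in range(n_episodes):
    s = rng.choice(3, p=h)
    while s != 3:               # state 3 is terminal
        counts[s] += 1
        s = rng.choice(4, p=P[s])
print("eta (Monte Carlo): ", counts / n_episodes)
```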

Sum of visitations over all states

This works out to $T$ (rather than $T+1$) because the sums over $s$ run over the non-terminal states only: $\sum_s \P\{S_t=s\} = 1$ for $t < T$ and $= 0$ for $t = T$, so the time range is effectively $0$ to $T-1$.

$$\begin{aligned}
\sum_s \eta(s) &= \sum_s \sum_{t=0}^T \P\{S_t=s\} \\
&= \sum_{t=0}^T \sum_s \P\{S_t=s\} \\
&= \sum_{t=0}^{T-1} 1 = T
\end{aligned}$$
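A small check that $\sum_s \eta(s)$ really is the (expected) episode length on the made-up chain; with a random $T$ it comes out to $\E[T]$, which is the caveat noted further down.

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.4, 0.0, 0.0, 0.6],
              [0.0, 0.0, 0.0, 1.0]])
h = np.array([0.6, 0.3, 0.1])
eta = np.linalg.solve(np.eye(3) - P[:3, :3].T, h)

# Empirical episode lengths (number of non-terminal steps before reaching state 3).
lengths = []
for _ in range(100_000):
    s, steps = rng.choice(3, p=h), 0
    while s != 3:
        steps += 1
        s = rng.choice(4, p=P[s])
    lengths.append(steps)

print(eta.sum(), np.mean(lengths))   # both approximate E[T]
```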

On-policy Distribution

$$\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)} = \frac{\eta(s)}{T} = \frac{h(s)}{T} + \sum_{\bar s} p^\pi(s|\bar s)\, \mu(\bar s)$$
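Sketch of the corresponding check for $\mu$ on the same made-up chain: normalize $\eta$ by $\sum_{\bar s} \eta(\bar s)$ (which plays the role of $T$) and verify the recursion elementwise.

```python
import numpy as np

# Non-terminal block of the made-up chain and its start distribution.
P_nn = np.array([[0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.7],
                 [0.4, 0.0, 0.0]])
h = np.array([0.6, 0.3, 0.1])

eta = np.linalg.solve(np.eye(3) - P_nn.T, h)
norm = eta.sum()                 # plays the role of T (E[T] when T is random)
mu = eta / norm                  # on-policy distribution

print(mu, mu.sum())                                # a distribution: sums to 1
print(np.allclose(mu, h / norm + P_nn.T @ mu))     # recursion holds: True
```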

Expressing the expected return in terms of $\mu$

  • Again, careful with index variable in third row
  • This is only valid for G0G_0
$$\begin{aligned}
\E [ G_0 ] &= \E \left[ \sum_{t=1}^T R_t \right] = \sum_{t=1}^T \E \left[ R_t \right] \\
&= \sum_{k=0}^{T-1} \E \left[ R_{k+1} \right] \\
&= \sum_{k=0}^{T-1} \sum_s \P\{S_k=s\}\, r(s) \\
&= \sum_s r(s) \sum_{k=0}^{T-1} \P\{S_k=s\} \\
&= \sum_s r(s)\,\eta(s) \\
&= T \sum_s r(s)\,\mu(s) \\
&= T\, \E[ r(S) \mid S \sim \mu ]
\end{aligned}$$
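A numerical check of this chain of equalities, again on the made-up episodic chain: the expected return computed from the value function matches $\sum_s r(s)\eta(s)$ and $\big(\sum_s \eta(s)\big)\,\E[r(S) \mid S \sim \mu]$.

```python
import numpy as np

P_nn = np.array([[0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.7],
                 [0.4, 0.0, 0.0]])
r = np.array([1.0, 0.5, 2.0])
h = np.array([0.6, 0.3, 0.1])

v   = np.linalg.solve(np.eye(3) - P_nn,   r)   # state values
eta = np.linalg.solve(np.eye(3) - P_nn.T, h)   # expected visit counts
mu  = eta / eta.sum()                          # on-policy distribution

print(h @ v)                        # E[G_0] from the value function
print(r @ eta)                      # sum_s r(s) eta(s)
print(eta.sum() * (r @ mu))         # E[T] * E[r(S) | S ~ mu]  -- all three match
```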

This is saying that the expected return of a policy is the average reward under the on-policy distribution $\mu$, times the number of timesteps $T$.

(If we assume that $T$ is a random variable that changes between episodes, the formula above isn't written quite correctly: the normalizer $\sum_s \eta(s)$ becomes $\E[T]$, so $\E[G_0] = \E[T]\,\E[r(S) \mid S \sim \mu]$.)

Ep 3: Discounting

$$\begin{aligned}
\E [ G_0 ] &= \E \left[ \sum_{t=0}^\infty \gamma^t R_{t+1} \right] = \sum_{t=0}^\infty \gamma^t \E \left[ R_{t+1} \right] \\
&= \sum_{t=0}^\infty \gamma^t \sum_s \P\{S_t=s\}\, r(s) \\
&= \sum_s r(s) \sum_{t=0}^\infty \gamma^t \P\{S_t=s\} \\
&= \sum_s r(s)\,\eta(s) \\
&= \frac1{1-\gamma} \sum_s r(s)\,\mu(s) \\
&= \frac1{1-\gamma} \E[ r(S) \mid S \sim \mu ]
\end{aligned}$$

where $\eta(s) \doteq \sum_{t=0}^\infty \gamma^t \P\{S_t=s\}$ is now the discounted visitation, derived below.
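Same identity, checked numerically in the discounted case. A sketch with a small made-up continuing chain (policy-induced transitions specified directly, $\gamma = 0.9$), using the discounted $\eta$ and $\mu$ derived below: $h^\top v_\pi$, $\sum_s r(s)\eta(s)$, and $\frac{1}{1-\gamma}\E[r(S)\mid S\sim\mu]$ should all print the same number.

```python
import numpy as np

# Made-up continuing chain (no terminal state), policy-induced transitions.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, 0.5, 2.0])
h = np.array([0.6, 0.3, 0.1])
gamma = 0.9

v   = np.linalg.solve(np.eye(3) - gamma * P,   r)   # discounted values
eta = np.linalg.solve(np.eye(3) - gamma * P.T, h)   # discounted visitations
mu  = (1 - gamma) * eta                             # on-policy distribution

print(h @ v)                        # E[G_0]
print(r @ eta)                      # sum_s r(s) eta(s)
print((r @ mu) / (1 - gamma))       # (1/(1-gamma)) E[r(S) | S ~ mu]  -- all match
```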

Bellman Equation for visitations

$$\begin{aligned}
\eta(s) &\doteq \E \left[ \sum_{t=0}^\infty \gamma^t \1\{S_t=s\} \right] \\
&= \sum_{t=0}^\infty \gamma^t \E \left[ \1\{S_t=s\} \right] \\
&= \sum_{t=0}^\infty \gamma^t \sum_{\bar s} \P\{S_t=\bar s\} \1\{\bar s=s\} \\
&= \sum_{t=0}^\infty \gamma^t \P\{S_t=s\} \\
&= \P\{S_0=s\} + \sum_{t=1}^\infty \gamma^t \P\{S_t=s\} \\
&= h(s) + \sum_{t=1}^\infty \gamma^t \sum_{\bar s} \P\{S_{t-1}=\bar s\}\, p^\pi(s|\bar s) \\
&= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{t=1}^\infty \gamma^t \P\{S_{t-1}=\bar s\} \\
\downarrow&\ \textit{Change variables with } k=t-1 \\
&= h(s) + \sum_{\bar s} p^\pi(s|\bar s) \sum_{k=0}^\infty \gamma^{k+1} \P\{S_k=\bar s\} \\
&= h(s) + \gamma \sum_{\bar s} \eta(\bar s)\, p^\pi(s|\bar s)
\end{aligned}$$
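A Monte Carlo sanity check of the discounted visitation (made-up continuing chain again): solve $\eta = h + \gamma (P^\pi)^\top \eta$ and compare with the empirical average of $\sum_t \gamma^t \1\{S_t = s\}$, truncating the infinite sum at a finite horizon.

```python
import numpy as np

rng = np.random.default_rng(2)

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
h = np.array([0.6, 0.3, 0.1])
gamma = 0.9

# Linear solve of eta = h + gamma * P^T eta.
eta = np.linalg.solve(np.eye(3) - gamma * P.T, h)

# Monte Carlo: average of sum_t gamma^t 1{S_t = s}, truncating the infinite sum.
est = np.zeros(3)
n_runs, horizon = 5_000, 100        # gamma^100 is negligible
for _ in range(n_runs):
    s = rng.choice(3, p=h)
    for t in range(horizon):
        est[s] += gamma ** t
        s = rng.choice(3, p=P[s])

print("eta (linear solve):", eta)
print("eta (Monte Carlo): ", est / n_runs)
```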

Sum of visitations over all states

$$\begin{aligned}
&\eta(s) = h(s) + \gamma \sum_{\bar s} \eta(\bar s)\, p^\pi(s|\bar s) \\
\iff& \sum_s \eta(s) = \sum_s h(s) + \gamma \sum_s \sum_{\bar s} \eta(\bar s)\, p^\pi(s|\bar s) \\
\iff& \sum_s \eta(s) - \gamma \sum_{\bar s} \eta(\bar s) \sum_s p^\pi(s|\bar s) = 1 \\
\iff& \sum_s \eta(s) - \gamma \sum_{\bar s} \eta(\bar s) = 1 \\
\iff& (1-\gamma)\sum_s \eta(s) = 1 \\
\iff& \sum_s \eta(s) = \frac1{1-\gamma}
\end{aligned}$$
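Quick check that the total discounted visitation is $\frac{1}{1-\gamma}$; it holds for any chain, since each row of $P^\pi$ sums to $1$ (same made-up chain as above).

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
h = np.array([0.6, 0.3, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * P.T, h)
print(eta.sum(), 1 / (1 - gamma))   # both ~10
```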

On-policy Distribution

$$\begin{aligned}
\mu(s) &\doteq \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)} = (1-\gamma)\,\eta(s) \\
&= (1-\gamma)\,h(s) + \gamma \sum_{\bar s} \mu(\bar s)\, p^\pi(s|\bar s)
\end{aligned}$$
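And the corresponding check for the discounted on-policy distribution: $\mu = (1-\gamma)\eta$ sums to $1$ and satisfies the recursion (same made-up continuing chain).

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
h = np.array([0.6, 0.3, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * P.T, h)
mu = (1 - gamma) * eta              # normalize: sum eta = 1/(1-gamma)

print(mu.sum())                                              # 1.0
print(np.allclose(mu, (1 - gamma) * h + gamma * P.T @ mu))   # True
```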