Hello
Ep: 0
- Recap and Notation
- Return (also discounting)
- Value functions
- Bellman Equations
- Relation between Q and V (the same up to a change of variables)
Ep: 1
- Goal of episodic tasks: expected return. The sum of rewards.
- Introducing visitations and on-policy distributions
- (Deriving the Bellman Eq for Visitations and On-Policy Distributions)?
- Expressing the expected return in terms of the on-policy distribution
Ep: 2 Riddle
Ep: 3
- What about continuing tasks? We need to make the sum of rewards finite!
- Discounting: the standard approach
- Can we still express this in terms of an on-policy distribution?
Ep: 4
- Convex Optimization
- LP Formulation
Ep 0: Recap
Notation
Expected return
$$
G_t = \sum_{k=t+1}^{T} R_k
$$
Value functions
$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=t+1}^{T} R_k \,\middle|\, S_t = s\right]
$$
Bellman Equation for $v_\pi$
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + G_{t+1} \mid S_t = s\right] \\
&= r(s) + \sum_{\bar{s}} p_\pi(\bar{s} \mid s)\, v_\pi(\bar{s})
\end{aligned}
$$
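As a quick numerical companion to the recap, here is a minimal sketch (the 3-state chain, the names `p_pi`, `r`, `v`, and the convention $r(s) = \mathbb{E}[R_{t+1} \mid S_t = s]$ are illustrative assumptions, not from the notes): the Bellman equation is a linear system, so on a small episodic MDP we can solve it directly.

```python
import numpy as np

# Illustrative 3-state episodic MDP (an assumption, not from the notes):
# state 2 is terminal, p_pi[s_next, s] is the state-to-state transition
# probability under a fixed policy, r(s) = E[R_{t+1} | S_t = s].
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])   # each column sums to 1
r = np.array([1.0, 2.0, 0.0])

# Bellman equation v(s) = r(s) + sum_sbar p_pi(sbar | s) v(sbar),
# with v(terminal) = 0, solved as a linear system over states 0 and 1.
M = p_pi[:2, :2].T                   # M[s, sbar] = p_pi(sbar | s)
v = np.zeros(3)
v[:2] = np.linalg.solve(np.eye(2) - M, r[:2])
print(v)                             # [2.8, 2.0, 0.0]
```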
Ep 1: Bounded Tasks
Bellman Equation for visitations
What is the exact range for the sum? From $0$ to $T-1$, I would say. If you consider $S_T$ to be a special terminal state, you don't need to compute its counts.
$$
\begin{aligned}
\eta(s) &\doteq \mathbb{E}\left[\sum_{t=0}^{T} \mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{T} \mathbb{E}\left[\mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{T} \sum_{\bar{s}} P\{S_t = \bar{s}\}\, \mathbb{1}\{\bar{s} = s\} \\
&= \sum_{t=0}^{T} P\{S_t = s\} = P\{S_0 = s\} + \sum_{t=1}^{T} P\{S_t = s\} \\
&= h(s) + \sum_{t=1}^{T} \sum_{\bar{s}} P\{S_{t-1} = \bar{s}\}\, p_\pi(s \mid \bar{s}) \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{t=1}^{T} P\{S_{t-1} = \bar{s}\} \qquad \text{change variables with } k = t-1 \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{k=0}^{T-1} P\{S_k = \bar{s}\} \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s})\, \eta(\bar{s})
\end{aligned}
$$

(The last step identifies $\sum_{k=0}^{T-1} P\{S_k = \bar{s}\}$ with $\eta(\bar{s})$; this is where the exact range and the terminal state $S_T$, discussed in the note above, matter.)
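A sketch of the same recursion on a tiny episodic chain (the chain and all variable names are assumptions for illustration; as in the note above, the terminal state's count is not tracked): solve for $\eta$ over the non-terminal states and compare with a Monte Carlo visit count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative episodic chain (an assumption): states 0 and 1 are
# non-terminal, state 2 is terminal; p_pi[s_next, s] is the transition
# probability under the policy, h is the start distribution.
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])
h = np.array([1.0, 0.0, 0.0])

# Solve eta(s) = h(s) + sum_sbar p_pi(s | sbar) eta(sbar) over the
# non-terminal states: eta = (I - P)^(-1) h.
P = p_pi[:2, :2]                     # P[s, sbar] = p_pi(s | sbar)
eta = np.linalg.solve(np.eye(2) - P, h[:2])
print(eta)                           # [1.0, 0.9]

# Monte Carlo check: average number of visits to each non-terminal state.
counts, n_episodes = np.zeros(2), 20000
for _ in range(n_episodes):
    s = int(rng.choice(3, p=h))
    while s != 2:                    # stop counting at the terminal state
        counts[s] += 1
        s = int(rng.choice(3, p=p_pi[:, s]))
print(counts / n_episodes)           # approx [1.0, 0.9]
```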
Sum of visitations over all states
This only works if the sum runs up to $T-1$, i.e., if the terminal state $S_T$ is not counted.
$$
\sum_s \eta(s) = \sum_s \sum_{t=0}^{T-1} P\{S_t = s\} = \sum_{t=0}^{T-1} \sum_s P\{S_t = s\} = \sum_{t=0}^{T-1} 1 = T
$$
On-policy Distribution
$$
\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})} = \frac{\eta(s)}{T} = \frac{h(s)}{T} + \sum_{\bar{s}} p_\pi(s \mid \bar{s})\, \mu(\bar{s})
$$
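A short check of this recursion, with the toy chain restated so the snippet runs on its own (again an assumed example, not from the notes):

```python
import numpy as np

# Toy chain restated (non-terminal states only): P[s, sbar] = p_pi(s | sbar).
P = np.array([[0.0, 0.0],
              [0.9, 0.0]])
h = np.array([1.0, 0.0])

eta = np.linalg.solve(np.eye(2) - P, h)
T = eta.sum()                        # expected number of non-terminal steps
mu = eta / T
print(mu.sum())                                  # 1.0
print(np.allclose(mu, h / T + P @ mu))           # the recursion for mu holds
```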
Expressing the expected return in terms of $\mu$
- Again, be careful with the summation index and its range when identifying the inner sum with $\eta(s)$ (third row).
- This is only valid for $G_0$.
$$
\begin{aligned}
\mathbb{E}[G_t] &= \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \sum_{t=1}^{T} \mathbb{E}[R_t] \\
&= \sum_{t=1}^{T} \sum_s P\{S_t = s\}\, r(s) = \sum_s r(s) \sum_{t=1}^{T} P\{S_t = s\} \\
&= \sum_s r(s)\, \eta(s) = T \sum_s r(s)\, \mu(s) = T\, \mathbb{E}\left[r(S) \mid S \sim \mu\right]
\end{aligned}
$$
This says that the value of a policy is the average reward over states, weighted by $\mu$, times the number of time steps.
(If we assume that $T$ is a random variable that changes between episodes, I don't think the formula above is written correctly.)
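To make the identity concrete, a sketch on the same assumed episodic chain (restated here), comparing a Monte Carlo estimate of $\mathbb{E}[G_0]$ with $\sum_s r(s)\,\eta(s) = T\,\mathbb{E}[r(S) \mid S \sim \mu]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative episodic chain, restated; r(s) is the expected reward
# collected while in state s.
p_pi = np.array([[0.0, 0.0, 0.0],
                 [0.9, 0.0, 0.0],
                 [0.1, 1.0, 1.0]])
h = np.array([1.0, 0.0, 0.0])
r = np.array([1.0, 2.0, 0.0])

# Visitations over the non-terminal states, T = sum_s eta(s), mu = eta / T.
eta = np.linalg.solve(np.eye(2) - p_pi[:2, :2], h[:2])
T = eta.sum()
mu = eta / T

# Monte Carlo estimate of E[G_0] vs. sum_s r(s) eta(s) = T * E[r(S) | S ~ mu].
returns = []
for _ in range(20000):
    s, g = int(rng.choice(3, p=h)), 0.0
    while s != 2:
        g += r[s]
        s = int(rng.choice(3, p=p_pi[:, s]))
    returns.append(g)
print(np.mean(returns), (r[:2] * eta).sum(), T * (r[:2] * mu).sum())  # all approx 2.8
```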
Ep 3: Discounting
$$
\begin{aligned}
\mathbb{E}[G_t] &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}[R_t] \\
&= \sum_{t=0}^{\infty} \gamma^t \sum_s P\{S_t = s\}\, r(s) = \sum_s r(s) \sum_{t=0}^{\infty} \gamma^t P\{S_t = s\} \\
&= \sum_s r(s)\, \eta(s) = \frac{1}{1-\gamma} \sum_s r(s)\, \mu(s) = \frac{1}{1-\gamma}\, \mathbb{E}\left[r(S) \mid S \sim \mu\right]
\end{aligned}
$$
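A numerical sketch of this identity on a small continuing MDP (the 3-state chain, $\gamma = 0.9$, and all variable names are illustrative assumptions): the start-weighted value $h^\top v$ and $\frac{1}{1-\gamma}\,\mathbb{E}[r(S) \mid S \sim \mu]$ agree.

```python
import numpy as np

# Illustrative 3-state continuing MDP (an assumption, not from the notes):
# p_pi[s_next, s] under the policy, r the expected state reward,
# h the start distribution, gamma the discount factor.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
r = np.array([1.0, 0.0, 2.0])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

# Discounted visitations eta = sum_t gamma^t P{S_t = .} = (I - gamma P)^(-1) h,
# and on-policy distribution mu = (1 - gamma) eta.
eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
mu = (1 - gamma) * eta

# E[G_0] two ways: start-weighted values h^T v, and E[r(S) | S ~ mu] / (1 - gamma).
v = np.linalg.solve(np.eye(3) - gamma * p_pi.T, r)
print(h @ v, (r @ mu) / (1 - gamma))   # equal up to floating-point error
```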
Bellman Equation for visitations
$$
\begin{aligned}
\eta(s) &\doteq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\left[\mathbb{1}\{S_t = s\}\right] = \sum_{t=0}^{\infty} \gamma^t \sum_{\bar{s}} P\{S_t = \bar{s}\}\, \mathbb{1}\{\bar{s} = s\} \\
&= \sum_{t=0}^{\infty} \gamma^t\, P\{S_t = s\} = P\{S_0 = s\} + \sum_{t=1}^{\infty} \gamma^t\, P\{S_t = s\} \\
&= h(s) + \sum_{t=1}^{\infty} \gamma^t \sum_{\bar{s}} P\{S_{t-1} = \bar{s}\}\, p_\pi(s \mid \bar{s}) \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{t=1}^{\infty} \gamma^t\, P\{S_{t-1} = \bar{s}\} \qquad \text{change variables with } k = t-1 \\
&= h(s) + \sum_{\bar{s}} p_\pi(s \mid \bar{s}) \sum_{k=0}^{\infty} \gamma^{k+1}\, P\{S_k = \bar{s}\} \\
&= h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s})
\end{aligned}
$$
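A quick check of this recursion on the same assumed continuing chain (restated so the snippet is self-contained): compute $\eta$ directly from its definition by truncating the infinite sum, then verify the Bellman equation for it.

```python
import numpy as np

# Same illustrative continuing chain, restated so this runs on its own.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

# eta from its definition, truncating the infinite sum at a large horizon.
eta = np.zeros(3)
d = h.copy()                           # d[s] = P{S_t = s}
for t in range(2000):                  # gamma^2000 is numerically negligible
    eta += gamma**t * d
    d = p_pi @ d                       # propagate the state distribution
print(np.allclose(eta, h + gamma * p_pi @ eta))   # Bellman equation for eta
```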
Sum of visitations over all states
$$
\begin{aligned}
\eta(s) &= h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s}) \\
\implies \sum_s \eta(s) &= \sum_s h(s) + \gamma \sum_s \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s}) \\
\implies \sum_s \eta(s) - \gamma \sum_{\bar{s}} \eta(\bar{s}) \underbrace{\sum_s p_\pi(s \mid \bar{s})}_{=\,1} &= 1 \\
\implies \sum_s \eta(s) - \gamma \sum_{\bar{s}} \eta(\bar{s}) &= 1 \\
\implies (1-\gamma) \sum_s \eta(s) &= 1 \\
\implies \sum_s \eta(s) &= \frac{1}{1-\gamma}
\end{aligned}
$$
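The normalization constant, checked numerically on the same assumed chain:

```python
import numpy as np

# Same illustrative continuing chain, restated.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
print(eta.sum(), 1 / (1 - gamma))      # both 10.0 (up to float error)
```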
On-policy Distribution
$$
\mu(s) \doteq \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})} = (1-\gamma)\,\eta(s) = (1-\gamma)\, h(s) + \gamma \sum_{\bar{s}} \mu(\bar{s})\, p_\pi(s \mid \bar{s})
$$
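Finally, a check that $\mu = (1-\gamma)\,\eta$ is a proper distribution and satisfies this recursion (same assumed chain, restated):

```python
import numpy as np

# Same illustrative continuing chain, restated.
p_pi = np.array([[0.5, 0.2, 0.1],
                 [0.3, 0.6, 0.4],
                 [0.2, 0.2, 0.5]])
h = np.array([0.8, 0.1, 0.1])
gamma = 0.9

eta = np.linalg.solve(np.eye(3) - gamma * p_pi, h)
mu = (1 - gamma) * eta
print(mu.sum())                                              # 1.0
print(np.allclose(mu, (1 - gamma) * h + gamma * p_pi @ mu))  # True
```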