I proved 1=0 starting from the formula for the on-policy distribution in
episodic tasks. Obviously there is a mistake; can you spot it?
The proof
From page 199 of Sutton's book:
Let h(s) denote the probability that an episode begins in each state s, and
let η(s) denote the number of time steps spent, on average, in state s
in a single episode. Time is spent in a state s if episodes start in s, or
if transitions are made into s from a preceding state s̄ in which time
is spent:

$$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a), \quad \text{for all } s \in S.$$
This system of equations can be solved for the expected number of visits η(s).
The on-policy distribution is then the fraction of time spent in each state,
normalized to sum to one:
$$\mu(s) = \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})}, \quad \text{for all } s \in S,$$
where the denominator in the previous formula is simply T, the average number of
time steps in an episode.
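To make these definitions concrete, here is a minimal Monte Carlo sketch that estimates η and μ by simply counting, over many simulated episodes, how many time steps are spent in each state. The three-state chain, its transition probabilities, and the function names are my own illustration, not something from the book; note also that the terminal state is simply never counted here, a choice we will come back to.

```python
import random
from collections import Counter

# A small, made-up episodic MDP: states 0 and 1 are non-terminal, 2 is terminal.
# The transition probabilities under the (fixed) policy are hard-coded below.
def step(s):
    if s == 0:
        return random.choices([0, 1], weights=[0.5, 0.5])[0]
    else:  # s == 1
        return random.choices([1, 2], weights=[0.3, 0.7])[0]

def estimate_eta_mu(n_episodes=100_000):
    counts = Counter()
    for _ in range(n_episodes):
        s = 0                      # every episode starts in state 0, i.e. h(0) = 1
        while s != 2:              # 2 is the terminal state, never counted here
            counts[s] += 1         # one time step spent in s
            s = step(s)
    eta = {s: c / n_episodes for s, c in counts.items()}   # average visits per episode
    total = sum(eta.values())                               # this plays the role of T
    mu = {s: v / total for s, v in eta.items()}             # on-policy distribution
    return eta, mu

if __name__ == "__main__":
    eta, mu = estimate_eta_mu()
    print(eta)  # roughly {0: 2.0, 1: 1.43}
    print(mu)   # roughly {0: 0.58, 1: 0.42}
```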
To make our formulas a little shorter, let us define p_π(s|s̄) as the probability
of going from state s̄ to state s under policy π:

$$p_\pi(s \mid \bar{s}) = \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a),$$

so that the recurrence becomes

$$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s})\, p_\pi(s \mid \bar{s}), \quad \text{for all } s \in S.$$

Now sum both sides over s. The starting probabilities h(s) sum to one, and, for every
s̄, the transition probabilities p_π(s|s̄) should also sum to one, so

$$\sum_s \eta(s) = \sum_s h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_s p_\pi(s \mid \bar{s}) = 1 + \sum_{\bar{s}} \eta(\bar{s}),$$

that is, T = 1 + T, and therefore 1 = 0.
Wat.
Can you spot the problem with the previous equations? Try to figure it out, or
keep reading for the answer.
The problem
To see what is going on, let us take a very simple problem, with two states,
A and B, where the latter is terminal. Starting from A, under p_π,
we stay in A with probability α, and move to B with probability
1 − α.
Note that the number of timesteps T of an episode is not fixed, because the
problem is not deterministic, but it is finite, because we are always going
to reach B at some point.
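As a quick sanity check (a sketch with α = 0.8 chosen arbitrarily, not part of the original argument), we can simulate this chain: every episode does terminate, the average number of time steps spent in A comes out to 1/(1 − α), and B is entered exactly once per episode.

```python
import random

ALPHA = 0.8          # arbitrary choice for this sketch
N_EPISODES = 100_000

steps_in_A, visits_to_B = 0, 0
for _ in range(N_EPISODES):
    s = "A"                       # every episode starts in A
    while s == "A":
        steps_in_A += 1
        # stay in A with probability ALPHA, otherwise move to the terminal state B
        s = "A" if random.random() < ALPHA else "B"
    visits_to_B += 1              # we always end up in B exactly once

print(steps_in_A / N_EPISODES)    # ~ 1 / (1 - ALPHA) = 5.0 : average time spent in A
print(visits_to_B / N_EPISODES)   # exactly 1.0            : B is reached once per episode
```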
Let us assume that all episodes start in A and compute η for each state:

$$\eta(A) = h(A) + \eta(A)\, p_\pi(A \mid A) + \eta(B)\, p_\pi(A \mid B) = 1 + \alpha\, \eta(A) + 0 \implies \eta(A) = \frac{1}{1 - \alpha},$$

$$\eta(B) = h(B) + \eta(A)\, p_\pi(B \mid A) + \eta(B)\, p_\pi(B \mid B) = (1 - \alpha)\, \eta(A) + \eta(B)\, p_\pi(B \mid B) = 1 + \eta(B)\, p_\pi(B \mid B).$$
This is where the problem begins. What is the value of p_π(B|B)?
First attempt (T < ∞)
We know that all episodes terminate as soon as we reach B.
Therefore, at least intuitively, we would like the value of the average count η(B)
to be 1, since we are always going to be in B exactly once.
The only way to make this happen, starting from our previous formula,

$$\eta(B) = 1 + \eta(B)\, p_\pi(B \mid B),$$

is by defining the probability of going from B to B to be 0:

$$\eta(B) = 1 + \eta(B) \cdot 0 = 1.$$
This seems to work nicely: both η(A) and η(B) are what we expect. Then,
why is the formula at the beginning of the article not working?
…
Because our probabilities do not add up! The probabilities of transitioning out of B
should sum to 1, but they actually sum to 0:

$$\sum_s p_\pi(s \mid B) = p_\pi(A \mid B) + p_\pi(B \mid B) = 0 + 0 = 0.$$
This, quite obviously, makes our original formula break when we assume
Σ_s p_π(s|s̄) = 1.
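To make this concrete (same sketch conventions as above, with α = 0.8 and p_π(B|B) = 0): plugging the actual row sums into the summation step of the "proof" shows exactly where the extra 1 came from.

```python
alpha = 0.8
eta = {"A": 1 / (1 - alpha), "B": 1.0}            # eta(A) = 5, eta(B) = 1
row_sum = {"A": 1.0, "B": 0.0}                    # sum over s of p_pi(s | s_bar)

lhs = sum(eta.values())                           # sum over s of eta(s) = T = 6
rhs = 1 + sum(eta[s] * row_sum[s] for s in eta)   # the correct summation step = 6
bogus = 1 + sum(eta.values())                     # what we get if we assume row sums are 1

print(lhs, rhs, bogus)   # 6.0 6.0 7.0 -> the "proof" of 1 = 0 relied on the bogus step
```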
Umhh … can we find another way?
Second attempt (T = ∞)
Let us try another option. If we consider B to be an absorbing state,
then p_π(B|B) should be 1. However, if we substitute that value we obtain
something weird:

$$\eta(B) = \frac{1}{1 - p_\pi(B \mid B)} = \frac{1}{0}.$$
To see what is going on, remember that η(s) represents
"the number of time steps spent, on average, in state s".
By definition, when reaching an absorbing state, like B, we stay there forever.
Thus, as soon as we reach B, we start looping, increasing indefinitely both
the total number of timesteps T and our count η(B).
This can be seen by taking the limit for p_π(B|B) → 1 in our previous formula:

$$\lim_{p_\pi(B \mid B) \to 1^-} \frac{1}{1 - p_\pi(B \mid B)} = +\infty.$$
Well, then, our formula is not wrong … at least in the limit. However, our
average count η(B) goes to infinity, making our on-policy distribution
μ(s) not so useful anymore. This is because B would take all the
weight, leading to μ(B) = 1 and μ(s) = 0 everywhere else.
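A quick numerical illustration of this blow-up (a sketch, again with α = 0.8 as an arbitrary choice): following the formula above, we keep p_π(A|B) = 0 and simply evaluate η(B) as p_π(B|B) approaches 1, which sends η(B), and with it μ(B), exactly where the limit says they should go.

```python
ALPHA = 0.8   # arbitrary choice for this sketch

# eta(A) = 1 / (1 - alpha); eta(B) = 1 / (1 - p_pi(B|B)) from the formula above
eta_A = 1.0 / (1.0 - ALPHA)
for p_BB in [0.9, 0.99, 0.999, 0.999999]:
    eta_B = 1.0 / (1.0 - p_BB)
    T = eta_A + eta_B
    print(f"p(B|B)={p_BB}: eta(B)={eta_B:.1f}, mu(A)={eta_A/T:.4f}, mu(B)={eta_B/T:.4f}")
# As p(B|B) -> 1, eta(B) -> infinity, mu(B) -> 1 and mu(A) -> 0.
```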
This outcome, too, is not satisfying. So, what should we conclude from all this?
Conclusion
In our first case, where the number of timesteps T is finite, we cannot
treat terminal states like every other state; otherwise, their outgoing
probabilities do not add up to one.
Similarly, in the second case, with infinite timesteps T, we cannot treat
absorbing states like every other state (unless we introduce discounting);
otherwise, the expected number of visits η(s) becomes infinite.
One solution that works for both cases is to work with two sets of states:
S, containing all states except terminal ones, and S+, containing every state,
terminal ones included. Then, rewrite our original formula so that the sum runs
only over the non-terminal states:

$$\eta(s) = h(s) + \sum_{\bar{s} \in S} \eta(\bar{s})\, p_\pi(s \mid \bar{s}), \quad \text{for all } s \in S^+.$$

This way we never need to specify the probability of leaving a terminal state,
and every η(s) stays finite.
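Here is a small sketch of that corrected computation (the function name and the use of numpy are my own; the A/B example and α = 0.8 are as above): the predecessor sum only ranges over non-terminal states, so we never have to invent a value for p_π(·|B).

```python
import numpy as np

def eta_mu(h, P, terminal):
    """Solve eta(s) = h(s) + sum over non-terminal s_bar of eta(s_bar) * P[s_bar, s].

    h        : (n,) start-state probabilities over all states S+
    P        : (n, n) transition matrix under the policy; rows of terminal
               states are irrelevant and never used
    terminal : boolean mask of terminal states
    """
    n = len(h)
    mask = ~np.asarray(terminal)
    # Zero out the rows of terminal states so they never act as predecessors.
    P_pred = np.where(mask[:, None], P, 0.0)
    # eta = h + P_pred^T eta  =>  (I - P_pred^T) eta = h
    eta = np.linalg.solve(np.eye(n) - P_pred.T, h)
    mu = eta / eta.sum()
    return eta, mu

# The two-state example: state 0 is A, state 1 is B (terminal), alpha = 0.8.
alpha = 0.8
h = np.array([1.0, 0.0])
P = np.array([[alpha, 1 - alpha],
              [0.0,   0.0]])          # B's row is never used
eta, mu = eta_mu(h, P, terminal=[False, True])
print(eta)   # [5. 1.]  ->  eta(A) = 1/(1-alpha), eta(B) = 1
print(mu)    # [0.8333... 0.1666...]
```

Whether B should then be kept in the normalization for μ, or dropped as in the Monte Carlo sketch earlier, is a separate modelling choice; here I simply keep it.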