A Reinforcement Learning Riddle
I proved starting from the formula for the on-policy distribution in episodic tasks. Obviously there is a mistake, can you spot it? 🤔.
From page 199 of Sutton's book:
Let denote the probability that an episode begins in each state , and let denote the number of time steps spent, on average, in state in a single episode. Time is spent in a sate if episodes start in , or if transitions are made into from a preceding state in which time is spent:
This system of equations can be solved for the expected number of visits . The on-policy distribution is then the fraction of time spent in each state normalized to sum to one:
Where the denominator in the previous formula is simply , the number of time steps in the episode.
To make our formulas a little shorter let us define as the probability of going from state to state under policy :
Wat. Can you spot the problem with the previous equations? Try to figure it out or keep reading to know the answer.
To see what is going on, let us take a very simple problem, with two states, and , where the latter is terminal. Starting from , under , we stay in with probability , and move to with probability .
Note that the number of timesteps of an episode is not fixed, because the problem is not deterministic, but it is finite, because we are always going to reach at some point.
Let us assume that all episodes start in and compute for each state:
This is where the problem begins. What is the value of ?
We know that all episodes terminate as soon as we reach . Therefore, at least intuitively, we would like the value of the average count to be , since we are always going to be in exactly once.
The only way to make this happen, starting from our previous formula,
is by defining the probability of going from to to be :
This seems to work nicely, both and are what we expect. Then, why is the formula at the beginning of the article not working?
Because our probabilities do not add up! The sum of the probabilities of going from to every other state should be , but is actually 0:
This, quite obviously, makes our original formula break when we assume
Umhh ... can we find an other way?
Let us try with an other option. If we consider to be an absorbing state, then should be . However, if we substitute that value we obtain something weird:
To see what is going on remember that represents "the number of time steps spent, on average, in state ". By definition, when reaching an absorbing state, like , we stay there forever. Thus, as soon as we reach , we start looping, increasing indefinitely both the total number of timesteps and our count
This can be seen by taking the limit for in our previous formula:
Well, then, our formula is not wrong ... at least in the limit. However, our average count goes to infinity, making our on-policy distribution not so useful anymore. This is because would take all the weight, leading to , and everywhere else.
This outcome too, is not satisfying. So, what should we conclude from all this?
- In our first case, where the number of timesteps is finite, we cannot treat terminal states like every other state, otherwise their probabilities would not add up.
- Similarly, in the second case, with infinite timesteps , we cannot treat absorbing states like every other state (unless we introduce discounting), otherwise the expected number of visits becomes infinite.
One solution that works for both cases, is to partition the states in two sets , containing all states, except terminal states, and containing every state, including terminal ones. Then, rewrite our original formula as:
Finally, let us revisit our original proof:
This is saying that, on average, in an episode, we are going to spend one timestep in terminal states.
It works! 🎉
That's it, in the end we did not manage to prove , but hopefully we learned something interesting along the way!