# A Reinforcement Learning Riddle

I proved $1=0$ starting from the formula for the *on-policy distribution in
episodic tasks*. Obviously there is a mistake, can you spot it? π€.

*proof*

The From page 199 of Sutton's book:

Let $h(s)$ denote the probability that an episode begins in each state $s$, and let $\eta(s)$ denote the number of time steps spent, on average, in state $s$ in a single episode. Time is spent in a sate $s$ if episodes start in $s$, or if transitions are made into $s$ from a preceding state $\bar s$ in which time is spent:

$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a|\bar s)p(s|\bar s,a), \text{ for all }s\in\mathcal S.$This system of equations can be solved for the expected number of visits $\eta(s)$. The on-policy distribution is then the fraction of time spent in each state normalized to sum to one:

$\mu(s) = \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)}, \text{ for all }s\in\mathcal S.$

Where the denominator in the previous formula is simply $T$, the number of time steps in the episode.

To make our formulas a little shorter let us define $p^\pi(s|\bar s)$ as the probability of going from state $\bar s$ to state $s$ under policy $\pi$:

Then:

**Wat.**
Can you spot the problem with the previous equations? Try to figure it out or
keep reading to know the answer.

## The problem

To see what is going on, let us take a very simple problem, with two states, $A$ and $B$, where the latter is terminal. Starting from $A$, under $p^\pi$, we stay in $A$ with probability $\alpha$, and move to $B$ with probability $1-\alpha$.

Note that the number of timesteps $T$ of an episode is not fixed, because the problem is not deterministic, but it is finite, because we are always going to reach $B$ at some point.

Let us assume that all episodes start in $A$ and compute $\eta$ for each state:

This is where the problem begins. What is the value of $p^\pi(B|B)$?

## $(T<\infty)$

First attemptWe know that all episodes terminate as soon as we reach $B$. Therefore, at least intuitively, we would like the value of the average count $\eta(B)$ to be $1$, since we are always going to be in $B$ exactly once.

The only way to make this happen, starting from our previous formula,

is by defining the probability of going from $B$ to $B$ to be $0$:

This seems to work nicely, both $\eta(A)$ and $\eta(B)$ are what we expect. Then, why is the formula at the beginning of the article not working?

...

Because our probabilities do not add up! The sum of the probabilities of going from $B$ to every other state should be $1$, but is actually 0:

This, quite obviously, makes our original formula break when we assume

Umhh ... can we find an other way?

## $(T=\infty)$

Second attemptLet us try with an other option. If we consider $B$ to be an absorbing state, then $p^\pi(B|B)$ should be $1$. However, if we substitute that value we obtain something weird:

To see what is going on remember that $\eta(s)$ represents
*"the number of time steps spent, on average, in state $s$"*.
By definition, when reaching an absorbing state, like $B$, we stay there forever.
Thus, as soon as we reach $B$, we start looping, increasing indefinitely both
the total number of timesteps $T$ and our count $\eta(B).$

This can be seen by taking the limit for $p^\pi(B|B)=1$ in our previous formula:

Well, then, our formula is not wrong ... at least in the limit. However, our average count $\eta(B)$ goes to infinity, making our on-policy distribution $\mu(s)$ not so useful anymore. This is because $B$ would take all the weight, leading to $\mu(B)=1$, and $\mu(s)=0$ everywhere else.

This outcome too, is not satisfying. So, what should we conclude from all this?

## Conclusion

- In our first case, where the number of timesteps $T$ is finite, we cannot treat terminal states like every other state, otherwise their probabilities would not add up.
- Similarly, in the second case, with infinite timesteps $T$, we cannot treat absorbing states like every other state (unless we introduce discounting), otherwise the expected number of visits $\eta(s)$ becomes infinite.

One solution that works for both cases, is to partition the states in two sets $\mathcal S$, containing all states, except terminal states, and $\mathcal S^+$ containing every state, including terminal ones. Then, rewrite our original formula as:

Finally, let us revisit our original *proof*:

This is saying that, on average, in an episode, we are going to spend one timestep in terminal states.

It works! π

That's it, in the end we did not manage to prove $1=0$, but hopefully we learned something interesting along the way!