# A Reinforcement Learning Riddle

I proved $1=0$ starting from the formula for the *on-policy distribution in
episodic tasks*. Obviously there is a mistake, can you spot it? π€.

## The *proof*

From page 199 of Suttonβs book:

Let $h(s)$ denote the probability that an episode begins in each state $s$, and let $\eta(s)$ denote the number of time steps spent, on average, in state $s$ in a single episode. Time is spent in a sate $s$ if episodes start in $s$, or if transitions are made into $s$ from a preceding state $\bar s$ in which time is spent:

$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a|\bar s)p(s|\bar s,a), \text{ for all }s\in\mathcal S.$This system of equations can be solved for the expected number of visits $\eta(s)$. The on-policy distribution is then the fraction of time spent in each state normalized to sum to one:

$\mu(s) = \frac{\eta(s)}{\sum_{\bar s}\eta(\bar s)}, \text{ for all }s\in\mathcal S.$

Where the denominator in the previous formula is simply $T$, the number of time steps in the episode.

To make our formulas a little shorter let us define $p^\pi(s|\bar s)$ as the probability of going from state $\bar s$ to state $s$ under policy $\pi$:

Then:

**Wat.**
Can you spot the problem with the previous equations? Try to figure it out or
keep reading to know the answer.

## The problem

To see what is going on, let us take a very simple problem, with two states, $A$ and $B$, where the latter is terminal. Starting from $A$, under $p^\pi$, we stay in $A$ with probability $\alpha$, and move to $B$ with probability $1-\alpha$.

Note that the number of timesteps $T$ of an episode is not fixed, because the problem is not deterministic, but it is finite, because we are always going to reach $B$ at some point.

Let us assume that all episodes start in $A$ and compute $\eta$ for each state:

This is where the problem begins. What is the value of $p^\pi(B|B)$?

## First attempt $(T<\infty)$

We know that all episodes terminate as soon as we reach $B$. Therefore, at least intuitively, we would like the value of the average count $\eta(B)$ to be $1$, since we are always going to be in $B$ exactly once.

The only way to make this happen, starting from our previous formula,

is by defining the probability of going from $B$ to $B$ to be $0$:

This seems to work nicely, both $\eta(A)$ and $\eta(B)$ are what we expect. Then, why is the formula at the beginning of the article not working?

β¦

Because our probabilities do not add up! The sum of the probabilities of going from $B$ to every other state should be $1$, but is actually 0:

This, quite obviously, makes our original formula break when we assume

Umhh β¦ can we find an other way?

## Second attempt $(T=\infty)$

Let us try with an other option. If we consider $B$ to be an absorbing state, then $p^\pi(B|B)$ should be $1$. However, if we substitute that value we obtain something weird:

To see what is going on remember that $\eta(s)$ represents
*βthe number of time steps spent, on average, in state $s$β*.
By definition, when reaching an absorbing state, like $B$, we stay there forever.
Thus, as soon as we reach $B$, we start looping, increasing indefinitely both
the total number of timesteps $T$ and our count $\eta(B).$

This can be seen by taking the limit for $p^\pi(B|B)=1$ in our previous formula:

Well, then, our formula is not wrong β¦ at least in the limit. However, our average count $\eta(B)$ goes to infinity, making our on-policy distribution $\mu(s)$ not so useful anymore. This is because $B$ would take all the weight, leading to $\mu(B)=1$, and $\mu(s)=0$ everywhere else.

This outcome too, is not satisfying. So, what should we conclude from all this?

## Conclusion

- In our first case, where the number of timesteps $T$ is finite, we cannot treat terminal states like every other state, otherwise their probabilities would not add up.
- Similarly, in the second case, with infinite timesteps $T$, we cannot treat absorbing states like every other state (unless we introduce discounting), otherwise the expected number of visits $\eta(s)$ becomes infinite.

One solution that works for both cases, is to partition the states in two sets $\mathcal S$, containing all states, except terminal states, and $\mathcal S^+$ containing every state, including terminal ones. Then, rewrite our original formula as:

Finally, let us revisit our original *proof*:

This is saying that, on average, in an episode, we are going to spend one timestep in terminal states.

It works! π

Thatβs it, in the end we did not manage to prove $1=0$, but hopefully we learned something interesting along the way!