
I'm glad to hear that. The main point I try to get across to people regarding Bellman equations is that they are very special-- these sorts of recursive equations allow us to express the value of an observation without knowing the past, and to improve our estimates of a state's value without having to wait for the future to unfold.

In most other situations you're forced to "wait and see" when you want to learn how a given strategy will turn out. This is not the case if you're dealing with an MDP. If the current state is `s`, the next state is `s'`, and the reward you got for transitioning between the two is `r`, then for a given value function V(.) you can express the temporal-difference error (which is sort of a gradient for the value function) as: δ = r + γV(s') - V(s) ≈ ∂V(s)
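To make that concrete, here's a minimal sketch of a tabular TD(0) update built around exactly this error term. The dict-based value table and the `alpha`/`gamma` defaults are illustrative assumptions, not anything specific from the discussion above:

    # Minimal sketch: one tabular TD(0) update. V is a dict mapping states to
    # estimated values; (s, r, s_next) is a single observed transition.
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        # the TD error delta = r + gamma*V(s') - V(s)
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        # nudge V(s) toward the bootstrapped target, no need to wait for the episode to end
        V[s] = V.get(s, 0.0) + alpha * delta
        return delta

The point of the sketch is that the update only touches the one transition you just observed, which is what lets you improve the estimate online instead of waiting for the full return.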

Other formulations of rewards/objectives don't tend to permit such elegant constructions, which is why MDPs are so special (and reinforcement learning so successful).

However, I feel it's a struggle to get that point across, so I'm interested in reading your next post to see how you convey it.



