
I'm glad to hear that. The main point I try to get across to people regarding Bellman equations is that they are very special-- these sorts of recursive equations allow us to express the value of an observation without knowing the past, and to improve our estimates of a state's value without having to wait for the future to unfold.

In most other situations you're forced to "wait and see" when you want to learn how a given strategy will turn out. This is not the case if you're dealing with an MDP. If the current state is `s`, the next state is `s'`, and the reward you got for transitioning between the two is `r`, then for a given value function V(.) you can express the temporal-difference error (which is sort of a gradient for the value function) as: δ = r + γV(s') - V(s) ≈ ∂V(s)
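To make that concrete, here's a minimal sketch of a tabular TD(0) update built around exactly this error term. The dict-based value table and the `alpha`/`gamma` defaults are illustrative assumptions, not anything specific from the discussion above:

    # Minimal sketch: one tabular TD(0) update. V is a dict mapping states to
    # estimated values; (s, r, s_next) is a single observed transition.
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        # the TD error delta = r + gamma*V(s') - V(s)
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        # nudge V(s) toward the bootstrapped target, no need to wait for the episode to end
        V[s] = V.get(s, 0.0) + alpha * delta
        return delta

The point of the sketch is that the update only touches the one transition you just observed, which is what lets you improve the estimate online instead of waiting for the full return.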

Other formulations of rewards/objectives don't tend to permit such elegant constructions, which is why MDPs are so special (and reinforcement learning so successful).

However, I feel it's a struggle to get that point across, so I'm interested in reading your next post to see how you convey it.



