a big conceptual point in RL is the focus on the Bellman equation: the value of a state equals the immediate reward plus the discounted value of whatever state comes next. if you know the value of every state, acting well is easy: always pick the move whose reward plus discounted next-state value is highest.
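to make that concrete, here is one common way to write it down. the notation is my assumption, not something fixed by the above: V* is the optimal value function, R(s, a) the immediate reward, P(s' | s, a) the transition probability, and gamma the discount factor between 0 and 1.

```latex
% Bellman optimality equation: value = best (immediate reward + discounted future value)
V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big]

% Acting greedily with respect to V^* gives an optimal policy
\pi^*(s) = \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big]
```

note that picking the best move from V* alone needs the rewards and transitions too, which is one reason Q-learning works with values of state-action pairs instead.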
well-known methods like Q-learning are basically just iterative, approximate ways of solving the Bellman equation, i.e. finding a measure of value for every state of the world (for Q-learning, every state-action pair) such that the Bellman equation is satisfied.
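a minimal sketch of what "iteratively solving the Bellman equation" looks like in the tabular case. everything about the environment here is made up for illustration: a 5-state chain where moving right off state 3 reaches a terminal state and pays reward 1. the names (step, q_learning, greedy_policy) and the hyperparameters are hypothetical choices, not a reference implementation.

```python
import random

# Hypothetical toy problem: states 0..4 in a chain, state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GAMMA = 0.9       # discount factor (assumed value)


def step(state, action):
    """Deterministic transition; reward 1.0 only when the terminal state is reached."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Bellman target: immediate reward plus discounted best future value.
            target = reward if done else reward + GAMMA * max(q[next_state])
            # Move the current estimate part of the way toward the target.
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the learned values."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    q_table = q_learning()
    print(greedy_policy(q_table))
    print(q_table[0])
```

each update nudges Q(s, a) toward the right-hand side of the Bellman equation, so once the table stops changing it satisfies the equation (approximately). on this toy chain the greedy policy should end up moving right in every state, with Q(0, right) close to 0.9 cubed.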
policy optimization methods don’t do this directly, since they adjust the policy itself rather than solving for values, but there are still mathematical connections back to the Bellman equation (there is a duality relationship between value functions and policies).
I would say this focus is a big part of what makes the field of RL unique.