My take is more that the framing of maximizing an externally provided reward for a single task only helps an agent that can rely on the rules of that task remaining fixed. Humans don't just try to do one task well; we switch between tasks. We don't really have episodes that reset; we have one long episode, and our actions from decades ago can still bear on our well-being today. What counts as a good state or a good policy varies substantially over time. The concept of the value of a state or (state, action) pair is ambiguous, because who knows what you'll want tomorrow? If the definition of value shifts but the rules of the world don't, then a world model becomes more "valuable".
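To make that last point concrete (this is my notation, not anything from the original exchange): in a standard MDP with reward $R$, dynamics $P$, and discount $\gamma$, the value of a state is defined only relative to a particular $R$, whereas the world model never mentions $R$ at all. A minimal sketch:

```latex
% Value is defined relative to a specific reward function R:
\[
V^{\pi}_{R}(s) \;=\; \mathbb{E}_{\pi,\,P}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
P(s' \mid s, a)\ \text{does not depend on } R .
\]
% If R is swapped out tomorrow but P stays fixed, every V^{pi}_{R}
% becomes stale, while P can be reused to plan (or recompute values)
% under the new reward.
```

So when what you want keeps changing but the world's rules don't, the asset that transfers is the model, not the value function.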
Also, now that I think of it, although we often talk about how much more data-efficient humans are compared to ML models, we could say that we're actually much more data-inefficient than most (or all) other animals. Our gestation period is long, and then we're totally useless as babies. We require a lot of learning and development early on in order to reap the benefits later (which include the generality and flexibility you mentioned, and also, for example, being able to generate our own synthetic data via imagination and counterfactual reasoning).