This strikes me as kind of ironic -- you'd think a language model would do better on essay prompts and multiple-choice reading comprehension questions than on calculations. I wonder if there are more details about these benchmarks somewhere, so we can see what's actually happening in these cases.
I don't find it ironic, because a language model is (currently?) the wrong tool for the job. When you are asked to write an essay, the essay itself is a byproduct. Of course it should be factually and grammatically correct, but that's not the point. The real task is forming a coherent argument and expressing it clearly -- and ideally making it interesting and convincing too.