It's trained on pre-2021 data. Looks like they tested on the most recent tests (i.e. 2022-2023) or practice exams. But yeah, standardized tests are heavily weighted towards pattern matching, which is what GPT-4 is good at, as shown by its failure at the hindsight neglect inverse-scaling problem.
I believe they showed that GPT-4 reversed the trend on the hindsight neglect problem. Search for "hindsight neglect" on the website and you can see that its accuracy on the problem shot up to 100%.