The real test for image generators is the image->text->image round trip. In other words, the model should be able to describe an image in words and then use those words to recreate the original image with high accuracy. The text representation of the image doesn't have to be English. It can be a program, e.g. a shader, that draws the image. I believe in 5-10 years it will be possible to give this tool a picture of a rainforest, tell it to write a shader that draws this forest, and then tell it to add Avatar-style flying rocks. Instead of these silly benchmarks, we'll read headlines like "GenAI 5.1 creates a 3D animation of a photograph of Niagara Falls in 3 seconds, in less than 4KB of code that runs at 60fps".
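To make the round-trip idea concrete, here's a minimal sketch of such a test harness in Python. `caption_image` and `generate_image` are hypothetical stand-ins (a toy average-colour captioner and a flat-fill renderer) for real i2t and t2i models, and raw pixel MSE stands in for a proper perceptual metric:

```python
import numpy as np

def caption_image(image: np.ndarray) -> str:
    # Hypothetical stand-in for an image-to-text model: it only reports
    # the image's size and average colour, so the harness runs end to end.
    r, g, b = image.reshape(-1, 3).mean(axis=0)
    return (f"a {image.shape[1]}x{image.shape[0]} image with "
            f"average colour ({r:.0f}, {g:.0f}, {b:.0f})")

def generate_image(prompt: str, size: tuple[int, int]) -> np.ndarray:
    # Hypothetical stand-in for a text-to-image model: it fills the canvas
    # with the colour named in the caption above.
    h, w = size
    rgb = [float(x) for x in prompt.split("(")[1].rstrip(")").split(",")]
    return np.full((h, w, 3), rgb, dtype=np.float32)

def round_trip_error(original: np.ndarray) -> float:
    # The round-trip test: describe the image, regenerate it from the
    # description, and measure how far the reconstruction drifts.
    # A perceptual metric (SSIM, LPIPS, CLIP similarity) would be a fairer
    # judge than raw pixel MSE.
    caption = caption_image(original)
    reconstruction = generate_image(caption, original.shape[:2])
    return float(np.mean((original.astype(np.float32) - reconstruction) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    photo = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float32)
    print(f"round-trip MSE: {round_trip_error(photo):.1f}")
```

With real models plugged in, a low reconstruction error on held-out photos would be exactly the "can it describe and then redraw" test described above.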
Why is that “the real test for image generators”? I mean, most image generators don't inherently include image->text functionality at all, so this seems more like a test of multimodal models that include both t2i and i2t functionality. But even then, I don't think humans would generally pass this test well (unless the human doing the describing was explicitly told that the purpose was reproduction, but that's not the usual purpose of either human or image2text model descriptions).