Why is that “the real test for image generators”? I mean, most image generators don't inherently include image->text functionality at all, so this seems more like a test of multimodal models that include both t2i and i2t functionality. Even then, I don't think humans would generally pass this test well, unless the human writing the description was explicitly told that the purpose was reproduction, and that's not the usual purpose of either human or image2text model descriptions.