This is a spot-on point. My prediction is that it wouldn't be able to. Given its difficulty generating the correct number of glasses, it seems to still struggle with systematic generalization and compositionality. As a point of reference, cherry-picking aside, it could model an obscure but probably well-defined prompt like "baby daikon radish in a tutu walking a dog", but couldn't model a stack of red-on-green-on-blue cubes. Maybe more sequential perception, action, or video data, or a System 2-like paradigm, would help, but that remains to be seen.