But don't we already know that composition exists in DALL-E? Don't the points shown in the tweet indicate that some form of composition exists? The 3D renders are clearly render-like, and the paintings and cartoons are clearly in the appropriate style.
"That there exist rules of composition of the hypothesized secret DALL-E language" is a much stronger claim than that it "understands" composition of text in the real languages it was trained on.
Though I'll also point out that even the evidence for that weaker claim is tenuous. It definitely knows how to move an image closer to "3D render" in concept-space, but it doesn't seem to understand the linguistic composition of your request. For example, you'd have an extremely hard time getting it to generate an image of a person using 3D rendering software, or a "person in any style that isn't 3D render"; it would probably just make 3D renders of people.
I haven't played around with it myself; I'm going off the experiences of others. For example: