We're not likely to see one. DALL-E was trained by analyzing pixels alongside the captions for images, and it's that caption information the model uses to understand your prompts.
But music doesn't have a form that ties the audio portion so cleanly to a textual description. And when it does, those labels tend to be overly simplified and not really helpful. Music is, in fact, hard to describe.
One project that might interest you is Every Noise at Once[0], which does an amazing job of grouping known artists and songs by their sonic similarity, which tends to track style and listener appeal.
It describes how a person might play it; that's all. It won't help anyone identify a song they've heard or pick one out by genre or style. That information isn't really "encoded" in sheet music.
[0] https://everynoise.com/