It would be very interesting indeed to have an ebook reader paired with Bluetooth earphones that simultaneously feeds the words on the page into this to generate an ambient soundtrack, perhaps also choosing music appropriate to the word choice on the page.
I found them very unsettling. My brain tries so hard to resolve words from that mess. This is the first time I've really thought about how the uncanny valley applies to spoken words.
You can already achieve that by combining models: use a dedicated speech synthesis model for the narration, then layer it over background effects from AudioGen.
Given that, I don't think AudioGen particularly needs to add full narration. That seems like a very different problem to me, likely requiring a completely different architecture.
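For anyone curious what that combination looks like in practice, here is a minimal sketch of the "layer narration over AudioGen ambience" idea. It assumes the audiocraft package for AudioGen and a narration.wav produced separately by whatever TTS system you prefer; the prompt, file names, and mixing gains are illustrative, not anything the AudioGen authors ship.

```python
# Sketch: mix a pre-rendered TTS narration over AudioGen-generated ambience.
# Assumptions: audiocraft is installed, and "narration.wav" comes from any TTS model.
import torch
import torchaudio
from audiocraft.models import AudioGen

# Generate a few seconds of background ambience from a text prompt.
ambience_model = AudioGen.get_pretrained("facebook/audiogen-medium")
ambience_model.set_generation_params(duration=10)
ambience = ambience_model.generate(["rain on a tin roof, distant thunder"])[0]  # [channels, samples]
sr = ambience_model.sample_rate

# Load the narration produced separately by a speech synthesis model.
narration, narration_sr = torchaudio.load("narration.wav")
if narration_sr != sr:
    narration = torchaudio.functional.resample(narration, narration_sr, sr)

# Pad the shorter track with silence so both have the same length,
# then mix with the ambience ducked well under the speech.
length = max(ambience.shape[-1], narration.shape[-1])
ambience = torch.nn.functional.pad(ambience, (0, length - ambience.shape[-1]))
narration = torch.nn.functional.pad(narration, (0, length - narration.shape[-1]))
mix = 0.3 * ambience.mean(dim=0, keepdim=True) + 1.0 * narration.mean(dim=0, keepdim=True)
mix = mix / mix.abs().max().clamp(min=1e-8)  # simple peak normalisation

torchaudio.save("narrated_scene.wav", mix.cpu(), sr)
```

A real ebook-reader pipeline would obviously need streaming and sentence-level alignment rather than offline mixing, but the division of labour is the same: speech comes from a TTS model, atmosphere comes from AudioGen, and a mixer stitches them together.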