I wonder why the input is always text. Couldn't it be text plus a low-quality Blender scene with a camera rig flying through space, a moodboard, sketches of the characters, etc.?
My guess is that it's because the models were all trained on text. You could do as you say, but I think it would go: Blender video -> {gets described by an AI as text} -> text prompt -> video.