Where is the training material for this coming from? The only resource I can think of that's broad enough for a general-purpose video model is YouTube, but I can't imagine Google would allow a third party to scrape all of YT without putting up a fight.
You can still start from a broad dataset and use RLHF to steer the model toward a desired aesthetic, the way Midjourney and SDXL did with Discord feedback. I think there was some aesthetic filtering in those datasets as well, but they still included a lot of crap.
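For anyone curious about the mechanics: that Discord-style feedback loop usually reduces to training a pairwise preference (Bradley-Terry) reward model on "user picked image A over image B" data, then using that model to rank or fine-tune generations. Here's a minimal sketch in PyTorch; the class name, embedding dimension, and random embeddings are all placeholders of mine, not anyone's actual pipeline.

```python
# Minimal sketch of the pairwise-preference step behind RLHF-style
# aesthetic steering. All names here are hypothetical, not
# Midjourney's or Stability's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AestheticRewardModel(nn.Module):
    """Scores an image embedding; higher = more preferred."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

def preference_loss(model, emb_win, emb_lose):
    """Bradley-Terry loss: push the chosen image's score above the rejected one's."""
    return -F.logsigmoid(model(emb_win) - model(emb_lose)).mean()

# Toy usage: random tensors stand in for image embeddings of two
# generations, where the user preferred the first over the second.
model = AestheticRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_win, emb_lose = torch.randn(32, 768), torch.randn(32, 768)
loss = preference_loss(model, emb_win, emb_lose)
opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, a scorer like this can filter or re-rank a broad, noisy dataset, or serve as the reward signal for fine-tuning the generator itself, which is roughly how you get a polished "house style" out of data that still included a lot of crap.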