I wonder whether the problem could even become sufficiently well defined to admit any agreed-upon loss function. Something like: "you must debate with the goal of maximising the aggregate wellbeing (definition required) of all living and future humans (and other relatable species)"?
It would require some sort of continuously tuned arbiter, i.e. similar to the reward model in RLHF, combined with an adversarial-style scheme a la GANs. But I really am spitballing here - research could absolutely go in a different direction.
But let's say you reduced it to some sort of "trying to prove a statement" task that can be verified by a discriminator model, then compared two iterations based on how accurately each proves the statement in plain English.
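A minimal toy sketch of that comparison loop might look like the following. Everything here is a hypothetical stand-in - `prove` and `discriminator_score` would be language models in practice, not the crude string heuristics used here - and the scoring rule is purely illustrative:

```python
def prove(statement: str, version: int) -> str:
    # Hypothetical stand-in for a model producing a natural-language proof.
    # Version 2 represents a later, more structured iteration.
    if version == 1:
        return f"Claim: {statement}. It holds because of X."
    return f"Claim: {statement}. Proof: assume the premises; X follows; hence the claim."

def discriminator_score(statement: str, proof: str) -> float:
    # Hypothetical stand-in for a discriminator/verifier model.
    # Here: a crude heuristic that rewards explicit proof structure.
    score = 0.0
    if statement in proof:
        score += 0.5  # proof actually addresses the statement
    if "Proof:" in proof:
        score += 0.3  # explicit proof section
    if "hence" in proof:
        score += 0.2  # explicit conclusion
    return score

def compare_iterations(statement: str) -> int:
    # Compare two prover iterations via the discriminator;
    # return the winning version number.
    s1 = discriminator_score(statement, prove(statement, 1))
    s2 = discriminator_score(statement, prove(statement, 2))
    return 1 if s1 >= s2 else 2

print(compare_iterations("the sum of two even numbers is even"))  # → 2
```

The interesting (and hard) part is of course the discriminator: in the RLHF/GAN analogy it would itself be trained, continuously, against the prover's outputs rather than fixed like the heuristic above.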