I have a question for folks working heavily with AI black boxes related to this: what methods do companies use to test the quality of outputs? Testing the integration itself can be treated pretty much the same as testing around any third-party service, but what I've seen is some teams using models to test the output quality of other models... which doesn't instinctively seem great.
Take this with a grain of salt because I haven't done it myself, but I would treat this the same as testing anything that has some element of randomness.
Say you're writing a random number generator that produces numbers between 0 and 100. How would you test it? Throw your hands up in the air and say nope, can't test it, it's not deterministic? Or you could just run it 1000 times and make sure all the numbers are indeed between 0 and 100. Maybe count up the number frequencies and verify they're roughly uniform. There's lots of things you can check for.
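For illustration, a minimal sketch of that kind of statistical check in Python, assuming a hypothetical rng() that's supposed to return ints in [0, 100]:

    import random
    from collections import Counter

    def rng():
        # stand-in for whatever generator you're actually testing
        return random.randint(0, 100)

    def test_rng_range_and_spread(samples=1000):
        values = [rng() for _ in range(samples)]
        # every value must land in the documented range
        assert all(0 <= v <= 100 for v in values)
        # crude uniformity check: no single value should dominate
        # (expected frequency is roughly samples / 101, i.e. ~10 here)
        counts = Counter(values)
        assert max(counts.values()) < samples * 0.05
        print("rng smoke test passed")

    if __name__ == "__main__":
        test_rng_range_and_spread()

The bounds are deliberately loose; the point is catching gross breakage, not proving statistical uniformity.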
So do the same with your LLMs. Test them on your specific use cases. Do some basic smoke tests. Are you asking yes or no questions? Is it responding with yes or no? Try some of your prompts, get a feel for what it outputs, and write some regexes to verify the outputs stay sane when there's a model upgrade.
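A rough sketch of what such a smoke test might look like, assuming a hypothetical ask_llm(prompt) wrapper around whatever provider you use (the function name and prompts are made up for illustration):

    import re

    def ask_llm(prompt: str) -> str:
        # hypothetical wrapper; swap in your real client call here
        raise NotImplementedError("call your model here")

    SMOKE_CASES = [
        # (prompt, regex the answer should match)
        ("Is 2 + 2 equal to 4? Answer yes or no.", re.compile(r"^\s*yes\b", re.IGNORECASE)),
        ("Is the sky green? Answer yes or no.", re.compile(r"^\s*no\b", re.IGNORECASE)),
    ]

    def run_smoke_tests():
        failures = []
        for prompt, pattern in SMOKE_CASES:
            answer = ask_llm(prompt)
            if not pattern.search(answer):
                failures.append((prompt, answer))
        for prompt, answer in failures:
            print(f"FAIL: {prompt!r} -> {answer!r}")
        if not failures:
            print(f"{len(SMOKE_CASES)} smoke tests passed")

    if __name__ == "__main__":
        run_smoke_tests()

Run the same suite before and after a model or prompt change and you at least know whether the basics still hold.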
For "quality" I don't think there's a substitute than humans. Just try it. If the outputs feel good, add your unit tests. If you want to get scientific, do blind tests with different models and have humans rate them.