So for context, our app (https://nativi.sh) is a language correction app. It takes in text and cleans it up to make it sound more fluent/correct; it's basically geared towards being Grammarly for your second language.

For some of our deterministic LLM tests, we have inputs with known spelling errors but no wrong-word errors, or some other known combination of errors. If the config under test doesn't identify the issues, or identifies issues that we know aren't there, it's marked as wrong for that test case. That lets us test across config x language x kind_of_error.
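
Roughly, the pass/fail check looks something like this (a simplified sketch in Python; the field names and the detect_errors() call are stand-ins, not the actual framework):

    # Hypothetical sketch of the deterministic check: a case is wrong if the
    # config misses a planted error or flags one that isn't there.
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        text: str
        language: str
        expected_errors: set[str]  # e.g. {"spelling"}, the errors we planted

    def detect_errors(config: dict, case: TestCase) -> set[str]:
        """Stand-in for the config-under-test's error detection call."""
        raise NotImplementedError

    def passes(config: dict, case: TestCase) -> bool:
        found = detect_errors(config, case)
        missed = case.expected_errors - found    # known errors it didn't catch
        spurious = found - case.expected_errors  # errors it invented
        return not missed and not spurious

    # Aggregate pass rates per (config, language, kind_of_error) bucket.
    def score(configs: dict[str, dict], cases: list[TestCase]) -> dict[tuple, float]:
        buckets: dict[tuple, list[bool]] = {}
        for name, cfg in configs.items():
            for case in cases:
                for kind in case.expected_errors:
                    buckets.setdefault((name, case.language, kind), []).append(passes(cfg, case))
        return {key: sum(results) / len(results) for key, results in buckets.items()}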

For the LLM vibe-driven scoring, we have it set up to just do a head-to-head between the current leading config (usually what's in prod) and the new candidate config, rather than generating an abstract score. It will flag "x config straight up failed question N based on some_reason(s)" so that we can manually check it.
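
The head-to-head step is along these lines (again a simplified sketch; the judge prompt and the call_llm() helper are just for illustration):

    # Hypothetical sketch of the judge: compare leader vs candidate on the same
    # input and surface hard failures (with reasons) for manual review.
    import json

    JUDGE_PROMPT = """You are comparing two outputs of a language-correction system.
    Question {n}. Input text: {text}
    Output A (current leader): {a}
    Output B (candidate): {b}
    Reply as JSON: {{"winner": "A", "B" or "tie",
                     "failed": list of sides that clearly failed,
                     "reasons": list of short reason strings}}"""

    def call_llm(prompt: str) -> str:
        """Stand-in for whatever LLM client does the judging."""
        raise NotImplementedError

    def head_to_head(n: int, text: str, leader_out: str, candidate_out: str) -> dict:
        prompt = JUDGE_PROMPT.format(n=n, text=text, a=leader_out, b=candidate_out)
        verdict = json.loads(call_llm(prompt))
        # Flag outright failures instead of folding them into an abstract score.
        for side in verdict.get("failed", []):
            which = "leading" if side == "A" else "candidate"
            print(f"{which} config failed question {n}: {verdict.get('reasons')}")
        return verdict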

My partner wrote the testing framework. She's been thinking about cleaning it up and open sourcing it.



