So for context, our app (https://nativi.sh) is a language correction app. It takes in text and cleans it up to make it sound more fluent/correct. It's basically geared towards being Grammarly for your second language.
For some of our deterministic LLM tests, we have inputs with known spelling errors but no wrong-word errors, or some other combination of errors. If the config under test doesn't identify an issue that is there, or identifies issues that we know aren't there, then it's marked as wrong for that test case. Then we can test across config x language x kind_of_error.
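A rough sketch of what that kind of check could look like. This is not the actual framework: `Case`, `score`, and the error-kind labels are made-up names for illustration, and each config is assumed to be a function that returns the set of error kinds it flagged.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    language: str
    text: str
    expected: frozenset[str]  # error kinds known to be present (hypothetical labels)


# A "config" here is just a callable returning the error kinds it flagged.
Detector = Callable[[Case], set]


def score(configs: dict[str, Detector], cases: list[Case]) -> dict:
    """Tally (passed, total) per (config, language, kind_of_error) cell."""
    tally = defaultdict(lambda: [0, 0])
    for name, detect in configs.items():
        for case in cases:
            found = detect(case)
            # Group by the combination of errors the case is known to contain.
            kind = "+".join(sorted(case.expected)) or "none"
            cell = tally[(name, case.language, kind)]
            cell[1] += 1
            # Wrong if it misses a known issue OR flags one that isn't there.
            if found == case.expected:
                cell[0] += 1
    return {key: tuple(value) for key, value in tally.items()}
```

Exact set equality is the strict reading of "marked as wrong": a config that catches the spelling error but also invents a wrong-word error still fails the case.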
For the LLM vibe-driven scoring, we have it set up to just do a head-to-head between the current leading config (usually what's in prod) and the new candidate config, rather than generating an abstract score. It will flag "x config straight up failed question N based on some_reason(s)" so that we can manually check it.
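The head-to-head setup could be sketched roughly like this. Again, this is an illustration under assumptions, not the real framework: `Verdict`, `head_to_head`, and the judge signature are hypothetical, and in practice the judge would wrap an LLM call rather than plain Python.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    winner: str                                        # "leader", "candidate", or "tie"
    failed: list[str] = field(default_factory=list)   # sides judged to have outright failed
    reasons: list[str] = field(default_factory=list)  # why, for manual review


# Hypothetical judge signature: (question, leader_output, candidate_output) -> Verdict
Judge = Callable[[str, str, str], Verdict]


def head_to_head(
    questions: list[str],
    leader: Callable[[str], str],
    candidate: Callable[[str], str],
    judge: Judge,
) -> tuple[dict[str, int], list[str]]:
    """Compare two configs question by question; return win counts and failure flags."""
    wins = {"leader": 0, "candidate": 0, "tie": 0}
    flags: list[str] = []
    for n, question in enumerate(questions, start=1):
        verdict = judge(question, leader(question), candidate(question))
        wins[verdict.winner] += 1
        # Surface outright failures so a human can check them.
        for side in verdict.failed:
            flags.append(
                f"{side} config failed question {n}: {', '.join(verdict.reasons)}"
            )
    return wins, flags
```

Keeping the output as win counts plus a list of flagged failures avoids an abstract score while still leaving a short list for manual checking.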
My partner wrote the testing framework. She's been thinking about cleaning it up and open sourcing it.
"7 likes / no comments" --> should I read it as:
people are interested in other people's experience, but have nothing to share about their own?
- No prompts in production?
- No testing or other routines around them yet?
Between HN and Product Hunt on the same day we took in about 8,000 unique visitors, which resulted in over 350 signups.
Big drop-off since that early traffic spike, but it looks like 40% of my traffic is still returning users. So that is interesting...
That traffic spike exposed a few big bugs which we closed this week, and now I'm figuring out next steps (marketing automation, more user acquisition, increasing sharing/virality).
Also, the more users I talk to the more I understand their use cases.
All in all, I'm loving it despite juggling this and my day job :-)
Kinda like my own personal HN (lol)... I post links with hashtags in the title. I added a login, but you can post anonymously. I just haven't worked on it in ages.
I could see it being used. Lots of people get hundreds of emails a day pitching them something or asking for help. Charging senders (even a small amount) might get them to put more effort into their emails.
And what about the LLM scoring: does the LLM output just passed/not_passed, or is it more than that?