> How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.
We support two different modes: a strict pass/fail mode, where an answer has to contain all of the information we expect, and a rubric-based mode, where answers are bucketed into categories like "partially correct" or "wrong but helpful".
To be honest, we also get the grading wrong sometimes. If you see anything egregious, please email me the topic you used at max@talc.ai.
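To make that concrete, here's a rough sketch of what the two modes look like. This is illustrative only, not our production grader -- the function and label names are placeholders, and in practice the coverage check is semantic rather than literal:

```python
# Rough sketch of the two grading modes -- illustrative only, not our production grader.
from enum import Enum

class RubricGrade(Enum):
    CORRECT = "correct"
    PARTIALLY_CORRECT = "partially correct"
    WRONG_BUT_HELPFUL = "wrong but helpful"
    WRONG = "wrong"

def grade_strict(answer: str, expected_facts: list[str]) -> bool:
    # Strict pass/fail: the answer must cover every fact we expect.
    # (A real grader would check semantic coverage, e.g. with a judge model,
    # rather than doing naive substring matching.)
    return all(fact.lower() in answer.lower() for fact in expected_facts)

def grade_rubric(judge, question: str, answer: str, expected: str) -> RubricGrade:
    # Rubric mode: ask a judge model to bucket the answer instead of pass/fail.
    labels = ", ".join(g.value for g in RubricGrade)
    verdict = judge(
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Candidate answer: {answer}\n"
        f"Grade the candidate as exactly one of: {labels}"
    )
    return RubricGrade(verdict.strip().lower())
```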
> How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?
You guessed right here -- we connect to the knowledge base like a RAG app. We also use this to generate the questions -- think of it like reading questions out of a textbook to quiz someone.
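Conceptually, the generation loop looks something like the sketch below. The `retrieve` and `llm` callables are placeholders for whatever retriever and generation model you plug in, not our actual interfaces:

```python
# Conceptual sketch of question generation against a knowledge base.
# `retrieve` and `llm` are placeholders for your retriever and generation model.
def generate_qa_pairs(retrieve, llm, topic: str, n: int = 5) -> list[dict]:
    passages = retrieve(topic)  # pull source-of-truth passages, like pages of a textbook
    qa_pairs = []
    for passage in passages[:n]:
        question = llm(
            f"Write one quiz question that can be answered from this passage:\n{passage}"
        )
        expected = llm(
            f"Using only this passage, answer the question.\n"
            f"Passage: {passage}\nQuestion: {question}"
        )
        qa_pairs.append({"question": question, "expected_answer": expected, "source": passage})
    return qa_pairs
```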
> And in your examples what model do you actually use?
We use multiple models for the question generation, and we're still evaluating what works best. For the demo, we are "quizzing" OpenAI's GPT-3.5 Turbo model.
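The "quizzing" part is just sending each generated question to the model under test, roughly like this (a minimal sketch using the OpenAI Python client, not our exact harness):

```python
# Minimal sketch of how the demo "quizzes" a model under test, using the
# OpenAI Python client (not our exact harness).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def quiz_model(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```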