
Fantastic FAQ, thank you, Hamel, for writing it up. We had an open space on AI Evals at PyCon this year and had lots of discussion around similar questions. I only wrote down the questions, however:

## Evaluation Metrics & Methodology

* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful? (A small sketch follows this list.)

* Do you use step-by-step evaluations or evaluate full responses?

* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?

* How do you approach offline (ground truth) vs. online evaluation?

* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)

* How do you evaluate multi-turn conversations?

* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret.

* It’s important to counteract bias toward your own favorite eval questions—ensure a diverse dataset.
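To make the first question above concrete, here is a minimal sketch of a token-overlap F1 score (the SQuAD-style metric), written in plain Python rather than taken from any particular eval framework; the function name and examples are mine. Whether surface-overlap metrics like this are still useful for open-ended generations is exactly the open question in the list.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris", "Paris is the capital of France"))  # 1.0
print(token_f1("It is Lyon", "Paris is the capital of France"))  # ~0.22
```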

## Prompting & Models

* Do you modify prompts based on the specific app being evaluated?

* Where do you store prompts—text files, Prompty, database, or in code? (See the sketch after this list.)

* Do you have domain experts edit or review prompts?

* How do you choose which model to use?
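For the prompt-storage question, one common low-tech option is plain text files kept next to the code so prompt changes show up in pull requests where domain experts can review them. The directory layout and helper below are hypothetical, not from any specific project:

```python
from pathlib import Path

# Hypothetical layout: prompts/ holds one .txt file per prompt, versioned in git
# alongside the application code.
PROMPT_DIR = Path(__file__).parent / "prompts"

def load_prompt(name: str, **variables: str) -> str:
    """Read prompts/<name>.txt and fill in {placeholders} via str.format."""
    template = (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)

# Usage: prompts/summarize.txt might contain "Summarize the following text:\n{document}"
# prompt = load_prompt("summarize", document=article_text)
```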

## Evaluation Infrastructure

* How do you choose an evaluation framework?

* What platforms do you use to gather domain expert feedback or labels?

* Do domain experts label outputs or also help with prompt design?

## User Feedback & Observability

* Do you collect thumbs up / thumbs down feedback? (A logging sketch follows this list.)

* How does observability help identify failure modes?

* Do models tend to favor their own outputs? (There's research on this.)
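On collecting thumbs up / thumbs down: the main trick is to record each vote with the trace or request ID so it can be joined back to the logged model output later. The schema and JSONL store below are made up purely for illustration:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class FeedbackEvent:
    trace_id: str        # ID of the logged request/response this vote refers to
    thumbs_up: bool      # True for thumbs up, False for thumbs down
    comment: str = ""    # optional free-text from the user
    timestamp: float = 0.0

def record_feedback(event: FeedbackEvent, path: str = "feedback.jsonl") -> None:
    """Append the vote as one JSON line; join on trace_id during offline analysis."""
    event.timestamp = event.timestamp or time.time()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

record_feedback(FeedbackEvent(trace_id="req-123", thumbs_up=False, comment="hallucinated a citation"))
```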

I personally work on adding evaluation to our most popular Azure RAG samples, and put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results: https://github.com/Azure-Samples/ai-rag-chat-evaluator
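Not the actual code from that repo, but a minimal sketch of the same idea: a Textual app that loads eval results from a JSONL file (the filename and field names here are hypothetical) into a DataTable for review in the terminal.

```python
import json
from textual.app import App, ComposeResult
from textual.widgets import DataTable

class EvalReviewApp(App):
    """Browse per-question eval results in a scrollable terminal table."""

    def compose(self) -> ComposeResult:
        yield DataTable()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "groundedness", "relevance", "latency_s")
        # Hypothetical results file: one JSON object per evaluated question.
        with open("eval_results.jsonl", encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                table.add_row(row["question"], row["groundedness"],
                              row["relevance"], row["latency_s"])

if __name__ == "__main__":
    EvalReviewApp().run()
```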



This is Hamel. Thanks for sharing! I will incorporate these into the FAQ. I love getting additional questions like this.


Any chance you can share what the answers were for choosing an evaluation framework?



