
Yeah exactly, the existing benchmark datasets are underutilized (e.g. KILT, Natural Questions, etc.).

But it is only natural that different QA use cases require different strategies. I've built 3 production RAG systems / virtual assistants now, plus 4 that didn't make it past PoC, and which advanced techniques work really depends on the document type, text content and genre, use case, source knowledge base structure, metadata available to exploit, etc.

Current go-to is semantic similarity chunking (with overlap) + title or question generation > a retriever that fuses bi-encoder vector similarity with classic BM25 > a QA agent that answers a condensed, reformulated question. If you don't get decent results with that setup, there is no hope.
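Roughly, the retrieval side looks like the sketch below. This is a minimal toy example, not our production code: it assumes sentence-transformers for the bi-encoder and rank_bm25 for the sparse side, and it leaves out the chunking and question-generation steps.

    # pip install sentence-transformers rank_bm25
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    # Toy chunks; in practice these come out of semantic similarity chunking.
    chunks = [
        "How to reset your password in the admin console.",
        "Billing cycles start on the first of each month.",
        "Contact support via the in-app chat widget.",
    ]

    # Dense side: bi-encoder embeddings, scored by cosine similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_emb = model.encode(chunks, convert_to_tensor=True)

    # Sparse side: classic BM25 over whitespace tokens.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])

    def retrieve(query):
        # Rank chunk indices by dense similarity, best first.
        sims = util.cos_sim(model.encode(query, convert_to_tensor=True), chunk_emb)[0]
        dense = sorted(range(len(chunks)), key=lambda i: float(sims[i]), reverse=True)
        # Rank chunk indices by BM25 score, best first.
        scores = bm25.get_scores(query.lower().split())
        sparse = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
        # The two rankings then get fused (reciprocal rank fusion, see below).
        return dense, sparse

    print(retrieve("how do I change my password?"))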

For every project we start building a use-case eval set immediately, in parallel with the actual RAG agent, even though the client sometimes doesn't see it as a priority. We've convinced them all that it's highly important, because it is.

Having an evaluation set is doubly important in GenAI projects: a generative system will do unexpected things, and you need an objective measure. Your client will run into weird behaviour when testing, and they will get hung up on a 1-in-100 undesirable generation.
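Concretely, the eval set doesn't have to be fancy at the start: just questions with reference answers that you score the pipeline against on every change. A made-up sketch (rag_agent and grade are placeholders for your own pipeline and scoring function, not real code from any of these projects):

    # Hypothetical items; in practice they come from real user questions
    # curated with the client's domain experts.
    eval_set = [
        {"question": "How do I reset my password?",
         "reference": "Use the 'Forgot password' link in the admin console."},
        {"question": "When does my billing cycle start?",
         "reference": "On the first of each month."},
    ]

    def run_eval(rag_agent, grade):
        # rag_agent: question -> generated answer; grade: (answer, reference) -> score.
        scores = [grade(rag_agent(item["question"]), item["reference"])
                  for item in eval_set]
        return sum(scores) / len(scores)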




How do you weight results between vector search and BM25? Do you fall back to BM25 when vector similarity is below a threshold, or do you tweak the weights by hand for each dataset?


The algorithm I use to get a final ranking from multiple rankings is called "reciprocal rank fusion". I use the implementation described here: https://docs.llamaindex.ai/en/stable/examples/low_level/fusi...

That's the implementation from the original paper.
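For reference, a bare-bones sketch of reciprocal rank fusion itself (k=60 as in the original paper; this is my own illustration, not the code from the link):

    def reciprocal_rank_fusion(rankings, k=60):
        # rankings: list of ranked lists of doc ids, best first.
        # Each list contributes 1 / (k + rank) to a doc's fused score.
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        # Higher fused score is better.
        return sorted(scores, key=scores.get, reverse=True)

    # Fuse a vector-similarity ranking with a BM25 ranking.
    print(reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d3", "d5"]]))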


Thanks, much appreciated!



