I agree too. My impression is that almost all RAG tutorials _only_ talk about vector DBs, even though these are not strictly required for Retrieval Augmented Generation. I'm guessing vector DBs are useful when you have a massive number of documents on diverse topics.
Some gotchas I experienced (but I might be using the wrong embedding/vector DB: spaCy/FAISS):
- Short user questions might result in a low-signal query vector, e.g. user: "Who is Keanu Reeves?" -> false positives on Wikipedia articles which merely contain "Who is"
- Typos and formatting affect the vectorization, and a small difference might lead to a miss, e.g. "Who is Keanu Reeves?" -> match, "Who is keanu Reeves?" -> no match, and no match with any other capitalization either (see the sketch below)
If there's only a single document, a simple keyword search might lead to better results.
In my experience, false positives (retrieving an irrelevant text and generating a completely wrong answer) are a bigger problem than false negatives (not retrieving any text, so the question possibly can't be answered).
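To make the capitalization gotcha concrete, here's a minimal sketch of the kind of pipeline I mean, assuming spaCy's en_core_web_md vectors and faiss-cpu (not my exact setup, and the chunk texts are made up); whether the lowercased query still matches depends entirely on the model's vocabulary:

```python
# Minimal sketch, assuming en_core_web_md and faiss-cpu are installed
# (python -m spacy download en_core_web_md). Chunk texts are made up.
import faiss
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def embed(text: str) -> np.ndarray:
    # spaCy doc vectors are the average of the token vectors, so an
    # out-of-vocabulary token like "keanu" (vs. "Keanu") shifts the average.
    v = nlp(text).vector
    return (v / (np.linalg.norm(v) + 1e-9)).astype("float32")

docs = [
    "Keanu Reeves is a Canadian actor.",
    "Who is the current president of France?",
]
index = faiss.IndexFlatIP(nlp.vocab.vectors_length)  # inner product = cosine on unit vectors
index.add(np.stack([embed(d) for d in docs]))

for query in ["Who is Keanu Reeves?", "Who is keanu Reeves?"]:
    scores, ids = index.search(embed(query)[None, :], 1)
    print(f"{query!r} -> {docs[ids[0][0]]!r} (score={scores[0][0]:.3f})")
```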
Does anybody have experience with Apache Lucene/Solr or Elasticsearch?
> "Has somebody experience with Apache Lucene / Solr or Elasticsearch?"
I've been working on a RAG system with Solr and quickly hit some of the issues you describe when dealing with real-world messy data and user input. For example, using all-MiniLM-L6-v2 and cosine similarity, "Can you summarize Immanuel Kant's biography?" matched a chunk containing just the word "Biography" rather than one which started "Immanuel Kant, born in 1724...", and "How high is Ben Nevis?" matched a chunk of text about someone called Benjamin rather than a chunk about mountains containing the words "Ben Nevis" and its height[0]. Switching embedding models has helped, but I'm still not convinced that vector search alone is the silver bullet some claim it is. Still lots more to try though, e.g. hybrid search[1], query expansion[2], knowledge graphs etc.
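For what it's worth, this failure mode is easy to reproduce outside Solr with sentence-transformers directly; a rough sketch (the chunk texts here are paraphrased stand-ins for my data, and exact scores will vary):

```python
# Rough sketch of the mismatch described above, using sentence-transformers
# directly (pip install sentence-transformers). Chunks are stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Biography",
    "Immanuel Kant, born in 1724 in Königsberg, was a German philosopher...",
]
query = "Can you summarize Immanuel Kant's biography?"

chunk_emb = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and each chunk; a bare heading
# like "Biography" can score surprisingly close to the substantive chunk.
scores = util.cos_sim(query_emb, chunk_emb)[0]
for chunk, score in zip(chunks, scores):
    print(f"{float(score):.3f}  {chunk[:60]}")
```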
I'm in exactly the same place as you with Elasticsearch (8.11). I went down the vector path to get better matches for adjectives, verbs and negations ("room with no skylight" vs. "room with skylights" and "room with a large skylight"). Different dataset obviously, but I think I get slightly better results than your examples, so it might be worth looking for a different sentence transformer (I tried a few and settled on roberta-base-nli-stsb-mean-tokens).
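If it helps, this is roughly how I compared candidate models on the negation cases before touching the Elasticsearch index (just a sketch; the sentences stand in for my data):

```python
# Quick way to compare candidate sentence transformers on negation
# handling before wiring anything into Elasticsearch.
from sentence_transformers import SentenceTransformer, util

sentences = [
    "room with no skylight",
    "room with skylights",
    "room with a large skylight",
]

for name in ["all-MiniLM-L6-v2", "roberta-base-nli-stsb-mean-tokens"]:
    model = SentenceTransformer(name)
    emb = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
    # Cosine similarity of the negated sentence vs. the others;
    # lower scores mean the model separates the negation better.
    sims = util.cos_sim(emb[0], emb[1:])[0]
    print(name, [f"{float(s):.3f}" for s in sims])
```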
I was reading through open llama, and it looks like the way to get pertinent results is via a different ranking algorithm, with scores based on convergence, and then shoving that back into the LLM.
If you know that your search queries will be actual questions (like in the examples you listed), you can possibly use HyDE[0] to create a hypothetical answer, which will usually have an embedding that's closer to the RAG chunks you are looking for.
It has the downside that an LLM (rather than just an embedding model) is used in the query path, but it has helped me multiple times in the past to strongly reduce RAG problems like the ones you outlined, where the retrieval likes to latch onto individual words.
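A minimal sketch of the idea, assuming sentence-transformers for the embeddings; generate_hypothetical_answer() is a hard-coded placeholder for whatever LLM call you'd actually make:

```python
# Minimal HyDE sketch: embed a hypothetical LLM-generated answer
# instead of the raw question, then search with that vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_hypothetical_answer(question: str) -> str:
    # Placeholder for an LLM call, e.g. prompting with
    # "Write a short passage that answers: {question}".
    # Hard-coded here so the sketch runs without an LLM.
    return "Keanu Reeves is a Canadian actor best known for The Matrix."

def hyde_query_vector(question: str) -> np.ndarray:
    hypothetical = generate_hypothetical_answer(question)
    # The fake answer is phrased like a document chunk, so its embedding
    # tends to sit closer to the relevant chunks than the bare question.
    return model.encode(hypothetical, normalize_embeddings=True)

vec = hyde_query_vector("Who is Keanu Reeves?")  # use in place of the raw query vector
```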
Thanks, that sounds interesting, and not dissimilar from some of the query expansion techniques. But in my case (open source, zero budget) I'm doing (slow) CPU inference, so an LLM in the query chain isn't really viable. As it is, there is a near-instant "Source: [url]" returned by the vector search, followed by the LLM-generated "answer" (quite some time) later. So I think the next steps will be "traditional" techniques such as re-ranking and hybrid search, in line with the original "Build a search engine, not a vector DB" article.
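For the hybrid search step, one cheap trick that runs entirely on CPU is reciprocal rank fusion over the BM25 and vector rankings; here's a sketch using the rank_bm25 package (the docs and the vector ranking are stand-ins):

```python
# Sketch of hybrid search via reciprocal rank fusion (RRF), merging a
# BM25 ranking (rank_bm25 package) with an existing vector ranking.
# No LLM needed, and the two score scales never have to be compared.
import numpy as np
from rank_bm25 import BM25Okapi

def rrf_merge(rankings, k=60):
    # Each ranking is a list of doc ids, best first. RRF scores a doc by
    # sum(1 / (k + rank)) across rankings; k=60 is the common default.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

docs = [
    "Ben Nevis is 1345 metres high.",
    "Benjamin was born in 1724.",
    "The Munros are mountains in Scotland.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_ranking = list(np.argsort(bm25.get_scores("how high is ben nevis".split()))[::-1])
vector_ranking = [1, 0, 2]  # stand-in for the order your vector search returned

print([docs[i] for i in rrf_merge([bm25_ranking, vector_ranking])])
```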
Lucene supports decompounding and stemming (https://core.ac.uk/reader/154370300). Depending on the language, decompounding can be very important or of little import; Germanic languages should probably have decompounding enabled.
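The stemming half is easy to play with in Python via NLTK's Snowball stemmer (decompounding itself normally lives in the Lucene/Solr analyzer chain rather than application code):

```python
# German stemming via NLTK's Snowball stemmer (pip install nltk).
# Note a stemmer alone won't split compounds like "Dampfschifffahrt";
# that's exactly the gap decompounding fills in the analyzer chain.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")
for word in ["Häuser", "gelaufen", "Dampfschifffahrt"]:
    print(word, "->", stemmer.stem(word))
```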
Ignoring the disclosure etiquette here, then making an irrelevant rebuttal about relevance when the point was disclosure, then getting snarky with the person who tried to helpfully point it out?
I have no opinion on your products or your post, but some % of people steer away from companies for such things.
My views are my own and as such I do not disclose my employment or otherwise on here.
I did think twice about posting it, as I don't usually, but it's relevant and it might be helpful, so why not? If you don't like it, thanks for the downvote.
Wow. I learned some stuff about etiquette on HN today.
I'll support you, mnd999. I don't work for a graph DB company. We don't use graph DBs, but I'm considering it. Graph DBs are a legitimate source to feed data into your RAG system. Our RAG system currently uses hybrid search: lexical and semantic. We need to expand our sources, too. I would like to see us use LLMs to rephrase our content (we have a lot of code) and index on that. I think we should build a KG on content quality (we have millions of docs) and sort out the things no one likes.
I also think a KG on "learning journeys" would be valuable, but really difficult.