The biggest one is that it's hard to get "zero matches" from an embeddings datab...

kgeist · 2024-07-25T05:41:27 1721886087

>but because embeddings search orders by similarity score it will ALWAYS return results, really scraping the bottom of the barrel if it has to

Why not have a similarity threshold? Say, if the similarity score is below 0.7, do not accept the search result.
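A minimal sketch of what I mean, assuming cosine similarity over plain Python lists. The names `cosine_similarity` and `search_with_threshold` and the in-memory `docs` dict are made up for illustration and are not any particular vector database's API:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_with_threshold(query_vec, docs, threshold=0.7):
    """Return (doc_id, score) pairs scoring at least `threshold`, best first.

    `docs` is a hypothetical in-memory index: doc_id -> embedding vector.
    Unlike a plain top-k search, the result can be an empty list.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The 0.7 default is just the number from my example, not a recommended value.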

simonw · 2024-07-25T06:24:22 1721888662

It turns out picking that threshold is extremely difficult - I've tried! The right value seems to differ from one search to the next, so picking e.g. 0.7 as a fixed value isn't actually as useful as you would expect.

zmccormick7 · 2024-07-25T15:08:31 1721920111

Agreed that thresholds don't work when applied to the cosine similarity of embeddings. But I have found that the similarity scores returned by high-quality rerankers, especially Cohere, are consistent and meaningful enough that using a threshold works well there.

kgeist · 2024-07-29T10:04:04 1722247444

I use a similarity threshold (to remove absolutely irrelevant results) and then use a reranker to get the top N.
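Roughly like this, as a sketch rather than my actual code. `rerank_score` is a hypothetical callable standing in for whatever reranker you use (it is not a real library function), and the `min_similarity` and `top_n` defaults are placeholders:

```python
def filter_then_rerank(query, candidates, rerank_score, min_similarity=0.7, top_n=3):
    """Two-stage retrieval: coarse similarity cutoff, then rerank the survivors.

    candidates: list of (text, similarity) pairs from the embeddings search.
    rerank_score: hypothetical callable (query, text) -> float, higher is better.
    """
    # Stage 1: drop results the embedding search already scored as irrelevant.
    survivors = [text for text, similarity in candidates if similarity >= min_similarity]
    # Stage 2: order the remaining results by reranker score and keep the top N.
    ranked = sorted(survivors, key=lambda text: rerank_score(query, text), reverse=True)
    return ranked[:top_n]
```

The threshold here only has to be loose enough to cut obvious junk, since the reranker decides the final order.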

jairuhme · 2024-07-25T15:11:54 1721920314

I'll add to what the other commenter noted: sometimes the difference between results gets very granular (e.g. .65789 vs .65788), so deciding where that threshold should be is a little trickier.