The problem with BM25 in a database is that it can have unexpected outcomes for some common use cases.
Take multi-tenancy.
What if user 1 has many more documents than user 2 and uses "new orleans" a lot, while user 2 does not. Then user 2 runs a search.
The db will run the FTS query first and filter by tenant afterwards. So user 1's documents will bias the results user 2 sees - perhaps enough for user 2 to discover which words are in user 1's corpus.
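To make that bias concrete, here is a minimal sketch using the standard Lucene-style BM25 IDF formula; the corpus sizes are made-up numbers, not anything from the article:

```python
import math

def bm25_idf(total_docs: int, docs_containing_term: int) -> float:
    # Standard BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (total_docs - docs_containing_term + 0.5) / (docs_containing_term + 0.5))

# Hypothetical corpus: user 1 has 9,000 docs, 3,000 of which mention "new orleans";
# user 2 has 1,000 docs, only 5 of which do.
shared_idf  = bm25_idf(total_docs=10_000, docs_containing_term=3_005)  # stats over both tenants
tenant2_idf = bm25_idf(total_docs=1_000,  docs_containing_term=5)      # stats over user 2 only

print(f"IDF with shared stats:   {shared_idf:.2f}")   # ~1.20 -- looks like a common, low-signal term
print(f"IDF with tenant-2 stats: {tenant2_idf:.2f}")  # ~5.20 -- actually rare and high-signal for user 2
```

With shared statistics, "new orleans" barely contributes to user 2's ranking, even though it is a rare, discriminating term within user 2's own corpus.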
Really, use some proven and mature tech for your data. Don't jump on every hype train. In the case of SQLite, self-hosting is especially easy using https://litestream.io/.
AWS offers a solution for this: OpenSearch Serverless collections. A dedicated collection can be spun up for each user/tenant, e.g. «thing1-user345», «thing2-user123», instead of co-mingling «user123» and «user345» in the same collection index. It increases the overall overhead, but with the infrastructure repeatability that IaC provides it is easy to roll out, and «thing1-user123» and «thing2-user123» can be processed/consumed as discrete datasets.
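As a rough illustration of the per-tenant provisioning (a boto3 sketch rather than a full IaC template; the naming scheme is hypothetical and the security/encryption policy setup that OpenSearch Serverless requires beforehand is omitted):

```python
import boto3

# Sketch: one dedicated OpenSearch Serverless collection per dataset/tenant pair,
# e.g. "thing1-user345", instead of co-mingling tenants in a single index.
aoss = boto3.client("opensearchserverless", region_name="us-east-1")

def provision_tenant_collection(dataset: str, tenant_id: str) -> str:
    name = f"{dataset}-{tenant_id}"  # e.g. "thing1-user345"
    # Note: a matching encryption security policy must already exist,
    # otherwise create_collection will be rejected.
    resp = aoss.create_collection(name=name, type="SEARCH")
    return resp["createCollectionDetail"]["id"]

# provision_tenant_collection("thing1", "user345")
```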
Tangentially related: I have been finding multi-tenancy to be more of a liability at worst, or a nuisance at best, due to increasingly frequent customer demands around data sovereignty, data confidentiality (each tenant wants their own, non-shared cryptography keys etc.), data privacy, compliance, the right to be forgotten/GDPR, and similar requirements. For anything more complex than an anonymous online flower shop, it is simpler to partition the whole thing off – for each customer/tenant.
Any idea how “OpenSearch serverless collections” are implemented? I’m guessing that a “collection” is basically an ElasticSearch index, and “serverless” refers to some method of serializing/loading it on-demand with some cold start tradeoffs?
Basically OpenSearch and Elasticsearch both offer serverless modes now. They work differently because all of this was developed post-fork, but they do similar things. The Elastic implementation is in early access right now, so it is not released yet; I saw a demo of it at one of their meetups in June. I think OpenSearch actually moved a bit faster than Elastic here, after Elastic announced a few years ago that they were working on it.
The aim with both is to not have users worry about cluster sizes or cluster management any more, which from an operational point of view is a huge gain because most companies using this stuff aren't very good at doing that properly and the consequences are poor performance, outages, and even data loss when data volume grows.
Serverless essentially decouples indexing and querying traffic. All the nodes are transient and use their local disk only as a cache. Data at rest lives in S3, which becomes the single source of truth. So if a new node comes up, it simply loads its state from there and doesn't have to coordinate with other nodes. There's no possibility for a cluster to go red any more either: if a node goes down, it just gets replaced. Basically this makes use of the fact that Lucene index segments are immutable after they are written. There's a cleanup process running in the background that merges segments, but that basically just means you get a new file that then needs to be loaded by query nodes. I'm not 100% sure how write nodes coordinate segment creation and management, but I assume it involves some kind of queueing.
So you get horizontal scalability for both reads and writes and you no longer have to worry about managing cluster state.
The tradeoff is somewhat increased latency before query nodes can see your incoming data, because it has to hit S3 as part of the indexing process before read nodes can pick up new segments. Think multiple seconds before new data becomes visible. Both solutions are best suited for time-series type use cases, but they can also support regular search use cases. For any use case where the same documents get updated regularly, or where reading your own writes matters, things are not going to be great.
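A toy model of the read path described above (not the actual OpenSearch/Elastic implementation, just an illustration of "S3 is the source of truth, local disk is only a cache", which is also where the seconds-level visibility lag comes from; bucket and paths are hypothetical):

```python
import os
import boto3

BUCKET = "search-segments"         # hypothetical bucket holding immutable segment files
CACHE_DIR = "/var/cache/segments"  # node-local disk, purely a cache

s3 = boto3.client("s3")

def refresh_local_segments(index_prefix: str) -> list[str]:
    """Bring this node's local cache up to date with the segments stored in S3.

    Because segments are immutable once written, a file already in the cache never
    needs re-downloading, and a fresh node can simply download everything without
    coordinating with other nodes.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    local = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=index_prefix):
        for obj in page.get("Contents", []):
            path = os.path.join(CACHE_DIR, os.path.basename(obj["Key"]))
            if not os.path.exists(path):  # immutable => any cached copy is still valid
                s3.download_file(BUCKET, obj["Key"], path)
            local.append(path)
    return local

# A query node would call this periodically; new data only becomes visible after the
# indexing side has flushed a segment to S3 and a refresh like this has picked it up.
```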
Amazon's implementation of course probably leverages a lot of their cloud stuff. Elastic's implementation will eventually be available in their cloud solution on all supported cloud providers. Self-hosting this is going to be challenging with either solution, so that's another tradeoff.
Thanks for the detailed response and insight. This is a great example of when meetups and in-person networking/collaboration can help you stay ahead of the curve.
It does sound like the solution glosses over some cold start problems that will surface with increasing regularity for more fragmented indexes. For example if you have one index per tenant (imagine GitHub search has one public index and then additionally one index per authenticated user containing only repos they are authorized to read), then each user will experience a cold start on their first authenticated search.
I bet these tradeoffs are not so bad, and in practice, are worth the savings. But I will be curious to follow the developments here and to see the limitations more clearly quantified.
(Also this doesn’t address the writes but I’m sure that’s solvable.)
> For anything more complex than an anonymous online flower shop, it is simpler to partition the whole thing off – for each customer/tenant.
Is this really a viable approach at the scale of B2B SaaS like Salesforce (or, contextually, Algolia)? They would end up with literally 10s of 1000s of DBs. That is surely cost-prohibitive.
Doesn't that affect BM25 with a solution like Elasticsearch as well? Or is that smart enough to apply filters to the corpus statistics before calculating relevance?
You could solve that in SQLite by giving each user their own separate FTS table - not impossibly complex, but would grow increasingly messy if you have 10s of thousands of users.
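A minimal sketch of that per-user-table approach with SQLite's FTS5 (assuming your SQLite build has FTS5 enabled; the table naming scheme is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect("app.db")

def fts_table(user_id: int) -> str:
    # One FTS5 table per user, so BM25 statistics only ever cover that user's corpus.
    return f"docs_user_{int(user_id)}"  # int() guards the interpolated table name

def add_document(user_id: int, body: str) -> None:
    table = fts_table(user_id)
    conn.execute(f"CREATE VIRTUAL TABLE IF NOT EXISTS {table} USING fts5(body)")
    conn.execute(f"INSERT INTO {table}(body) VALUES (?)", (body,))
    conn.commit()

def search(user_id: int, query: str):
    table = fts_table(user_id)
    # FTS5's bm25() returns negative scores where lower = better, so ascending order
    # puts the best matches first.
    return conn.execute(
        f"SELECT body, bm25({table}) AS score FROM {table} WHERE {table} MATCH ? ORDER BY score",
        (query,),
    ).fetchall()

add_document(2, "hotels in new orleans")
print(search(2, '"new orleans"'))
```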
One way to address this in Elasticsearch would be to put each customer's documents in their own index. Other than that, as far as I can tell it's not smart enough to apply filters first.
I wrote this blog :). Good to see it still getting use.
FYI for folks just skimming this, shards can affect scoring, but they don't have to. 2 mitigations:
1. The default in Elasticsearch has been 1 shard per index for a while, and many people (if not most) probably don't need more than 1 shard
2. You can do a dfs_query_then_fetch query, which adds a small amount of latency, but solves this problem
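For reference, mitigation 2 is just a search-type parameter on the query. A sketch against the REST API (index name, field, and host are placeholders): dfs_query_then_fetch has Elasticsearch collect global term statistics from all shards before scoring, at the cost of an extra round trip per query.

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-index/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json={"query": {"match": {"body": "new orleans"}}},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```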
The fundamental tenet here is accurate: any time you want to break up term statistics (e.g. if you want each user to experience relevance based on their own term stats), then yes, you need a separate index for that. I'd say that's largely not all that common in practice though.
A more common problem that warrants splitting indices is when you have mixed-language content: the term "LA" in Spanish content adds very little information, while it adds a reasonable amount of information in most English-language documents (where it can mean Los Angeles). If you mix both together, it can pollute term statistics for both. Considering how your segments of users and data will affect scoring is absolutely important as a general rule though, and part of why I'm super excited to be working on neural retrieval now.
Thanks for the clarifications! I've been spending the last 3 weeks deep in the weeds of TF/IDF scoring and was about to give up on Elastic Search when this got posted. The article has been eye opening!!!
It’s why you need “enough” data. Generally the distribution of terms follows a power law: there’s a set of common terms and a very long tail of other terms. This will be the case across every shard if you do in fact have enough data per shard. It becomes a factor in certain distributed aggregations as well. YMMV of course.
In general you trade accuracy for speed. You tend to get a lot of speed for just a small amount of accuracy sacrifice.
The multi-tenancy problem actually already applies to almost every multitenant database in Postgres, Oracle and MySQL. Whether or not they use FTS. You just might not notice its impact in your case if query performance is "good enough".
War story time.
A while ago, I worked on a big SQL (Oracle) database: millions of rows per tenant, 10ks of tenants. Tenant data was highly dissimilar between most tenants in all sorts of ways, and the distribution of data "shape" between tenants wasn't that spiky. Tenants were all over the place, and the common case wasn't that common. Tenants (multitenant tables and/or whole separate schemas) routinely got migrated between physical database hosts by hand (GoldenGate didn't exist at the beginning of this company, and was too expensive by the end).
A whole host of problems in this environment cropped up because, inside the DB, indexes and cache-like structures were shared between tenants. An index histogram for a DATETIME column of a multitenant table might indicate that 75% of the dates were in 2016. But it turns out that most of that 75% were the rows owned by a few big tenants, and the other 1000+ tenants on that database were overwhelmingly unlikely to have 2016 dates. As a result, query plans often sucked for queries that filtered on that date.
Breaking tables up by tenant didn't help: query plans, too, were cached by the DB. The issue was that they were cached by query text. So on a database with lots of separate (identical) logical schemas, a query plan would get built for some query when it first ran against a schema with one set of index histograms, and then another schema with totally different histograms would run the same query, pull the now-inappropriate plan out of the cache, and do something dumb. This was a crappy family of bugs, in that it was happening all the time without major impact (a schema that was tiny overall is not going to cause customer/stability problems by running dumb query plans on tiny data), but it cropped up unpredictably with much larger impact when customers loaded large amounts of data and/or rare queries happened to run from cache against a huge customer's schema.
The solve for the query plan issue? Prefix each query with the customer's ID in a comment, because the plan cacher was too dumb (or intended for this use case, who knows?) to strip comments. The SQL keys in the plan cache would end up looking like a zillion variants of "/* CUSTOMER-1A2BFC10 */ SELECT ....". I imagine this trick is commonplace, but folks at that gig felt clever for finding it out back in the bad old days.
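The trick itself is just a string prefix; a sketch of the sort of helper involved (names and the example query are hypothetical, not the original system's code):

```python
def scope_query_to_tenant(sql: str, customer_id: str) -> str:
    """Prefix the SQL with a per-customer comment so the plan cache, which keys on raw
    query text and does not strip comments, keeps a separate plan per tenant."""
    return f"/* CUSTOMER-{customer_id} */ {sql}"

# cursor.execute(scope_query_to_tenant("SELECT * FROM orders WHERE created_at > :dt", "1A2BFC10"), dt=cutoff)
```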
All of which is to say: database multitenancy poses problems for far more than search. That's not an indictment of multitenancy in general, but does teach the valuable lesson that abandoning multitenancy, even as wacky and inefficient as that seems, should be considered as a first-class tool in the toolbox of solutions here. Database-on-demand solutions (e.g. things like neon.tech, or the ability to swiftly provision and detach an AWS Aurora replica to run queries, potentially removing everything but one tenant's data) are becoming increasingly popular, and those might reduce the pain of tenant-per-instance-ifying database layout.