The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
It's a solvable problem, but it's always a lot of work to build and to keep working as the database schema changes in the future.
This is why I really like PostgreSQL FTS: it's good enough that for many projects I don't need to use an external search engine any more - and it's way easier to keep the search index up-to-date than if the index lives in a separate system.
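For anyone who hasn't set it up before, the core of it is only a few lines. A minimal sketch, assuming Postgres 12+ for generated columns (the documents/title/body names are just placeholders):

    -- Keep a tsvector column that Postgres maintains for you:
    ALTER TABLE documents
        ADD COLUMN search tsvector
        GENERATED ALWAYS AS (
            to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
        ) STORED;

    CREATE INDEX documents_search_idx ON documents USING gin (search);

    -- The index follows every INSERT/UPDATE automatically:
    SELECT id, title
    FROM documents
    WHERE search @@ websearch_to_tsquery('english', 'postgres "full text" -solr');

On older versions you can get the same effect with a trigger instead of a generated column.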
Have you ever considered using PGroonga (https://pgroonga.github.io/) to further extend FTS in postgres? They've got a drilldown feature that is basically their implementation of facets. Always wondered what your thoughts on that might be?
I've got a Django setup with Postgres and facets implemented as you detailed in your blog (thank you for all the amazing info over the years, btw!). Now I'm looking to extend it with PGroonga, and I'm also considering Hasura, to see how much further I can go while still using Postgres as my single source of truth, without needing to keep any other service in sync.
Yeah, ZomboDB looks interesting. I guess the main difference is that Groonga is an embedded database, more like SQLite, whereas with Elasticsearch you still have to spin up an instance and be responsible for it.
I guess you're still responsible for the Groonga DB on the Postgres server too, but theoretically it seems more secure than spinning up another service on a different server.
I retired a dedicated Solr instance and replaced it with PG full text search in a couple of days. The sync inconsistency was creating a ton of support inquiries.
Never missed it. It’s the feature that hooked me on PG.
> ZomboDB brings powerful text-search and analytics features to Postgres by using Elasticsearch as an index type. Its comprehensive query language and SQL functions enable new and creative ways to query your relational data.
> From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.
I find other things also hard in search engines: dealing with the plethora of human languages and all the requirements we may have for processing them. A mature solution like ES is therefore almost a must in the more demanding cases.
> The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
That makes sense and I'm curious to know if you're referring to the challenges that arise from attempting to implement a distributed heterogeneous transaction that includes system of record (say RDBMS) and derived data (search engine), or did you think about something that's more specific to search engines?
Isn't it as easy as updating the search index every time a change is made to the database? That's what we did back in 2012, though the website only had 2 million uniques per month.
It is some work, and definitely not as simple as having the search index embedded in the database, but then again almost everybody has something like Celery running for async task execution anyway.
The difficulty is making this reliable by ensuring that the FTS index doesn't drift away from its source of truth (a SQL database, for example). What if your process crashes after writing to the SQL database but before writing to the FTS engine? You need some kind of Change Data Capture: track the changes and apply them to the FTS index.
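One low-tech way to get that reliability is a transactional outbox: record the change intent in the same transaction as the write, then have a worker drain the queue into the FTS engine. A rough sketch, assuming Postgres 11+ trigger syntax; the products table and all the names here are invented for illustration:

    CREATE TABLE search_outbox (
        id         bigserial PRIMARY KEY,
        table_name text NOT NULL,
        row_id     bigint NOT NULL,
        op         text NOT NULL,  -- 'insert' / 'update' / 'delete'
        created_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION queue_search_update() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO search_outbox (table_name, row_id, op)
            VALUES (TG_TABLE_NAME, OLD.id, 'delete');
            RETURN OLD;
        ELSE
            INSERT INTO search_outbox (table_name, row_id, op)
            VALUES (TG_TABLE_NAME, NEW.id, lower(TG_OP));
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER products_search_outbox
        AFTER INSERT OR UPDATE OR DELETE ON products
        FOR EACH ROW EXECUTE FUNCTION queue_search_update();

Because the outbox row commits (or rolls back) together with the data itself, the worker that pushes rows to the FTS engine can crash and retry without the two stores silently drifting apart.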
A project I was working on wanted to do this, the basic idea was use Oracle (not my choice) as a source of truth and for OLTP and immediately blow any changes off to SOLR replicas for fulltext searching (for scalability/etc). Big-picture the idea was (imo, never set out directly) that they wanted clients to optimistically search on (and perhaps work on) the data from SOLR but "they also wanted the DB as an authoritative source".
We had it set up so that SOLR would basically run the queries, and return only the UUIDs of the database objects you wanted to return (plus of course any faceting data/etc). Then we would turn around and load the actual results from the query from the RDBMS - it's the source of truth and the data is freshest. Conveniently, the same way you generate the JSON for SOLR also works fantastically for generating the JSON for returning the data - this can be the same function, so there will never be a mismatch between what's generated via DB-JSON and what's generated via ORM-JSON.

It also means basically the entire backend layer can be pretty much generic - at the most basic level, you are just hammering off a query to SOLR and getting some UUIDs back, then doing a batch-load from the DB and getting some (uuid, record_version, json_body) tuples back; there is nothing "object-specific" in the backend object handler per se, at least in the search/fetch code. Inserts/updates still need to work directly on the relational model, but all your search/retrieval code is completely generic (beyond like, table/view names).

Furthermore, there is no "ORM collection spam", where the ORM fires off individual queries to get collections for each row in the batch - since the object is already just a JSON string when it comes back, loading 250 rows will have (theoretically) exactly 250 results; there is no combinatorial blowout from joins, or hammering the DB with 250 additional queries (which perhaps fire off their own queries). One string per object, done.
I had an intelligent batch-loader built out with temporary tables/etc so it wasn't even "load 200 rows one at a time"... you inserted the rows you want to load, inner join your temporary table (which is still completely generic and nothing object-specific) and you get 200 results in a single query, fast as hell, without destroying your query planner with IN(:1, :2...) since every query has identical literal sqltext.
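Roughly this shape, sketched here in Postgres syntax even though we were on Oracle, with invented names:

    CREATE TEMPORARY TABLE wanted_ids (id uuid PRIMARY KEY) ON COMMIT DROP;

    -- batch-insert the UUIDs SOLR returned...
    INSERT INTO wanted_ids (id) VALUES (:uuid1), (:uuid2) /* , ... */;

    -- ...then one query with identical sqltext no matter how many ids there are:
    SELECT d.id, d.record_version, d.json_body
    FROM document_json d
    JOIN wanted_ids w ON w.id = d.id;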
Instead of directly poking specific records into SOLR, the backend microservice would tell SOLR "there are new records to pull" and SOLR would do what they call a "delta-import" and actually go out to the database itself over JDBC and do an incremental import of all rows changed since the last import. Since SOLR itself could ensure that only one import was running at a time, it would implicitly "batch" large sets of updates into a single query rather than hammering out a bunch of individual ones.
There's always tradeoffs though - the biggest one that concerned me is that since the DB is what's generating the JSON, you have essentially offloaded the entire string-building onto your DB engine. Yeah C/C++ string-building is a lot faster than Java, but you can't scale-out your database like you can a backend service or SOLR replica. Best case you can use microservices and split your database into a couple pieces, but that has problems too of course. It's possible that maybe Postgres has some read-replica type stuff that might be able to scale that out, but, that wasn't really going to be the issue probably. And if you move the JSON-ifying to an external source, you lose the nice performance benefits in terms of one-row-per-result, now you have the collection/combinatorial problems again.
(I wrote an intelligent object-caching mechanism as well, so it wouldn't rebuild every row every time... it'd only pull the rows that it didn't have in cache and would mix the cache+DB results to build the result list while minimizing DB load. So it's not all as bad as it sounds... ORM blowout overloads the database too and you fix that by caching too.)
One inherent problem with this design is desyncs - this isn't specific to my approach but just a problem of having a two-phase operation with things happening in the meantime. There are no exclusive locks on SOLR while you're doing an update in the DB. I did add code to watch for this as much as possible and could accept it/log it, retry a couple times (will probably fail again on a hot table), try a full rebuild of the SOLR collection, etc, according to the specific situation. But while you can detect missing rows (in SOLR but not in the DB) and stale rows (DB/SOLR have different versions), you can never detect the last kind... it's in the DB but SOLR doesn't know about it. If SOLR can't search it for whatever reason (missing/stale/etc) and you never ask the DB for it, then it doesn't exist.
As you note, it also means that instead of building Java models/entities you are now (also) building JSON models inside views, and our DBAs were green and already heavily overloaded, so this was not a good thing in terms of team scaling. It definitely pushed the workload into some weird places in that sense. In the long term what I'd have wanted to do is go further into the "ORM building exercise" here and automatically build matching view schemas from a document. A lot of these really map down to very similar Jackson-object-mapper or "hbm2ddl.auto" style code-generation tasks... complex to write the ORM and tedious to write your bindings but not actually all that difficult. A list<T> at the java level means you need a (whatever) JSON_LIST_AGG(TABLE_T, ...) element with that field name in the view, etc. Collections do suck in RDBMS, JSON lists of uuids or lists of populated (or partially populated) child elements make JDBC behave a lot better.
Another oddity is that now you basically have two (or more) "facades" onto the same database objects... and this messes with "default" caching behavior. After all, writes go into the database on the ORM relational entity, but they come out on the JSON-entity, and you need to tell the cache about that. It's not hard, but you have to write a custom adapter. Also, if you do a batch-insert (we of course used uuids generated in the application layer) or batch-update, you then have to do a batch-select immediately afterwards to retrieve the JSON-ified record, otherwise there's nothing to put into the cache or return to the client, since the application layer itself never has a JSON-serialization mapping (that all lives in the DB).
(In the long term the direction I was trying to point us was that rather than having JSON views just be a straightforward JSONization of a given table/object we could maintain different JSON-builder-views for different SOLR views servicing different types of services if needed... I kinda wonder how far you'll get on a single representation.)
That project was a dumpster fire and after I left I think they've pretty much thrown all this out. Easier to just have everything in DB (not even 21c - no native document support lol) be text and have the application layer do all the JSONifying.
But I'm still proud of turning some pretty nonsensical design requirements into a reasonably cohesive rube-goldberg machine. It's not what I'd have designed, but we had some, uh, overpromising and mismanagement involved, and it was a reasonable attempt at implementing the idea that was sold to the client imo. None of us had any experience in this stack and we were all trying to feel our way through it.
Ironically a bunch of this "full/incremental resync" stuff also ended up being a nice mechanism to bootstrap the SOLR cluster at startup (a full update instead of a delta) and for catching config errors. SOLR and the DB must stay in sync, so, SOLR importer is pointed at the wrong DB (according to the SOLR instance's JDBC URL)? That's a nope, application won't deploy, 503 Fix Your Shit.
(in hindsight I wonder if we'd gone postgres (I did argue for it, but that one was actually a legit understandable business decision...) if we could have done a FDW and gone with a "push" model by having Postgres publish its changes right to SOLR...)
Sure, that's where the hard computer science is - but if you use a tool like Elasticsearch other people have spent decades solving those problems for you already.
I was a bit disappointed by some limitations back then when I tried it for a project of mine. When searching phrases where ordering matters, phraseto_tsquery() does not quite work for larger documents, as the tsvector position values are quite limited: https://www.postgresql.org/docs/14/textsearch-limitations.ht... Here I had much better success with SQLite's FTS implementation.
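For anyone who hasn't dug into it, those positions are what phrase search is built on; a quick illustration with standard functions (and, as I understand the linked page, positions get capped at 16383, so phrase matching degrades past that point in a big document):

    SELECT to_tsvector('english', 'quick brown fox');
    -- 'brown':2 'fox':3 'quick':1

    SELECT phraseto_tsquery('english', 'quick brown fox');
    -- 'quick' <-> 'brown' <-> 'fox'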
Good link, thanks. Have to say we're moving to pgSearch after implementing SOLR alongside a PG/Rails backend. Makes the stack simpler, fewer components to worry about (fewer headaches from gem versioning dependencies with SOLR). A lot to be said for it after reading through what's available now...
An alternative is to just layer the FTS on top of vanilla SQL to get the extra stuff (this is what I do for my eCommerce backend), so it's pretty simple to have something like this (a sketch; the table and column names are just illustrative):

    SELECT ...
    FROM catalog
    -- Get the FTS matches
    WHERE id IN (
        SELECT id FROM catalog
        WHERE search_vector @@ websearch_to_tsquery('english', :query)
    )
    -- The relevance is hardcoded (or maybe kept in another table that stores the
    -- rankings), based on the source: Products, Inventory, Invoices, ...
    ORDER BY source_rank, name;
I found it's much easier and more predictable if I code the "rankings" based on the business logic instead of letting the FTS engine guess. You can store that stuff as normal columns, or use the "sources" (i.e. products, inventory) as a way to know what could be more important to pull first.
This has the nice property that our search results are very good, and better yet: they never return nonsensical stuff! (like searching for an apple in the store and getting a blog post!)
I think we're talking about slightly different things. It doesn't matter what the variables are (your business logic) or some other variables.
Given a certain variable, tf-idf/BM25 will order by relevance and not just by match. So it answers questions like: what if the name match was off by two characters and the inventory number is less than 200?
tf-idf does not tell you what to order by... but it takes care of all the edge cases of ordering.
Now, if you're not using text matching anywhere and only using business variables... then this entire thread is not for you. But FTS and Lucene attack full-text search primarily, and that's where the relevance vs. ordering discussion comes from.
This has been my understanding of the state of Postgres full-text search. It's great if your search requirements are fairly vanilla, but I haven't seen any solutions for more advanced search needs, such as boosting, relevance, scoring, etc.
The ts_rank functions use the term frequency within that document, not a global term frequency (which is why you need a separate index, like Elasticsearch keeps).
This is important, because if a word is too common, it's considered less significant for a document match. When we calculate IDF, it will be very low for the most frequently occurring words, such as stop words (“is” is present in almost all of the documents, so tf-idf will give a very low value to that word).
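For reference, the textbook formulation (nothing Postgres-specific; here N is the total number of documents and df(t) is how many of them contain term t):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
    \qquad
    \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}

Computing df(t) needs exactly the global, corpus-wide statistics that a per-document ts_rank call never sees.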
Generated columns are not supported by older PG versions. For example, if you used AWS AuroraDB with Postgres compatibility, it didn't support generated columns until a few years ago.
In my experience using PG FTS, there are often some preprocessing tasks you might want to do before storing the index. One common issue is that the generated index has to fit inside the column size limit, so if your user provides a long text field with a lot of unique words, the index will be very large and might lead to a database error. Here you can check for that on the fly and truncate the result.
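A rough sketch of that truncate-on-the-fly idea using a trigger (the posts/body/search_tsv names and the 100k-character cutoff are all placeholders, tune to taste):

    CREATE OR REPLACE FUNCTION posts_tsv_update() RETURNS trigger AS $$
    BEGIN
        -- truncate the raw text so the resulting tsvector stays under its size limit
        NEW.search_tsv := to_tsvector('english', left(coalesce(NEW.body, ''), 100000));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER posts_tsv_trigger
        BEFORE INSERT OR UPDATE ON posts
        FOR EACH ROW EXECUTE FUNCTION posts_tsv_update();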
Not sure it would apply all that often for search indices, but joins are the usual reason I can't use generated columns when I want to; trigger functions (obviously) allow any query.
I used SQLite's built-in fuzzy search and it is surprisingly powerful. I used it for full-text search over a Magic: The Gathering card database and it was amazing.
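If anyone wants to try it, SQLite's FTS5 module is the usual entry point; a toy version of that kind of setup looks something like this (schema invented for illustration):

    CREATE VIRTUAL TABLE cards USING fts5(name, type_line, oracle_text);

    INSERT INTO cards VALUES ('Lightning Bolt', 'Instant',
                              'Lightning Bolt deals 3 damage to any target.');

    SELECT name, snippet(cards, 2, '[', ']', '...', 8) AS hit
    FROM cards
    WHERE cards MATCH 'damage'
    ORDER BY rank;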
As far as I can tell, Lucene is the library that powers the main search-related features of Elastic Search i.e. inverted indices, stemming, tokenization, etc.
Elastic Search, Amazon OpenSearch, Solr, etc. just wrapped Lucene up in a much easier-to-use HTTP/REST API and then added bells and whistles like a coherent DSL, scripting, authentication and authorization, jobs, a bulk API, etc. Then they built additional tools like Logstash and Kibana, which integrate using these APIs.
GP did say 'reimplement' (not making it sound trivial, like 'add a sqlite backend plugin') in fairness - i.e. in your terms 'why not implement the ES API for sqlite'.
More like "make a thing that's a bit like elasticsearch, but exposing sqlite query language to a sharded event store" -- not "make elasticsearch but use lucene" because elasticsearch is lucene, and reimplementing it via sqlite wouldn't give elasticsearch.
I'd call it elasticite and then (correctly!) be sued by both elastic and the sqlite orgs...
Because Lucene’s index format is highly optimized for information retrieval and SQLite’s isn’t. I love SQLite but Lucene is the sine qua non heart of ElasticSearch.
We built a listings site similar to eBay in Latin America after buying the operation from a European company: https://corotos.com
We made use of bson in the listings table for metadata, and FTS inside Postgres. Spanish dictionaries, weighting, white-listing of words, and including metadata inside the tsvectors have been awesome. Really quick search, you never have out-of-sync documents in another part of the stack, and you save a lot in storage.
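The tsvector side of it looks roughly like this (column names are illustrative, not our real schema):

    UPDATE listings
    SET search = setweight(to_tsvector('spanish', coalesce(title, '')), 'A')
              || setweight(to_tsvector('spanish', coalesce(description, '')), 'B')
              || setweight(to_tsvector('simple', coalesce(metadata_text, '')), 'D');

    SELECT id, ts_rank(search, q) AS rank
    FROM listings, websearch_to_tsquery('spanish', 'bicicleta aro 26') q
    WHERE search @@ q
    ORDER BY rank DESC;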
Metadata inside vectors sounds interesting. How do you use them? Because something like `id:234` would get split into two lexemes, making searching for it complicated.
Among other things in that earlier thread, there's some discussion of aspects of ZomboDB with its developer.
ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. (From the project's website[0]).
Lots more info in the README at the project's GitHub[1].
I am feeling real bad now, as I used a LIKE query as OP did (although in MySQL) to implement a custom search over two fields (Name & Description) in one of my personal projects. I thought it worked pretty well though, and I was so happy about it.
Don't confuse full-text search with word-part search. If you have a list of names, you're probably looking for the latter, where LIKE works fine and can be indexed with pg_trgm.
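A sketch of that route (the products/name names are just for illustration):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX products_name_trgm_idx ON products USING gin (name gin_trgm_ops);

    -- LIKE/ILIKE now uses the index even with a leading wildcard:
    SELECT id, name FROM products WHERE name ILIKE '%airpods%';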
slightly off-topic, but if the goal is to store vast amounts of protein and gene nucleotide sequences and allow wildcard searches (e.g., %CTAAACGGG%), would postgres be the recommended approach?
assume 200K proteins/genes with nucleotide sequences ranging from 20 base pairs to 2.3 million base pairs.
PostgreSQL will not perform well on genome data compared to specialized algorithms like BLAST and its alternatives. For exact search a suffix array should allow fast lookup, but DNA search usually requires approximate matches.
In theory you can support this search with a suffix tree [1] or n-gram index [2] (but you will need n > 3 because your alphabet is small).
I'm not sure what ready-made tools or libraries you might find for this - last time I checked there were a lot more papers than production-quality open source code.
Or you could always just try a fast RE engine like Hyperscan and see if it is fast enough for your use case.
I had some friends who worked on research of data structures specifically for genome sequences. No idea about the details, but they're in the same tune as this paper: https://hub.hku.hk/handle/10722/60628
(The first author was their supervisor; they published a series of similar computational biology papers that you might be interested in, so if you check his other publications you (GP) might find something interesting, or at least find an entry point to dig into references and citations that may result in something you'd be interested in -- that said, if you find any data structure useful you'll probably have to implement it yourself...)
Very convenient for me that this article came around: I'm weighing the pros and cons of PostgreSQL and Elasticsearch for full-text search for the business's use case.
Also consider full ngram search, so you can do client side stemming and use a single index for all languages. It's not out of the box, but installing the ngram extension is literally one line.
Mandatory mention of the RUM extension (https://github.com/postgrespro/rum) if this caught your eye. Lots of tutorials and conference presentations out there showcasing the advantages in terms of ranking, timestamps...
It's much better to use elastic (or solr) and keep your text indices out of the database. They have more features and you keep the load separate. This is important if you suddenly have to index a ton of new documents or even recreate your indices.
I used to keep text indices in Oracle and it would take days to sync them.
There's even a Postgresql Foreign Data Wrapper for elastic, so you can integrate them easily.
One project I witnessed used Postgresql solely as a middle man between the main database and Elastic.
I still think it's best to use something outside of the database, unless you don't have a large volume of docs, don't have complex searches, and don't have a lot of concurrent searches.
I read this article about 6 years ago and after an afternoon I added full-text search to the app I worked on at the time. 6 years later, it's still the only search they use. This small tool already built into postgres has been more than enough for us.
I've had similar luck with sqlite, of all things. Unless it's envisioned to handle huge user loads, sqlite does a pretty amazing job of FTS, all contained in a tiny binary that runs when I want and only when I want. I find it very useful for finding nuggets in titles and descriptions, spanning across thousands or millions of records. The index is often double or even triple the size of the database, but it's worth it for a query that returns in a heartbeat.
For static databases (especially where the content is not deleted, or rarely updated) consider using contentless tables or external content tables. The database size remains virtually the same, but you'll have immediate access to almost all FTS features.
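The external-content variant looks roughly like this (table and column names invented; a contentless table is the same idea with content=''):

    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT);

    CREATE VIRTUAL TABLE articles_fts USING fts5(
        title, body,
        content='articles', content_rowid='id'
    );

    -- with external content you keep the FTS table in step yourself
    -- (usually via triggers), e.g. after loading articles:
    INSERT INTO articles_fts (rowid, title, body)
        SELECT id, title, body FROM articles;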
I worked on a project where we needed to search a set of approximately 2.5M addresses and return the addresses' approximate coordinates. We already used Postgres for our other database needs, so slotting this into our existing stack took us all of a couple of days.
I love Postgres but lack of CJK full-text search is the main reason why I am going with Lucene/Solr for in-app search.
It's possible to make Postgres do amazing things (e.g. job queue with SKIP LOCKED!) but at what point do you draw the line and use the right tool for the job?
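(For anyone unfamiliar, the SKIP LOCKED trick is roughly this shape; the jobs table here is hypothetical:)

    BEGIN;
    SELECT id, payload
    FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED;
    -- ...do the work, then:
    UPDATE jobs SET status = 'done' WHERE id = $1;
    COMMIT;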
When the feature doesn't exist! For existing workflows/functionality, I think using a new tool for a feature that doesn't (and won't for some time) exist is a _perfectly_ good reason to introduce the tool. On top of that, you're looking at using something that is mature and well-used: Lucene/Solr.
If you wanted to be cheeky you could go ask your CTO if they would foot the bill for someone like 2nd Quadrant to write CJK full-text search into pgsql. ;)
I've done this for some features/issues in random OSS projects and if you feel someone else has the same problem I always feel paying money to have it implemented is well worth it.
In general, we have absolutely no problem with CJK in Postgres. Search is very application specific and some of our users want to do things like kana + kanji search. [1] I don't even know where you would start with that in Postgres.
We also need stuff like language detection and analysis of mixed language data.
A big advantage of going with a dedicated search tool is that it teaches you what you don't know about search, and it turned out that we knew pretty much nothing.
Since last year I've migrated virtually all serious content to SQLite, which handles most of my FTS needs flawlessly, or can be programmed to do so with little trouble. The only thing missing is a default 'websearch' feature like in PostgreSQL, but you can always write functions to add search syntax and faceted searches.
I wrote this tutorial on implementing faceted search with PostgreSQL and Django a while ago: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...