The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
It's a solvable problem, but it's always a lot of work to build and to keep working as the database schema changes in the future.
This is why I really like PostgreSQL FTS: it's good enough that for many projects I don't need to use an external search engine any more - and it's way easier to keep the search index up-to-date than if the index lives in a separate system.
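For anyone who hasn't set it up before, the core of it is only a few lines. A minimal sketch, assuming Postgres 12+ for generated columns (the documents/title/body names are just placeholders):

    -- Keep a tsvector column that Postgres maintains for you:
    ALTER TABLE documents
        ADD COLUMN search tsvector
        GENERATED ALWAYS AS (
            to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
        ) STORED;

    CREATE INDEX documents_search_idx ON documents USING gin (search);

    -- The index follows every INSERT/UPDATE automatically:
    SELECT id, title
    FROM documents
    WHERE search @@ websearch_to_tsquery('english', 'postgres "full text" -solr');

On older versions you can get the same effect with a trigger instead of a generated column.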
Have you ever considered using PGroonga (https://pgroonga.github.io/) to further extend FTS in postgres? They've got a drilldown feature that is basically their implementation of facets. Always wondered what your thoughts on that might be?
I've got a Django setup with Postgres and facets implemented as you detailed in your blog (thank you for all the amazing info over the years, btw!). Now I'm looking to extend it with PGroonga, and I'm also considering Hasura, to see how much further I can go while still using Postgres as my single source of truth, without needing to keep any other service in sync.
Yeah, ZomboDB looks interesting. I guess the main difference is that Groonga is an embedded database, more like SQLite, whereas with Elasticsearch you still have to spin up an instance and be responsible for it.
I guess you're still responsible for the Groonga DB on the Postgres server too, but theoretically it seems more secure than spinning up another service on a different server.
I retired a dedicated Solr instance and replaced it with PG full text search in a couple of days. The sync inconsistency was creating a ton of support inquiries.
Never missed it. It’s the feature that hooked me on PG.
> ZomboDB brings powerful text-search and analytics features to Postgres by using Elasticsearch as an index type. Its comprehensive query language and SQL functions enable new and creative ways to query your relational data.
> From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.
I find other things also hard in search engines: dealing with the plethora of human languages and all the requirements we may have for processing them. A mature solution like ES is therefore almost a must in the more demanding cases.
> The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
That makes sense and I'm curious to know if you're referring to the challenges that arise from attempting to implement a distributed heterogeneous transaction that includes system of record (say RDBMS) and derived data (search engine), or did you think about something that's more specific to search engines?
Isn't it as easy as updating the search index every time a change is made to the database? That's what we did back in 2012, though the website only had 2 million uniques per month.
It is some work, and definitely not as simple as having the search index embedded in the database, but then again almost everybody has something like Celery running for async task execution anyway.
The difficulty is making this reliable by ensuring that the FTS index doesn't drift away from its source of truth (a SQL database, for example). What if your process crashes after writing to the SQL database but before writing to the FTS engine? You need some kind of Change Data Capture: track the changes and apply them to the FTS index.
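One low-tech way to get that reliability is a transactional outbox: record the change intent in the same transaction as the write, then have a worker drain the queue into the FTS engine. A rough sketch, assuming Postgres 11+ trigger syntax; the products table and all the names here are invented for illustration:

    CREATE TABLE search_outbox (
        id         bigserial PRIMARY KEY,
        table_name text NOT NULL,
        row_id     bigint NOT NULL,
        op         text NOT NULL,  -- 'insert' / 'update' / 'delete'
        created_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION queue_search_update() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO search_outbox (table_name, row_id, op)
            VALUES (TG_TABLE_NAME, OLD.id, 'delete');
            RETURN OLD;
        ELSE
            INSERT INTO search_outbox (table_name, row_id, op)
            VALUES (TG_TABLE_NAME, NEW.id, lower(TG_OP));
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER products_search_outbox
        AFTER INSERT OR UPDATE OR DELETE ON products
        FOR EACH ROW EXECUTE FUNCTION queue_search_update();

Because the outbox row commits (or rolls back) together with the data itself, the worker that pushes rows to the FTS engine can crash and retry without the two stores silently drifting apart.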
A project I was working on wanted to do this, the basic idea was use Oracle (not my choice) as a source of truth and for OLTP and immediately blow any changes off to SOLR replicas for fulltext searching (for scalability/etc). Big-picture the idea was (imo, never set out directly) that they wanted clients to optimistically search on (and perhaps work on) the data from SOLR but "they also wanted the DB as an authoritative source".
We had it set up so that SOLR would basically run the queries, and return only the UUIDs of the database objects you wanted to return (plus of course any faceting data/etc). Then we would turn around and load the actual results from the query from the RDBMS - it's the source of truth and the data is freshest. Conveniently, the same way you generate the JSON for SOLR also works fantastically for generating the JSON for returning the data - this can be the same function, so there will never be a mismatch between what's generated via DB-JSON and what's generated via ORM-JSON.

It also means basically the entire backend layer can be pretty much generic - at the most basic level, you are just hammering off a query to SOLR and getting some UUIDs back, then doing a batch-load from the DB and getting some (uuid, record_version, json_body) tuples back; there is nothing "object-specific" in the backend object handler per se, at least in the search/fetch code. Inserts/updates still need to work directly on the relational model, but all your search/retrieval code is completely generic (beyond like, table/view names).

Furthermore, there is no "ORM collection spam", where the ORM fires off individual queries to get collections for each row in the batch - since the object is already just a JSON string when it comes back, loading 250 rows will have (theoretically) exactly 250 results; there is no combinatorial blowout from joins, or hammering the DB with 250 additional queries (which perhaps fire off their own queries). One string per object, done.
I had an intelligent batch-loader built out with temporary tables/etc so it wasn't even "load 200 rows one at a time"... you inserted the rows you want to load, inner join your temporary table (which is still completely generic and nothing object-specific) and you get 200 results in a single query, fast as hell, without destroying your query planner with IN(:1, :2...) since every query has identical literal sqltext.
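Roughly this shape, sketched here in Postgres syntax even though we were on Oracle, with invented names:

    CREATE TEMPORARY TABLE wanted_ids (id uuid PRIMARY KEY) ON COMMIT DROP;

    -- batch-insert the UUIDs SOLR returned...
    INSERT INTO wanted_ids (id) VALUES (:uuid1), (:uuid2) /* , ... */;

    -- ...then one query with identical sqltext no matter how many ids there are:
    SELECT d.id, d.record_version, d.json_body
    FROM document_json d
    JOIN wanted_ids w ON w.id = d.id;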
Instead of directly poking specific records into SOLR, the backend microservice would tell SOLR "there are new records to pull" and SOLR would do what they call a "delta-import" and actually go out to the database itself over JDBC and do an incremental import of all rows changed since the last import. Since SOLR itself could ensure that only one import was running at a time, it would implicitly "batch" large sets of updates into a single query rather than hammering out a bunch of individual ones.
There's always tradeoffs though - the biggest one that concerned me is that since the DB is what's generating the JSON, you have essentially offloaded the entire string-building onto your DB engine. Yeah C/C++ string-building is a lot faster than Java, but you can't scale-out your database like you can a backend service or SOLR replica. Best case you can use microservices and split your database into a couple pieces, but that has problems too of course. It's possible that maybe Postgres has some read-replica type stuff that might be able to scale that out, but, that wasn't really going to be the issue probably. And if you move the JSON-ifying to an external source, you lose the nice performance benefits in terms of one-row-per-result, now you have the collection/combinatorial problems again.
(I wrote an intelligent object-caching mechanism as well, so it wouldn't rebuild every row every time... it'd only pull the rows that it didn't have in cache and would mix the cache+DB results to build the result list while minimizing DB load. So it's not all as bad as it sounds... ORM blowout overloads the database too and you fix that by caching too.)
One inherent problem with this design is desyncs - this isn't specific to my approach but just a problem of having a two-phase operation with things happening in the meantime. There are no exclusive locks on SOLR while you're doing an update in the DB. I did add code to watch for this as much as possible and could accept it/log it, retry a couple times (will probably fail again on a hot table), try a full rebuild of the SOLR collection, etc, according to the specific situation. But while you can detect missing rows (in SOLR but not in the DB) and stale rows (DB/SOLR have different versions), you can never detect the last kind... it's in the DB but SOLR doesn't know about it. If SOLR can't search it for whatever reason (missing/stale/etc) and you never ask the DB for it, then it doesn't exist.
As you note, it also means that instead of building Java models/entities you are now (also) building JSON models inside views, and our DBAs were green and already heavily overloaded, so this was not a good thing in terms of team scaling. It definitely pushed the workload into some weird places in that sense. In the long term what I'd have wanted to do is go further into the "ORM building exercise" here and automatically build matching view schemas from a document. A lot of these really map down to very similar Jackson-object-mapper or "hbm2ddl.auto" style code-generation tasks... complex to write the ORM and tedious to write your bindings but not actually all that difficult. A list<T> at the java level means you need a (whatever) JSON_LIST_AGG(TABLE_T, ...) element with that field name in the view, etc. Collections do suck in RDBMS, JSON lists of uuids or lists of populated (or partially populated) child elements make JDBC behave a lot better.
Another oddity is that now you basically have two (or more) "facades" onto the same database objects... and this messes with "default" caching behavior. After all, writes go into the database on the ORM relational entity, but they come out on the JSON-entity, and you need to tell the cache about that. It's not hard, but you have to write a custom adapter. Also, if you do a batch-insert (we of course used uuids generated in the application layer) or batch-update, you then have to do a batch-select immediately afterwards to retrieve the JSON-ified record, otherwise there's nothing to put into the cache or return to the client, since the application layer itself never has a JSON-serialization mapping (that all lives in the DB).
(In the long term the direction I was trying to point us was that rather than having JSON views just be a straightforward JSONization of a given table/object we could maintain different JSON-builder-views for different SOLR views servicing different types of services if needed... I kinda wonder how far you'll get on a single representation.)
That project was a dumpster fire and after I left I think they've pretty much thrown all this out. Easier to just have everything in DB (not even 21c - no native document support lol) be text and have the application layer do all the JSONifying.
But I'm still proud of turning some pretty nonsensical design requirements into a reasonably cohesive rube-goldberg machine. It's not what I'd have designed, but we had some, uh, overpromising and mismanagement involved, and it was a reasonable attempt at implementing the idea that was sold to the client imo. None of us had any experience in this stack and we were all trying to feel our way through it.
Ironically a bunch of this "full/incremental resync" stuff also ended up being a nice mechanism to bootstrap the SOLR cluster at startup (a full update instead of a delta) and for catching config errors. SOLR and the DB must stay in sync, so, SOLR importer is pointed at the wrong DB (according to the SOLR instance's JDBC URL)? That's a nope, application won't deploy, 503 Fix Your Shit.
(in hindsight I wonder if we'd gone postgres (I did argue for it, but that one was actually a legit understandable business decision...) if we could have done a FDW and gone with a "push" model by having Postgres publish its changes right to SOLR...)
Sure, that's where the hard computer science is - but if you use a tool like Elasticsearch other people have spent decades solving those problems for you already.
I was a bit disappointed by some limitations back then when I tried it for a project of mine. When searching phrases where ordering matters, phraseto_tsquery() does not quite work for larger documents, as the tsvector position values are quite limited: https://www.postgresql.org/docs/14/textsearch-limitations.ht... Here I had much better success with SQLite's FTS implementation.
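For anyone who hasn't dug into it, those positions are what phrase search is built on; a quick illustration with standard functions (and, as I understand the linked page, positions get capped at 16383, so phrase matching degrades past that point in a big document):

    SELECT to_tsvector('english', 'quick brown fox');
    -- 'brown':2 'fox':3 'quick':1

    SELECT phraseto_tsquery('english', 'quick brown fox');
    -- 'quick' <-> 'brown' <-> 'fox'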
Good link, thanks. Have to say we're moving to pgSearch after implementing SOLR alongside a PG/Rails backend. Makes the stack simpler, fewer components to worry about (fewer headaches from gem versioning dependencies with SOLR). A lot to be said for it after reading through what's available now...
An alternative is to just layer the FTS on top of vanilla SQL to get the extra stuff (this is what I do for my eCommerce backend), so it's pretty simple to have something like this (a sketch; the table and column names are just illustrative):

    SELECT ...
    FROM catalog
    -- Get the FTS matches
    WHERE id IN (
        SELECT id FROM catalog
        WHERE search_vector @@ websearch_to_tsquery('english', :query)
    )
    -- The relevance is hardcoded (or maybe kept in another table that stores the
    -- rankings), based on the source: Products, Inventory, Invoices, ...
    ORDER BY source_rank, name;
I found it's much easier and more predictable if I code the "rankings" based on the business logic instead of letting the FTS engine guess. You can store that stuff as normal columns, or use the "sources" (i.e. products, inventory) as a way to know what could be more important to pull first.
This has the nice property that our search results are very good, and better yet: they never return nonsensical stuff! (like searching for an apple in the store and getting a blog post!)
I think we're talking about slightly different things. It doesn't matter what the variables are (your business logic) or some other variables.
Given a certain variable, tf-idf/BM25 will order by relevance and not just by match. So it answers questions like: what if the name match was off by two characters and the inventory number is less than 200?
tf-idf does not tell you what to order by... but it takes care of all the edge cases of ordering.
Now, if you're not using text matching anywhere and only using business variables... then this entire thread is not for you. But FTS and Lucene attack full-text search primarily, and that's where the relevance vs. ordering discussion comes from.
This has been my understanding of the state of Postgres full-text search. It's great if your search requirements are fairly vanilla, but I haven't seen any solutions for more advanced search needs, such as boosting, relevance, scoring, etc.
The ts_rank functions use the term frequency within that document, not a global term frequency (which is why you need a separate index, like Elasticsearch keeps).
This is important, because if a word is too common, it's considered less significant for a document match. When we calculate IDF, it will be very low for the most frequently occurring words, such as stop words (“is” is present in almost all of the documents, so tf-idf will give a very low value to that word).
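For reference, the textbook formulation (nothing Postgres-specific; here N is the total number of documents and df(t) is how many of them contain term t):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
    \qquad
    \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}

Computing df(t) needs exactly the global, corpus-wide statistics that a per-document ts_rank call never sees.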
Generated columns are not supported by older PG versions. For example, if you used AWS AuroraDB with Postgres compatibility, it didn't support generated columns until a few years ago.
In my experience using PG FTS, there are often some preprocessing tasks you might want to do before storing the index. One common issue is that the generated index has to fit inside the column size limit, so if your user provides a long text field with a lot of unique words, the index will be very large and might lead to a database error. Here you can check for that on the fly and truncate the result.
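A rough sketch of that truncate-on-the-fly idea using a trigger (the posts/body/search_tsv names and the 100k-character cutoff are all placeholders, tune to taste):

    CREATE OR REPLACE FUNCTION posts_tsv_update() RETURNS trigger AS $$
    BEGIN
        -- truncate the raw text so the resulting tsvector stays under its size limit
        NEW.search_tsv := to_tsvector('english', left(coalesce(NEW.body, ''), 100000));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER posts_tsv_trigger
        BEFORE INSERT OR UPDATE ON posts
        FOR EACH ROW EXECUTE FUNCTION posts_tsv_update();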
Not sure it would apply all that often for search indices, but joins are the usual reason I can't use generated columns when I want to; trigger functions (obviously) allow any query.
I used SQLite's built-in fuzzy search and it is surprisingly powerful. I used it for full-text search over a Magic: The Gathering card database and it was amazing.
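If anyone wants to try it, SQLite's FTS5 module is the usual entry point; a toy version of that kind of setup looks something like this (schema invented for illustration):

    CREATE VIRTUAL TABLE cards USING fts5(name, type_line, oracle_text);

    INSERT INTO cards VALUES ('Lightning Bolt', 'Instant',
                              'Lightning Bolt deals 3 damage to any target.');

    SELECT name, snippet(cards, 2, '[', ']', '...', 8) AS hit
    FROM cards
    WHERE cards MATCH 'damage'
    ORDER BY rank;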
As far as I can tell, Lucene is the library that powers the main search-related features of Elastic Search i.e. inverted indices, stemming, tokenization, etc.
Elastic Search, Amazon OpenSearch, Solr, etc. just wrapped Lucene up in a much easier-to-use HTTP/REST API and then added bells and whistles like a coherent DSL, scripting, authentication and authorization, jobs, a bulk API, etc. Then they built additional tools like Logstash and Kibana, which integrate using these APIs.
GP did say 'reimplement' (not making it sound trivial, like 'add a sqlite backend plugin') in fairness - i.e. in your terms 'why not implement the ES API for sqlite'.
More like "make a thing that's a bit like elasticsearch, but exposing sqlite query language to a sharded event store" -- not "make elasticsearch but use lucene" because elasticsearch is lucene, and reimplementing it via sqlite wouldn't give elasticsearch.
I'd call it elasticite and then (correctly!) be sued by both elastic and the sqlite orgs...
Because Lucene’s index format is highly optimized for information retrieval and SQLite’s isn’t. I love SQLite but Lucene is the sine qua non heart of ElasticSearch.
We built a listings site similar to eBay in Latin America after buying the operation from a European company: https://corotos.com
We made use of bson in the listings table for metadata, and FTS inside Postgres. Spanish dictionaries, weighting, white-listing of words, and including metadata inside the tsvectors have been awesome. Really quick search, you never have out-of-sync documents in another part of the stack, and you save a lot in storage.
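The tsvector side of it looks roughly like this (column names are illustrative, not our real schema):

    UPDATE listings
    SET search = setweight(to_tsvector('spanish', coalesce(title, '')), 'A')
              || setweight(to_tsvector('spanish', coalesce(description, '')), 'B')
              || setweight(to_tsvector('simple', coalesce(metadata_text, '')), 'D');

    SELECT id, ts_rank(search, q) AS rank
    FROM listings, websearch_to_tsquery('spanish', 'bicicleta aro 26') q
    WHERE search @@ q
    ORDER BY rank DESC;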
Metadata inside vectors sounds interesting. How do you use them? Because something like `id:234` would get split into two lexemes, making searching for it complicated.
Among other things in that earlier thread, there's some discussion of aspects of ZomboDB with its developer.
ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. (From the project's website[0]).
Lots more info in the README at the project's GitHub[1].
I am feeling real bad now, as I used a LIKE query as OP did (although in MySQL) to implement a custom search over two fields (Name & Description) in one of my personal projects. I thought it worked pretty well though, and I was so happy about it.
Don't confuse full-text search with word-part search. If you have a list of names, you're probably looking for the latter, where LIKE works fine and can be indexed with pg_trgm.
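A sketch of that route (the products/name names are just for illustration):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX products_name_trgm_idx ON products USING gin (name gin_trgm_ops);

    -- LIKE/ILIKE now uses the index even with a leading wildcard:
    SELECT id, name FROM products WHERE name ILIKE '%airpods%';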
slightly off-topic, but if the goal is to store vast amounts of protein and gene nucleotide sequences and allow wildcard searches (e.g., %CTAAACGGG%), would postgres be the recommended approach?
assume 200K proteins/genes with nucleotide sequences ranging from 20 base pairs to 2.3 million base pairs.
PostgreSQL will not perform well on genome data compared to specialized algorithms like BLAST and its alternatives. For exact search a suffix array should allow fast lookup, but DNA search usually requires approximate matches.
In theory you can support this search with a suffix tree [1] or n-gram index [2] (but you will need n > 3 because your alphabet is small).
I'm not sure what ready-made tools or libraries you might find for this - last time I checked there were a lot more papers than production-quality open source code.
Or you could always just try a fast RE engine like Hyperscan and see if it is fast enough for your use case.
I had some friends who worked on research of data structures specifically for genome sequences. No idea about the details, but they're in the same tune as this paper: https://hub.hku.hk/handle/10722/60628
(The first author was their supervisor; they published a series of similar computational biology papers that you might be interested in, so if you check his other publications you (GP) might find something interesting, or at least find an entry point to dig into references and citations that may result in something you'd be interested in -- that said, if you find any data structure useful you'll probably have to implement it yourself...)
Very convenient for me that this article came around: I'm weighing the pros and cons of PostgreSQL and Elasticsearch for full-text search for the business's use case.
Also consider full ngram search, so you can do client side stemming and use a single index for all languages. It's not out of the box, but installing the ngram extension is literally one line.
Mandatory mention of the RUM extension (https://github.com/postgrespro/rum) if this caught your eye. Lots of tutorials and conference presentations out there showcasing the advantages in terms of ranking, timestamps...
It's much better to use elastic (or solr) and keep your text indices out of the database. They have more features and you keep the load separate. This is important if you suddenly have to index a ton of new documents or even recreate your indices.
I used to keep text indices in Oracle and it would take days to sync them.
There's even a Postgresql Foreign Data Wrapper for elastic, so you can integrate them easily.
One project I witnessed used Postgresql solely as a middle man between the main database and Elastic.
I still think it's best to use something outside of the database, unless you don't have a large volume of docs, don't have complex searches, and don't have a lot of concurrent searches.
I read this article about 6 years ago and after an afternoon I added full-text search to the app I worked on at the time. 6 years later, it's still the only search they use. This small tool already built into postgres has been more than enough for us.
I've had similar luck with sqlite, of all things. Unless it's envisioned to handle huge user loads, sqlite does a pretty amazing job of FTS, all contained in a tiny binary that runs when I want and only when I want. I find it very useful for finding nuggets in titles and descriptions, spanning across thousands or millions of records. The index is often double or even triple the size of the database, but it's worth it for a query that returns in a heartbeat.
For static databases (especially where the content is not deleted, or rarely updated) consider using contentless tables or external content tables. The database size remains virtually the same, but you'll have immediate access to almost all FTS features.
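The external-content variant looks roughly like this (table and column names invented; a contentless table is the same idea with content=''):

    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT);

    CREATE VIRTUAL TABLE articles_fts USING fts5(
        title, body,
        content='articles', content_rowid='id'
    );

    -- with external content you keep the FTS table in step yourself
    -- (usually via triggers), e.g. after loading articles:
    INSERT INTO articles_fts (rowid, title, body)
        SELECT id, title, body FROM articles;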
I worked on a project where we needed to search a set of approximately 2.5M addresses and return the addresses' approximate coordinates. We already used Postgres for our other database needs, so slotting this into our existing stack took us all of a couple of days.
I love Postgres but lack of CJK full-text search is the main reason why I am going with Lucene/Solr for in-app search.
It's possible to make Postgres do amazing things (e.g. job queue with SKIP LOCKED!) but at what point do you draw the line and use the right tool for the job?
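(For anyone unfamiliar, the SKIP LOCKED trick is roughly this shape; the jobs table here is hypothetical:)

    BEGIN;
    SELECT id, payload
    FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED;
    -- ...do the work, then:
    UPDATE jobs SET status = 'done' WHERE id = $1;
    COMMIT;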
When the feature doesn't exist! For existing workflows/functionality, I think using a new tool for a feature that doesn't (and won't for some time) exist is a _perfectly_ good reason to introduce the tool. On top of that, you're looking at using something that is mature and well-used: Lucene/Solr.
If you wanted to be cheeky you could go ask your CTO if they would foot the bill for someone like 2nd Quadrant to write CJK full-text search into pgsql. ;)
I've done this for some features/issues in random OSS projects and if you feel someone else has the same problem I always feel paying money to have it implemented is well worth it.
In general, we have absolutely no problem with CJK in Postgres. Search is very application specific and some of our users want to do things like kana + kanji search. [1] I don't even know where you would start with that in Postgres.
We also need stuff like language detection and analysis of mixed language data.
A big advantage of going with a dedicated search tool is that it teaches you what you don't know about search, and it turned out that we knew pretty much nothing.
Since last year I've migrated virtually all serious content to SQLite, which handles most of my FTS needs flawlessly, or can be programmed to do so with little trouble. The only thing missing is a default 'websearch' feature like in PostgreSQL, but you can always write functions to add search syntax and faceted searches.
I wrote this tutorial on implementing faceted search with PostgreSQL and Django a while ago: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...