> You’re dealing with survivorship bias - those companies whose products were successful enough that they needed your services
In fairness, Redis has only been available for 12 years so someone who has been in the industry longer has probably encountered systems using DBs as queues for no better reason than an alternative was not available when the system was made. (Rabbit just a couple years older I think.)
But in this day and age, you have better options to start with c'mon.
> they spent too much time fiddling with RabbitMQ, when Postgres would have been fine.
Rather ironic claim in a thread full of tips, tricks, and gotchas just to make PG behave like a queue no?
I can't speak for RabbitMQ personally but Redis has got to be among the simplest drop-in dependencies I've ever encountered. For a simple queue, the defaults work out of the box.
Gut feel also says Redis defaults for queuing would hold up under more load than PG defaults for queuing, so you could leave it untouched for far longer to focus on your product, get to profitability, etc. Feel free to debunk with data.
There's one incredibly valuable feature you get with PG that you will never get with Redis or RabbitMQ or any other task queue: Enqueueing tasks transactionally with your unit of work.
If your load requirements fit using your database as a queue, it can radically simplify your system.
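A minimal sketch of what that transactional enqueueing buys you, using SQLite from Python purely for portability (the pattern is identical over a Postgres connection); all table and column names here are made up for illustration:

```python
import sqlite3

# Illustrative schema: a business table plus a job table in the SAME database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued'
    );
""")

def signup(conn, email):
    # Both writes commit or roll back together: there is no window where the
    # customer exists but the confirmation-email job was lost, or vice versa.
    with conn:  # sqlite3's context manager wraps this in one transaction
        cur = conn.execute("INSERT INTO customers (email) VALUES (?)", (email,))
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES ('confirmation_email', ?)",
            (str(cur.lastrowid),),
        )

signup(conn, "alice@example.com")
queued = conn.execute(
    "SELECT COUNT(*) FROM jobs WHERE status = 'queued'").fetchone()[0]
```

With an external broker, the INSERT and the publish are two systems, so one can succeed while the other fails; here that failure mode simply doesn't exist.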
I've personally done a fair bit of migration from formal queue systems into an RDBMS. You simply get a lot more control and visibility that way. But I use task queues heavily too. It's all very application-specific and I don't think you can generalize.
Not sure why you were downvoted, but perhaps they were thinking about Redis transactions (MULTI/EXEC, or Lua scripts that can do multiple operations atomically). If Redis is your datastore (groan), then you can transactionally enqueue tasks with your unit of work.
Real-time trading systems can totally get away with only using Redis, outside of historical data (which obviously you'd stick into Postgres or your favorite flavor of DB).
You can actually do this with RabbitMQ - it has a feature called "Publisher Confirmations", where you send a message, and then wait for an "ack" in return.
If you’re using the same database for your queue as your business entities, then you can wrap both in the same transaction - I think that’s what the parent post is getting at. For example, save a customer entity and add a job to send them a confirmation email. As far as I know, that’s something you can’t do natively if you’re using RabbitMQ and PostgreSQL. Of course you can work around it, but when your workloads are low you don’t need to, which is great for early stage products.
It doesn’t help guarantee the email is sent, but it makes it easier to find and handle unhappy edge cases, which is where a lot of time is spent in an early-stage startup.
> Rather ironic claim in a thread full of tips, tricks, and gotchas just to make PG behave like a queue no?
There are libraries in many languages written specifically to accomplish the "make PG behave like a queue" part. All these people are explaining the principle of how it would work, but in practice, you can just pull in one of those libraries and be done. (Or, more accurately, you use an abstract external-job-queuing library and tell it to use its Postgres adapter, which in turn uses that library.)
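For illustration, the core claim-a-job pattern such libraries implement looks roughly like this. It's sketched here with SQLite; Postgres adapters typically do the claim with `FOR UPDATE SKIP LOCKED` so concurrent workers never block each other. All names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued')""")
conn.executemany("INSERT INTO jobs (payload) VALUES (?)", [("a",), ("b",)])
conn.commit()

def claim_one(conn):
    """Atomically move the oldest queued job to 'running' and return it."""
    with conn:  # one transaction for the whole claim
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None  # queue empty
        # The status guard in the WHERE clause makes the claim safe even if
        # another worker grabbed the row between our SELECT and UPDATE:
        # rowcount tells us whether we actually won the job.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' "
            "WHERE id = ? AND status = 'queued'", (row[0],))
        return row if cur.rowcount == 1 else None

job = claim_one(conn)
```

The libraries layer retries, visibility timeouts, and dead-lettering on top of this same few lines of SQL.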
In exchange, you get the ability to not have to set up / maintain any more infra than you already have. Which can be important if you're already "at scale", where every component you set up needs to be duplicated per geographic-region, clustered for High Availability, etc. If you've already done all that for Postgres, it'd be nice to not also have to do all that again, differently for Redis, RabbitMQ/Kafka/NATS, etc.
> Gut feel also says, Redis defaults for queuing would also serve more than PG defaults for queuing
Don't know about what you've got your queues doing, but our service only does (persistent, out-of-process) queuing for two purposes:
1. Global scheduled background-job queuing. (Think "cron jobs.") There are only a static number of these; they don't scale up with load.
2. Lifecycle email dispatch. This scales up with MAU, but with very very low coefficients.
For those cases, Postgres would totally be enough. (Anything would be enough!)
Most "queues" needed in the real world are requirements-light like this. It's a rare company whose product is doing something Apache Beam-alike with its data, where a high-throughput reliable MQ is a fundamental part of the data workflow. (And those companies, I hope, know that they need an actual MQ, not a piece of software that does queuing in its spare time.)
Our company doesn't use Postgres queuing; we do, in fact, use Redis for it instead. But we only ended up doing that, because we already needed Redis for other stuff; and if you already have Redis in play (including an ops plan for scaling it), then it's the better tool for the job.
> In exchange, you get the ability to not have to set up / maintain any more infra than you already have.
And, you mean, a PGQ will not need any special monitoring other than the common DB metrics?
For instance, if I ever run a queue, it's just due ops diligence to know the length of the queue, average time an item spends in queue, throughput over time, among others. Are there standard monitoring modules that would check this for a PGQ? Because in exchange for setting up a proper queue cluster, compatibility and documentation for common stuff like this is also what you get.
The first one is particularly a sticking point for me. If you don't do it right, you will end up issuing a COUNT on a table periodically. You might say it's acceptable for most companies, because they don't need high-performance queues, but I specifically need my monitoring to be reliable during times of unexpected high load. Also, ideally, there is close to zero chance that my telemetry is the one that ends up causing me trouble.
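For what it's worth, one common way to avoid the periodic COUNT is a trigger-maintained counter, so reading queue depth is a single-row lookup. A sketch with SQLite (the same idea carries to Postgres, with contention caveats on the hot counter row); the schema is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'queued');
    CREATE TABLE queue_stats (name TEXT PRIMARY KEY, depth INTEGER NOT NULL);
    INSERT INTO queue_stats VALUES ('jobs', 0);

    -- Every enqueue bumps the counter...
    CREATE TRIGGER jobs_ins AFTER INSERT ON jobs BEGIN
        UPDATE queue_stats SET depth = depth + 1 WHERE name = 'jobs';
    END;
    -- ...and leaving the 'queued' state decrements it.
    CREATE TRIGGER jobs_done AFTER UPDATE OF status ON jobs
    WHEN old.status = 'queued' AND new.status != 'queued' BEGIN
        UPDATE queue_stats SET depth = depth - 1 WHERE name = 'jobs';
    END;
""")

for _ in range(3):
    conn.execute("INSERT INTO jobs DEFAULT VALUES")
conn.execute("UPDATE jobs SET status = 'done' WHERE id = 1")

# Reading the metric is an O(1) primary-key lookup, not a table scan.
depth = conn.execute(
    "SELECT depth FROM queue_stats WHERE name = 'jobs'").fetchone()[0]
```

Whether that counter row becomes its own bottleneck under very high enqueue rates is exactly the kind of caveat being debated here.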
Also, why does no one consider the added "surface for human error" that using a DB for a queue introduces? For instance, anyone can drop into the DB and perform a query you were not supposed to. A malicious actor can update all outgoing emails in the queue to another recipient. If these are sensitive emails like password resets or OTPs, good luck. A dedicated queue process does not allow such operations.
I have to say, it's very perplexing to me that people (especially in HN) would avoid using the right structure just because they have to invest a bit more work.
> I have to say, it's very perplexing to me that people (especially in HN) would avoid using the right structure just because they have to invest a bit more work.
You're imagining this as if it's just the CapEx (the work of setting up Redis), but in reality it's the OpEx (the work of maintaining a Redis cluster) that kills you. Think of it in terms of the number of ops salaries you have to pay. At scale, this number goes up in proportion to the complexity of your stack.
Actually, if you read the rest of my previous reply (as opposed to an offhand comment at the end) you will see that I'm considering the logistics of operating an in-DB queue as well.
Using an in-DB queue doesn't give you zero OpEx does it? Maybe you can make the argument that it's cheaper but it's cheaper for a reason: the job wasn't done properly, so to speak.
Both options introduce new failure modes and op costs into your system. Might as well do it properly if (and this is a huge IF in my opinion) it costs slightly more. When you run into a failure it's standard; maybe the solution is even one Google search away, as opposed to realizing, one hour into a late-night debugging session, that, shit, that table should be treated as a queue!
I read your whole reply. I didn't see any mention of anything I'd consider an OpEx cost.
Re: Monitoring
Presuming you're already monitoring anything at all, adding monitoring for a new piece of infrastructure is a pure CapEx cost. You presumably already have monitoring infrastructure running. Its ingestion, by design, will be 99% idle — i.e. it won't need to be scaled horizontally proportionally to the number of components. The only thing needed, then, will be careful up-front design for the monitoring.
Which is usually also a cost already paid for you in advance when you use an abstract job-queue library. They all know how to expose Prometheus metrics endpoints, and they do the accounting to serve those endpoints efficiently (usually using process-local per-job-producer and per-job-consumer counters, which you must then roll up yourself at the PromQL level, taking the irate() to find spikes.)
Re: AAA
> Also, why does no one consider the added "surface for human error" that using a DB for a queue introduces.
Because the type of person who reaches for their DB first to solve a novel problem, is the same type of person who understands and makes use of their DB's security features, to the point that doing more DB-security-config isn't a complex one-off problem for them, but something they can whip off in a few minutes.
For a simple option, you can create two service-account DB roles: a queue_producer, and a queue_consumer. You can put all the job-queue implementation tables in a schema owned by the queue_producer; and then grant the queue_consumer SELECT privileges on all the tables, and UPDATE privileges on some of them. Then, nobody but the queue_producer (or a DB superuser) can create or destroy jobs; and nobody but a queue_producer, a queue_consumer, or a superuser, can read or modify jobs. (Your job-queue abstraction library within your service usually maintains its own DB connection pool anyway, so it's no sweat to have those connections use their own credentials specific to their job-queue role.)
For a more complex — but perhaps useful? — option, the tables themselves can be "hidden" behind stored procedures (DDLed into existence by the abstract job-queue library), where nobody has any privileges (incl. SELECT) on the tables, only EXECUTE rights on the sprocs. And the sprocs are carefully designed to never do anything that could have an unbounded CPU time. Then anyone can "check up on" or even "insert into" the job-queue, but nobody can do anything "funny" to it. (My god, it's an encapsulated API!)
Once again — the libraries that abstract this away, already think about concerns like this, and choose one or the other of these options. That's why libraries like this exist, when the "core premise" is so simple: it's so there's a place to put all the fine details derived from man-years of thought on how to make this approach robust.
-----
On a completely different track, though: having a queue in the DB can sometimes be the optimal (i.e. the "if we had infinite man-hours for design + implementation") engineering decision. This case comes when the thing the queue is operating upon is data in the DB. In such a case, the DB data modification, and the job's completion, can succeed or fail together atomically, as part of a single DB transaction.
To accomplish the same thing when your queue lives outside the DB, you usually end up either needing some really gnarly distributed-locking logic that both your DB and your app layer need to know everything about (leaky abstraction!); or you need to completely remodel your data and your job queue into an event-streaming paradigm, so that you can "rewind" one side when the other side fails.
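To make the succeed-or-fail-together point concrete, here is a toy consumer that applies a job's effect and acknowledges it in one transaction (SQLite again for portability; the schema and names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER,
                       status TEXT NOT NULL DEFAULT 'queued');
    INSERT INTO accounts VALUES (1, 100);
    INSERT INTO jobs (account_id, amount) VALUES (1, -30);
""")

def run_job(conn, job_id):
    # One transaction covers both the work *and* the ack. If anything in here
    # raises, the balance change and the status change roll back together;
    # no distributed two-phase dance, no "applied but never acked" limbo.
    with conn:
        acct, amount = conn.execute(
            "SELECT account_id, amount FROM jobs WHERE id = ?",
            (job_id,)).fetchone()
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, acct))
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))

run_job(conn, 1)
```

With the queue outside the DB, you'd instead need idempotent handlers plus at-least-once redelivery to paper over the gap between the two commits.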
> For a simple suggestion, the job queue can live in a schema that is owned by a job-queue service-account role. Nobody other than that user (or a DB superuser) can issue any query in there.
Yes, service accounts are fine and dandy, but [a] isn't that more hoops borne by choosing the wrong tool for the job (gotta stay mindful of those permissions, you can have an Ansible typo, etc) and [b] as I said, proper queue services would not even allow tampering with enqueued data. For no extra cost other than installing the service. In contrast, your service-account credentials falling into a malicious actor's hands is still a threat vector.
> For a more complex — but perhaps useful? — option ...
Now this is just a queue service with extra steps running in a relational DB instead of natively as an OS process. You did cite it as just an option but I don't see why this is an attractive option.
> I didn't see any mention of anything I'd consider an OpEx cost.
It seems to me we have differing definitions of OpEx and, judging by your previous comment, you value this almost exclusively in terms of "the number of ops salaries you have to pay". Even if I play by that definition, I could tell from experience (and also corroborated by other commenters here, in other subthreads of this discussion), that operating a Redis cluster does not need extra warm bodies. The people who monitor that your app responds, your servers aren't running out of memory, could take on the task with little to no additional training.
The experience I want to cite, bearing in mind of course that everyone's mileage varies: in my previous job, customer base of 2M and growing across two countries, we operated and monitored a redundant Redis cluster with an ops team of 3. In my first job I talked about in another subthread here, we managed three high-traffic Redis queues (and a few other smaller-traffic special cases) with a team of 7. In both jobs, we also did active across-the-stack development; we weren't dedicated to monitoring, nor to the components related to the queues.
In fact I would argue running a DB (queue or no queue) is just more complex than running a queue service. Rather telling is the need for "DB Administrators/Specialists" but no such equivalent for Redis or other proper queueing solutions.
> Which is usually also a cost already paid for you in advance when you use an abstract job-queue library. They all know how to expose Prometheus metrics endpoints...
Honestly, this sounds fine but I'm rather wary of the caveats that might be present. How do they do it efficiently? What do I need to keep in mind to keep things efficient? As such, unless you can point me to your preferred job-queue library which does all these wonders, I have to reserve judgment.
Edit: I just remembered mentioning this in another subthread. Even if you have a fantastic library/middleware abstracting all those queue concerns, that ties you in with that library. If someone wants to build integrations into your queue, they have to play by that library. If the majority of your app is in Java (and so is the library), and Data Science (who, in this example, uses Python; not improbable) wants to share your data pipeline, and that library isn't available for them, tough luck I guess? More dev time for someone.
And also, whatever underlying DB features or semantics your library might rely on in order to enforce a queue structure, you can't be assured that future versions of your DB would support those features because, you know, your DB isn't really in the business of queues. It opens you up to running an outdated DB version just to keep that queue functionality.
> Now this is just a queue service with extra steps running in a relational DB instead of natively as an OS process. You did cite it as just an option but I don't see why this is an attractive option.
Your DB then shares a WAL log with your queue. Meaning a single managed physical replication pipeline for them both. Meaning only one set of leader-election issues to debug, not two. Meaning one canonical way to do geographic high-latency async replication. Meaning disaster recovery brings back a whole-system consistent snapshot state. Etc.
Honestly, if I had my way, every stateful component in the stack would all share a single WAL log. That’s what FoundationDB and the like get you.
> In fact I would argue running a DB (queue or no queue) is just more complex than running a queue service.
Well, yeah, but you usually need a DB. So, if you’re going to be paying the OpEx costs of the DB either way, then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.
(I feel the same way about Redis, as it happens: if you need it, and are locking your code into its model anyway, then you may as well take advantage of its more arcane features, like Redis Streams, Lua scripting, etc.)
However, maybe our company is uncommon in how much our service literally is doing fancy complex DB queries that use tons of DB features. We’re a data analytics company. Even the frontend people know arcane SQL here :)
> that ties you in with that library
The difference between what you / apps / abstracting libraries do in Redis, and what they do in an SQL DB, is that in the DB, the shape of everything has to be explained in a vendor-neutral manner: SQL DDL.
Sometimes Redis-based solutions converge on conventional schemas; see e.g. Sidekiq’s informal schema, which several other queuing systems are implemented in terms of. But when they don’t, there’s nothing you can really do — beyond hacking on the libraries involved — to bring them into sync.
In an SQL DB, anything can be adapted into the expected shape of anything else, by defining SQL views. (Heck, in an SQL DB with Redis support, like Postgres with redis_fdw, the Redis data can be adapted into any shape you like using SQL views.)
And that’s further enabled by the fact that the DB had received from the app, through DDL, a schema, that you can examine, manipulate, and refactor; or even synthesize together with other schemas.
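A toy example of that adaptation, assuming two made-up job-table shapes: a view lets a tool expecting one shape read data another library stored, without touching the underlying rows (SQLite here; Postgres views work the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Library A's (hypothetical) job table.
    CREATE TABLE lib_a_jobs (id INTEGER PRIMARY KEY, body TEXT, state TEXT);
    INSERT INTO lib_a_jobs (body, state) VALUES ('send-email', 'pending');

    -- Adapter: expose lib A's data under the shape (hypothetical) lib B queries,
    -- renaming columns and translating the status vocabulary.
    CREATE VIEW lib_b_tasks AS
        SELECT id AS task_id,
               body AS payload,
               CASE state WHEN 'pending' THEN 'queued' ELSE state END AS status
        FROM lib_a_jobs;
""")

row = conn.execute(
    "SELECT task_id, payload, status FROM lib_b_tasks").fetchone()
```

There's no Redis-level equivalent of this: if two Redis-backed queue libraries disagree on key layout, you're hacking on the libraries themselves.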
> you can't be assured that future versions of your DB would support those features
You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.
> Your DB then shares a WAL log with your queue...
I feel like the understanding we can come to here is that we have differing definitions of necessary complexity.
You did mention you work in data analytics, and I have worked and am working in more traditional product-oriented gigs. Everything you mentioned is nice, impressive even, but to me they are toys. I don't need a queue with a WAL stream, or PITR backups. Queues, to me, are inter-process communication mechanisms, or a means to concurrency. In fact, worst case, you can delete the queues or restart Redis without ever triggering a disk write (though I note persistence is another feature that comes free with Redis); it would inconvenience our customers but they can always just retry.
Of all the benefits you mentioned, leader-election is the only one I could make a good case for.
> then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.
Understanding it is one thing but using that understanding to create/develop (CapEx) and then maintain (OpEx) an in-DB queue feels like unnecessary costs. CapEx to install Redis is practically nil, and you seem to agree that running it is far simpler OpEx than PG too ("Well, yeah, but...").
Of course, I keep in mind your point about libraries doing the dirty work. But if it all boils down to a third party in the end, the point I'm replying to above is rather moot no? And if it comes to a third dependency anyway, I've already reasoned out my preference for an actual queue solution.
> explained in a vendor-neutral manner: SQL DDL
> You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.
While this is an ideal case, my experience so far shows that every major DB package relies on its own tweaks to the SQL standard, so "if those features are in the SQL standard" turns out to be a pretty big "if" to ask. I don't worry about vendors regressing on SQL standard support, but rather that the libraries for DB queues are relying on non-SQL-standard features/behavior to mock out full queue functionality: non-standard behavior that has no guarantee of consistency across versions.
I mean, if there's a DB queue middleware/library that works across DB vendors, be it Postgres, Oracle, Maria, MySQL (with a reasonable storage engine, perhaps) then that's a convincing argument that the SQL standard can support queuing sufficiently. But otherwise, this counterpoint is a hard sell for me.
> it would inconvenience our customers but they can always just retry.
We have very different things in our queues :) An MQ, to me, is a place to put stuff that needs reliable, at-least-once delivery. Like, say, invoices. User never receives that, bad things happen.
There's very little difference between that type of data, and the data in your DB, while it's "active." It's only once the data in the queue is fully ACKed and no longer relevant that the data behaves more ephemerally than first-class DB state.
> and you seem to agree that running it is far simpler OpEx than PG too
I did not say that. In our stack, we run a Redis cluster on top of Kubernetes, because it's nominally semi-stateless. In reality, it's our biggest SPOF. It fights the container daemon around memory allocations (why does a Redis container that uses 2GB memory steady-state need a 64GB memory limit to finish starting up!?) and has weird dynamics around startup readiness (i.e. Redis started from a large AOF-with-RDB-preamble file will start serving requests for just a moment, then immediately stop again to finish loading. This confuses k8s into thinking it's done with startup-probes and is ready for readiness-probes. Which it's not.)
Redis, when configured to be both Highly Available and durable (which is not your use-case, I know) is really just less-robust than a DB, with less management tooling, fewer docs, fewer people who know how to scale it, fewer third-party solutions pre-adapted to expect it, etc. Because, to paraphrase: Redis isn't really in the business of being a database. And yet that's exactly what you end up with, when you try to use Redis as a reliable at-least-once MQ: a database. Just, a kinda sucky one. (One that doesn't even have a concept of users with different privileges; let alone RBAC!)
And yet, that actually is the niche of Redis. It's the durable, client-server cousin to in-process peer-mesh state-sync solutions like Hazelcast/Mnesia. Which makes it kind of weird that it's not particularly tuned at solving the problems in the one niche unique to it.
(I have nothing against Redis! It has non-unique use-cases where it works great, e.g. when it's being used as "memcached with more data structures", or as a coordination/locking server instead of Zookeeper, etc. A robust reliable MQ it is not. My argument is that your average RDBMS is closer to being a robust, reliable MQ than Redis is. Though, of course, you'll get the best robustness+reliability for that use-case out of a "real" MQ — Kafka or NATS or something.)
> but rather that the libraries for DB queues are relying on the non-SQL-standard features/behavior to mock out a full queue functionality
They don't have to, because everything you need to implement reliable queuing semantics is already part of [the lowest-common-denominator part of the] SQL standard [that everything that calls itself an RDBMS implements.]
These libs have DB-specific adapters for two reasons:
1. There's no such wire protocol as "SQL" — every RDBMS speaks its own wire protocol to convey your SQL statements to the database and convey result-sets back; and so, unless you're using a language with a common API abstraction for RDBMS client libs to be built against (e.g. JDBC), then each such client lib ends up presenting its own distinct API to callers, which needs a separate adapter written for it.
2. with SQL, reliable queuing is "easy", but knowing about enqueued items (with as minimal an overhead as possible, scanning as few additional items as possible) is hard. SQL says nothing about how to get push-notification-to-the-client type information out of the DB, and so every DBMS implements its own mechanism for this. These are pure optimizations using pure-ephemeral code, though; the schema of the queue in the DB remains the same either way, such that the data could be exported from one RDBMS to a .sql file, imported from that .sql file to a different RDBMS, and the only thing needed on the client side would be switching adapters.
(In theory, presuming a common wire-protocol abstraction interface lib like JDBC, you could have a "standard SQL" adapter. It'd just be annoyingly low-throughput. Like using HTTP polling when you could be using WebSockets.)
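Such a lowest-common-denominator adapter's consume loop would look roughly like this sketch (SQLite, hypothetical schema): plain polling with a sleep, where a Postgres-specific adapter would block on LISTEN/NOTIFY instead.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
             "status TEXT NOT NULL DEFAULT 'queued')")
conn.execute("INSERT INTO jobs (payload) VALUES ('hello')")
conn.commit()

def poll_once(conn):
    # Standard-SQL claim: select the oldest queued job and mark it running.
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' "
            "ORDER BY id LIMIT 1").fetchone()
        if row:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?",
                         (row[0],))
        return row

def consume(conn, interval=0.01, max_polls=100):
    for _ in range(max_polls):
        job = poll_once(conn)
        if job:
            return job
        # A push-capable adapter (LISTEN/NOTIFY etc.) would block here
        # instead of burning a poll interval.
        time.sleep(interval)
    return None

job = consume(conn)
```

The queue schema and claim SQL are identical either way; only this waiting strategy is DB-specific, which is the point being made above.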
I've got about 6 years operational experience with Redis. I would not use it for a queue. I struggle to consider using it as a data store these days as well with other options on the table. But that's a whole new thread so I'm not going into details here.
If you want a queue, get a queue.
If you want a relational database, get a relational database.
If you want a key-value store, get a key-value store.
Using a tool without knowing reasonable bounds of the domain, current requirements, and how the tool (redis in this case) solves the problem isn't good advice.
Case in point: our team went with Redis, just the defaults, used it blindly without fully understanding our requirements and how Redis helps scale.
Two years later we spent two sprints, holding back the release: trying to understand RDB vs AOF, and why we were seeing massive spikes in consumption and performance triggering pod eviction; running comparison tests to prove which works better and explaining why; running QA tests (regression, performance, load); introducing Postgres for queuing; redoing our code to bypass the sync mechanism between Redis and Postgres; updating dependencies; migrating existing customer data (various on-site locations); explaining this to team members, managers and their managers, installation technicians, and support engineers; and presenting it at engineering townhalls as a case study in bad decisions.
Well, by your admission, you used Redis for a problem domain it wasn't suited for in the first place. How is this an argument for using in-database queues?
> use it blindly without fully understanding our requirements and how redis helps scale
I'm sorry, I don't get how I could come across as advocating the use of Redis blindly. My point is: if your data flow looks like a queue, then use a queue; don't hack a relational DB to become a queue. I think that's a reasonable rule of thumb, not going in blind.
We needed queues. We used Redis. That fits the domain.
Problem was there wasn't a good answer to "How much redis does your team need to know to put it in production".
We thought we knew it well enough, we thought we knew what we were getting into, and we thought so many others are using it for this, we should be good. That makes a difference, clearly.
Also, reading your reply, I get the impression that the "sync mechanism between Redis and Postgres" was the bottleneck. Wondering if you can add some details around it, and also whether this was something that couldn't have been fixed by fine-tuning the Redis config, rather than completely removing it from your stack.
There were many problems, but at the core of it, this was us having Redis write huge amounts of data to disk very frequently.
We could not reduce the frequency (the product would not allow it) and we couldn't find a way to make the writes reliably fast.
I like to believe there exists a possible way of handling this, but point being, our team had no time to find out how Redis works internally and gain confidence that the new approach wouldn't bring up new surprises.
You (or whoever is doing ops) will have to pay with your time to operate Redis until the product dies though.
I don't think adding another component (especially one with storage/backup requirements/more complex failover procedures) should be taken lightly.
Paying attention to the tips/tricks/gotchas is something you'd pay once and hopefully document in source code/documentation.
If you piggyback off someone else's work (another team at your job, a cloud-based solution that handles everything, etc) the calculus would change, of course.
> You (or whoever is doing ops) will have to pay with your time to operate Redis until the product dies though.
Repeating my other comment: do you mean to say using an in-DB queue will not need special monitoring other than what's already standard metrics for DBs?
Since a queue is a different structure, I would want to know metrics that might be unfamiliar for standard DB monitoring. Queue length, queue throughput, etc. Can I get that from my DB queue with negligible cost?
Another thing, there might be standard practices and patterns for in-DB queues, often encapsulated in a nice library/middleware. But as databases are not queues there is no assurance from the vendor that those patterns will remain applicable in future versions.
> Paying attention to the tips/tricks/gotchas is something you'd pay once
Actually, I disagree with this. Every time someone connects to that database, they have to remember that one special table which is used as a queue, and remember to tread differently where it is concerned.
Setting up a queue? You do that once and everyone who interacts with it will not forget it is a queue. Heck you can't even do your usual DB operations there. How can you not remember it is a queue?