A while back I was on a team building a non-critical, low-volume application.
It basically just involves people sending a message (there is more to it than that).
The consultants said we had to use Kafka because the messages could come in really fast.
I said we should stick with Postgres.
No, they said, we really need Kafka to be able to handle this.
Then I went and spun up Postgres on my work laptop (nothing special) and got a loaner to act as a client.
I simulated about 300% more traffic than we had any chance of getting, and it worked fine (though it did tax my poor work laptop).
No, they said, we could not risk it; with Kafka we would be safe.
I took it to management, and Kafka won, because buzzword.
Now of course we have to write a process to feed the data into Postgres. After all, it's what everything else depends on.
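For what it's worth, the simulation doesn't have to be anything fancy; roughly this kind of thing (a sketch, not the actual test; the table, target rate, and connection details are all made up):

```python
# Rough load-generator sketch (hypothetical "messages" table, made-up rates).
# Requires: pip install psycopg2-binary, and a reachable Postgres instance.
import time
import psycopg2

TARGET_PER_SEC = 500  # well above the real expected traffic

conn = psycopg2.connect("dbname=loadtest user=test host=192.168.1.10")
conn.autocommit = True  # commit per message, like a real client would

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id         bigserial PRIMARY KEY,
            sender     text NOT NULL,
            body       text NOT NULL,
            created_at timestamptz NOT NULL DEFAULT now()
        )
    """)

sent = 0
start = time.monotonic()
with conn.cursor() as cur:
    while time.monotonic() - start < 60:  # one-minute burst
        cur.execute(
            "INSERT INTO messages (sender, body) VALUES (%s, %s)",
            ("loadgen", "hello " + str(sent)),
        )
        sent += 1
        # crude pacing toward the target rate
        expected = (time.monotonic() - start) * TARGET_PER_SEC
        if sent > expected:
            time.sleep((sent - expected) / TARGET_PER_SEC)

print(f"inserted {sent} rows in {time.monotonic() - start:.1f}s")
```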
People tend to not realize how big "at scale" problems really are. Instead, anything at the edge of their experience becomes "at scale," and they reach for the tools they've read you're supposed to use in those situations. It makes sense; people don't know what they don't know.
And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.
The GHz aspect of Moore's law died over a decade ago, and I suppose it's fair to say most other scaling has slowed down too, but if you've got a job that is embarrassingly parallel, which a lot of these "big data" jobs are, and you're not paying attention, you'll badly underestimate how much progress there has been in the server space over the last 10 years even so. What was "big data" in 2011 can easily be "spin up a single 32-core instance with 1TB of RAM for a couple of hours" in 2021. And that's for "big data" beyond what my laptop comfortably handles.
I've been slowly wandering into a data science role, and I've been dealing with teams who are all kinds of concerned about whether or not we can handle the sheer, overwhelming volume of their (summary) data. "Well, let's see, how much data are we talking about?" "Oh, gosh, we could generate 200 or 300 megabytes a day." (Of uncompressed JSON.) Well, you know, if I have to bust out a second Raspberry Pi I'll be sure to charge it to your team.
The funny thing is that some of these teams have enough experience that they ought to know better. They are legitimately running cloud services with dozens of large nodes continually running at high utilization and chewing through gigabytes of whatever per second. In their own worlds they would absolutely know that a couple hundred megabytes is nothing. They'll often have known places in their stack where they burn through a few hundred megabytes in internal API calls or something, unnecessarily, and it will barely rise to the level of a P3 bug, quite legitimately so. But when they start thinking in terms of (someone else's) databases, it's like they haven't updated their sense of size since 2005.
You are not thinking enterprisey enough. Everything is Big Scale if you add enough layers, because all these overheads add up, to which the solution is of course more layers.
Were you advocating an endpoint + DB, or different apps directly writing into a shared DB? The latter is not really a good idea, for numerous reasons. Kafka is a (potentially overkill) replacement for REST or whatever, not for your DB.
You don’t want stream replication for scale, you want it for availability and durability. The number of times replication has saved my ass is too damn high.
And you want Kafka (or something like it) the moment you need two processes handling updates, which again you probably want for availability reasons, so a crash doesn't stop the stream.
You also don't catch deletes or updates with this setup. But there are a million DB-to-Kafka connectors that read the binlog or pretend to be a replica to handle this.
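On the Postgres side, "pretend to be a replica" usually means a logical replication client. A rough sketch with psycopg2 and the wal2json output plugin (the plugin has to be installed server-side; the slot name and connection details here are made up for illustration):

```python
# Minimal change-data-capture sketch: stream inserts/updates/deletes from
# Postgres via logical replication (assumes wal2json is installed server-side).
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=replicator host=localhost",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the slot once; comment this out on subsequent runs.
cur.create_replication_slot("cdc_slot", output_plugin="wal2json")
cur.start_replication(slot_name="cdc_slot", decode=True)

def handle(msg):
    # msg.payload is a JSON description of the change; from here you would
    # forward it to Kafka, another table, or wherever it needs to go.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge progress

cur.consume_stream(handle)  # blocks, invoking handle() per change
```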
>My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.
Sorry, that's just a clickbait-y statement. I love Postgres, but try handling 100-500k rps of data coming in from various sources, reading and writing to it.
You are going to get bottlenecked on how many connections you can handle, and you will end up throwing pgBouncers on top of it.
Eventually you will run out of disk and start throwing more in.
Then you end up in VACUUM hell, all while having a single point of failure.
While I agree Kafka has its own issues, it is an amazing tool for a real scale problem.
I think anotherhue would agree that half a million write requests per second counts as a valid answer to "you can't do what you need with a beefy Postgres," but that is also a minority of situations.
It's just hard to know what people mean when they say "most people don't need to do this." I was sitting wondering about a similar scale (200-1000 rps), where I've had issues with scaling rabbitmq, and have been thinking about whether kafka might help.
Without context provided, you might think: "oh, here's somebody with kafka and postgres experience, saying that postgres has some other super powers I hadn't learned about yet. Maybe I need to go learn me some more postgres and see how it's possible."
It would be helpful for folks to provide generalized measures of scale. "Right tool for the job," sure, but in the case of postgres, it often feels like there are a lot of incredible capabilities lurking.
I don't know what's normal for day-to-day software engineers anymore. Was the parent comment describing 100-500 rps really "a minority of situations?" I'm sure it is for most businesses. But is it "the minority of situations" that software engineers are actively trying to solve in 2021? I have no clue.
Note superyesh was talking about 100 to 500 thousand requests per second. Your overall question stands, but the scale superyesh was talking about is very different and I am quite confident superyesh's scale is definitely in the minority.
Oops, yes, was omitting the intended "k", totally skewing the scale of infrastructure my comment was intending to describe. Very funny, ironic. Unfortunately I can no longer edit that comment.
I’m not sure if you’re omitting the k in your numbers, or missed it in the other comment? Do you mean 100-500 and 200-1000, or 100 000-500 000 and 200 000-1 000 000?
i love postgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.
"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.
with rmq/kafka, there are already a bunch of tools to handle the exact case of a queue/event system.
I think having ad hoc query capabilities on your job queue/log (not possible with rabbit, and only possible with Kafka by running extra software like KSQL, and even then at the equivalent of a full-table-scan cost) is a benefit of using postgres that should not be overlooked. For monitoring and debugging message broker behavior, SQL is a very powerful tool.
I say this as someone squarely in the "bigger than can work with an RDBMS" message processing space, too: until you are at that level, you need less tooling and get a lot out of the box (e.g. read replicas, backups, very powerful why-is-it-on-fire diagnostics informed by decades of experience), and you generally get higher reliability with postgres as a broker.
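Concretely, the ad hoc debugging looks like ordinary SQL against the queue table. A sketch against a hypothetical jobs table (the schema and column names are made up; psycopg2 is just one way to run the queries):

```python
# Ad hoc monitoring of a hypothetical Postgres-backed job queue.
import psycopg2

conn = psycopg2.connect("dbname=app user=ops host=localhost")

with conn, conn.cursor() as cur:
    # Backlog by status: how many jobs are pending/running/failed, and how old?
    cur.execute("""
        SELECT status, count(*), min(created_at) AS oldest
        FROM jobs
        GROUP BY status
        ORDER BY count(*) DESC
    """)
    for status, count, oldest in cur.fetchall():
        print(status, count, oldest)

    # Which job types have been waiting the longest? (the kind of question
    # that is painful to answer from a broker without extra tooling)
    cur.execute("""
        SELECT job_type, count(*) AS pending,
               now() - min(created_at) AS max_wait
        FROM jobs
        WHERE status = 'pending'
        GROUP BY job_type
        ORDER BY max_wait DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)
```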
Yeah, once you do have to scale a relational database you're in for a world of pain. Band-aid after band-aid... I very much prefer to just start with Kafka already. At the very least you'll have a buffer to help you gain some time when the database struggles.
Kafka has its use case. Databases have theirs. You can make a DB do what Kafka does, but you also add in the programming overhead of getting the DB semantics correct to make an event system. When I see people saying "let's put the DB into Kafka" I make the exact same argument: you will spend more time making Kafka act like a database and getting the semantics right. Kafka is more of a data/event transportation system. A DB is an at-rest data store that lets you manipulate the data. Use them to their strengths or get crushed by weird edge cases.
I'd argue that having the event system semantics layered on top of a sql database is a big benefit when you have an immature product, since you have an incredibly powerful escape hatch to jump in and fire off a few queries to fix problems. Kafka's visibility for debugging is pretty poor in my experience.
My issues, typically, with layering an event system on top of a db are replication and ownership of that event. Kafka makes some very nice guarantees about making a best attempt to ensure only one process works on something at a time inside of a consumer group. You have to build a system in the db using locks and other things that are poor substitutes.
If you are having trouble debugging Kafka, you could use a connector to put the data into a database/file to debug, or a db streamer. You can also use the built-in CLI tools to scroll along. I have had very good luck with both of those for finding out what is going wrong. Also, Kafka will basically by default keep all the messages for the past 7 days, so you can play them back if you need to by moving the consumer offsets. If you are trying to use Kafka like a db and change messages on the fly, you will have a bad time of it. Kafka is meant to be a "here is something that happened, and some data" log. Changing that data after the fact would, in the Kafka world, be another event. In some types of systems that is a very desirable property (such as audit-heavy cultures, banks, medical, etc). Now, Kafka can also be a real pain if you are debugging and mess up a schema or produce a message that does not fit the consumer. Getting rid of that takes some painful CLI trickery, whereas in a db it is a delete call.
Also, Kafka is meant for distributed, event-based worker systems (typically some sort of microservice-style system). If you are early on, you are more than likely not building that yet. Just dumping something into a list in a table and polling on that list is a very effective way to go when you are early on, or maybe even forever. But once you add in replication and/or multiple clients looking at that same task list table, you will start to see the issues quickly.
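The simple version of that polling pattern looks roughly like this (a sketch; table and column names are made up, and FOR UPDATE SKIP LOCKED is what keeps two workers from grabbing the same row):

```python
# Minimal "table as a task list" worker: poll for unclaimed rows and lock them
# so concurrent workers don't process the same task (sketch; names made up).
import time
import psycopg2

conn = psycopg2.connect("dbname=app user=worker host=localhost")

def do_work(payload):
    print("processing", payload)  # your actual handler goes here

def claim_and_process():
    with conn, conn.cursor() as cur:  # one transaction per batch
        cur.execute("""
            SELECT id, payload
            FROM tasks
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 10
            FOR UPDATE SKIP LOCKED
        """)
        rows = cur.fetchall()
        for task_id, payload in rows:
            do_work(payload)
            cur.execute(
                "UPDATE tasks SET status = 'done', finished_at = now() WHERE id = %s",
                (task_id,),
            )
        return len(rows)

while True:
    if claim_and_process() == 0:
        time.sleep(1)  # nothing pending, back off
```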
Use an event system like a db system and, yes, it will feel broken. Also vice versa. You can do it, but as you are finding out, those edge cases are a pain and make you feel like "bah, it's easier to do it this way." In some cases, yes. In your case you have a bad state in your event data and you are cleaning it up with some db calls. But what if instead you had an event that did that for you?
I do not disagree. Kafka is kind of a monster to configure fully and correctly. Once they get rid of ZooKeeper it may be nicer to spin up, more of a "set it and forget it" sort of thing like the others.
Just because you can do it with Postgres doesn't mean it is the best tool for the job.
Sometimes the restrictions placed on the user are just as important. Kafka presents a specific interface to the user that causes users to build their applications in a certain way.
While you can replicate almost all functionality of Kafka with Postgres (except for performance, but hardly anybody needs as much of it), we all know what we end up with when we set up Postgres and use it to integrate applications with each other.
If developers had discipline they could, of course, create tables with appendable logs of data, marked with a partition, that consumers could process from with basically the same guarantees as with Kafka.
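Sketched out, that looks roughly like the following: an append-only log table plus a per-group offsets table (all names made up, and this is only an approximation of Kafka's topic/partition/offset model, not an equivalent):

```python
# Rough Kafka-ish shape in Postgres: an append-only log plus per-group offsets.
# All table/column names here are made up for illustration.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app user=app host=localhost")

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS event_log (
            id           bigserial PRIMARY KEY,   -- acts as the offset
            partition_id int   NOT NULL,          -- producer-chosen, e.g. hash(key) % N
            msg_key      text,
            payload      jsonb NOT NULL,
            created_at   timestamptz NOT NULL DEFAULT now()
        );
        CREATE TABLE IF NOT EXISTS consumer_offsets (
            group_name   text   NOT NULL,
            partition_id int    NOT NULL,
            last_id      bigint NOT NULL DEFAULT 0,
            PRIMARY KEY (group_name, partition_id)
        );
    """)

def produce(partition_id, key, payload):
    # Appending is just an INSERT. Caveat: sequence order is not strictly
    # commit order under concurrency, which is one of the details that
    # requires the "discipline" mentioned above.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO event_log (partition_id, msg_key, payload) VALUES (%s, %s, %s)",
            (partition_id, key, Json(payload)),
        )

produce(0, "user-42", {"type": "signup"})
```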
We work with a client who has requested a Kafka cluster. They can't really say why, or what they need it for, but they now have a very large cluster which doesn't do much. I know why they want it; it's the same reason they "need" Kubernetes. So far they use it as a sort of message bus.
It's not that there's anything wrong with Kafka; it's a very good product and extremely robust. Same with Kubernetes: it has its uses and I can't fault anyone for having it as a consideration.
My problem is when people ignore how capable modern servers are, and when developers don't see the risks in building these highly complex systems when something much simpler would solve the same problem, only cheaper and safer.
I've also had much more success with "queues in postgres" than "queues in kafka". I'm sure the use cases exist where kafka works better, but you have to architect it so carefully, and there are so many ways to trip up. Whereas most of your team probably already understands an RDBMS, and you're probably already running one reliably.
I had a great time with Kafka for prototyping: being able to push data from a number of places, have multiple consumers able to connect, go back and forth through time, and add and remove independent consumer groups. It ran very reliably in pre-production too, for years.
But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.
I see posts like this a lot, and it makes me wonder what the heck you were using Kafka for that Postgres could handle, yet you had dozens of clusters? I question whether you actually ever used Kafka or just operated it. Sure, anyone can follow the "build a queue on a database" pattern, but it falls over at the throughputs that justify Kafka. If you have a bunch of trivial 10tps workloads, of course a distributed system is overkill.
They didn’t say that the Kafka clusters they personally ran could have been handled with Postgres instead.
They first gave their credentials by mentioning their experience.
Then they basically said ”given what I know about Kafka, with my experience, I require other people who ask for it to show me that they really need it before I accommodate them - often a beefy Postgres is enough”.
Not those other tools, because you generally can't achieve Postgres functionality with those other tools.
Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.
Postgres has the best database query language man has invented so far (AFAIK), well-reasoned persistence and semantics, as of recently partitioning features to boot, and lots of addons to support different use cases. Despite all this, Postgres is mostly Boring Technology (tm), easily available, and very actively developed in the open, with an enterprise base that does consulting first and usually upstreams improvements after some time (2nd Quadrant, EDB, Citus, TimescaleDB).
The other tools win on simplicity for some people (I'd take managing a PostgreSQL cluster over RMQ or Kafka any day), but for other things, especially feature-wise, Postgres (and its amalgamation of mostly-good-enough to great features) wins IMO.
> My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.
...
You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.
If the right tool is a relational database, then use Postgres/$OTHER-DATABASE
If the right tool is a distributed, partitioned, and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG
Not sure why people get so emotional about the technologies they know best. Sure, you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.
Something I find peculiar about this truth is that Postgres itself is based on a write-ahead log structure for sequencing statements, replication, and audit logging.
It feels like two different lenses onto the same reality.
Can you kinda high level the setup/processes for making Postgres a replacement for Kafka? I've not attempted such a thing before, and wonder about things like expiration/autodeletion, etc. Does it need to be vacuumed often, and is that a problem?
Honest question, how do you expire content in Postgres? Every time I start to use it for ephemeral data I start to wonder if I should have used TimescaleDB or if I should be using something else...
You likely already know that TimescaleDB is an extension to PostgreSQL, so you get everything you'd get with PostgreSQL plus the added goodies of TimescaleDB. All that said, you can drop (or detach) data partitions in PostgreSQL (however you decide to partition...) Does that not do the trick for your use case, though? https://www.postgresql.org/docs/10/ddl-partitioning.html
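A minimal sketch of that approach, assuming a made-up events table partitioned by day (real setups would usually automate partition creation with something like pg_partman):

```python
# Expiring data by dropping time-range partitions (hypothetical table names).
import datetime
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id         bigserial,
            payload    jsonb,
            created_at timestamptz NOT NULL DEFAULT now()
        ) PARTITION BY RANGE (created_at)
    """)
    # One partition per day (in practice you'd create these ahead of time).
    today = datetime.date.today()
    for offset in range(0, 3):
        day = today + datetime.timedelta(days=offset)
        nxt = day + datetime.timedelta(days=1)
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS events_{day:%Y%m%d} "
            f"PARTITION OF events FOR VALUES FROM ('{day}') TO ('{nxt}')"
        )

# "Expiration" is then just dropping whole partitions that fall outside the
# retention window: cheap, and no dead tuples for VACUUM to chew through.
cutoff = datetime.date.today() - datetime.timedelta(days=7)
with conn, conn.cursor() as cur:
    cur.execute(f"DROP TABLE IF EXISTS events_{cutoff:%Y%m%d}")
```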
Maybe something like https://github.com/citusdata/pg_cron or if it's not a rolling time window, just trigger functions that get called whenever some expression in SQL is true.
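For example, with pg_cron the retention job can live in the database itself; a tiny sketch (it assumes the extension is installed and preloaded on the server, and the table and retention window are made up):

```python
# Scheduling a nightly retention DELETE inside Postgres with pg_cron
# (extension must be installed and preloaded; table/window are made up).
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_cron")
    # Every night at 03:00, delete anything older than 7 days.
    cur.execute(
        "SELECT cron.schedule(%s, %s)",
        ("0 3 * * *",
         "DELETE FROM events WHERE created_at < now() - interval '7 days'"),
    )
```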
using a sql db for push/pop semantics feels like using a hammer to squash a bug... how would you model queues & partitions with ordering guarantees with pg?
With transactions, and stored procedures if that helps ;). Redis also seems well suited to the use cases I've seen for Kafka. Kafka must have capabilities beyond those use cases, and I've sometimes wondered what they are.
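To make the "transactions" part concrete, a consumer loop might look roughly like this; a sketch that assumes a hypothetical event_log table with a bigserial id and a consumer_offsets table keyed by (group, partition), with the read and the offset bump done in one transaction so a crash never skips or double-commits a batch:

```python
# Consumer sketch: process events in id order for one partition, and move the
# group's offset forward in the same transaction (hypothetical table names).
import time
import psycopg2

conn = psycopg2.connect("dbname=app user=consumer host=localhost")
GROUP, PARTITION = "email-sender", 0

def handle(payload):
    print("got", payload)  # your handler goes here

def poll_once(batch_size=100):
    with conn, conn.cursor() as cur:  # one transaction: read, process, commit offset
        # FOR UPDATE on the offset row also serializes consumers in the same
        # group on this partition, loosely mimicking "one consumer per
        # partition per group".
        cur.execute(
            """SELECT last_id FROM consumer_offsets
               WHERE group_name = %s AND partition_id = %s FOR UPDATE""",
            (GROUP, PARTITION),
        )
        row = cur.fetchone()
        last_id = row[0] if row else 0
        cur.execute(
            """SELECT id, payload FROM event_log
               WHERE partition_id = %s AND id > %s
               ORDER BY id LIMIT %s""",
            (PARTITION, last_id, batch_size),
        )
        events = cur.fetchall()
        for event_id, payload in events:
            handle(payload)
            last_id = event_id
        cur.execute(
            """INSERT INTO consumer_offsets (group_name, partition_id, last_id)
               VALUES (%s, %s, %s)
               ON CONFLICT (group_name, partition_id)
               DO UPDATE SET last_id = EXCLUDED.last_id""",
            (GROUP, PARTITION, last_id),
        )
        return len(events)

while True:
    if poll_once() == 0:
        time.sleep(1)  # nothing new, back off
```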
It depends. Are your reads also writes (e.g. simulating a job queue with a "claimed" state update or a "select for update ... skip locked")? If not, 500m should be no sweat, even on crappy hardware with minimal tuning. If so, things get a little trickier, and you may have to introduce other tools or a clustering solution to get that throughput with postgres.
My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.