Kafka for small message volumes is one of those distinct resume-padding architec...

hiAndrewQuinn · 2025-02-19T06:47:06 1739947626

Apt time to mention the classic "Command-line Tools can be 235x Faster than your Hadoop Cluster", for those who may have not yet read it.

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

kvakerok · 2025-02-19T01:50:09 1739929809

You haven't seen the worst of it. We had to implement a whole kafka module for a SCADA system because Target already had unrelated kafka infrastructure. Instead of REST API or anything else sane (which was available), ultra low volume messaging is now done by JSON objects wrapped in kafka. Peak incompetence.

Joel_Mckay · 2025-02-19T03:15:42 1739934942

We did something similar using RabbitMQ with bson over AMQP, and static message routing. Anecdotally, the design has been very reliable for over 6 years with very little maintenance on that part of the system, handles high-latency connection outage reconciliation, and new instances are cycled into service all the time.

Mostly people that ruminate on naive choices like REST/HTTP2/MQTT will have zero clue how the problems of multiple distributed telemetry sources scale. These kids are generally at another firm by the time their designs hit the service capacity of a few hundred concurrent streams per node, and their fragile reverse-proxy load-balancer CISCO rhetoric starts to catch fire.

Note, I've seen AMQP nodes hit well over 14000 concurrent users per IP without issue, as RabbitMQ/OTP acts like a traffic shock-absorber at the cost of latency. Some engineers get pissy when they can't hammer these systems back into the monad laden state-machines they were trained on, but those people tend to get fired eventually.

Note SCADA systems were mostly designed by engineers, and are about as robust as a vehicular bridge built by a JavaScript programmer.

Anecdotally, I think of Java as being a deprecated student language (one reason to avoid Kafka in new stacks), but it is still a solid choice in many use-cases. Sounds like you might be too smart to work with any team. =3

vips7L · 2025-02-19T15:39:40 1739979580

> Anecdotally, I think of Java as being a deprecated student language (one reason to avoid Kafka in new stacks), but it is still a solid choice in many use-cases. Sounds like you might be too smart to work with any team. =3

Honestly from reading this it seems like you’re the one who is too smart to work with any team.

javaunsafe2019 · 2025-02-19T06:59:17 1739948357

I don’t know why but I could wear you are German (and old)

Joel_Mckay · 2025-02-19T08:16:29 1739952989

I like working with folks that know a good pint, and value workmanship.

If you are inferring someone writing software for several decades might share, than one might want to at least reconsider civility over ones ego. Best of luck =3

javaunsafe2019 · 2025-02-19T10:44:16 1739961856

Neither being German or old are bad values from my point of view. But you tried a bit hard to flex with your past experiences tbh...

Joel_Mckay · 2025-02-19T19:56:02 1739994962

Many NDA do not really ever expire on some projects, most work is super boring, and recovering dysfunctional architectures with a well known piece of free community software is hardly grandstanding.

"It works! so don't worry about spending a day or two exploring..." should be the takeaway insight about Erlang/RabbitMQ. Have a wonderful day. =3

kvakerok · 2025-02-21T08:15:43 1740125743

Coincidentally another SCADA module we made was handling bi-directional RabbitMQ comms. Not everyone is a one-trick pony :)

Joel_Mckay · 2025-02-21T09:30:56 1740130256

With legacy equipment there is usually no such thing as a homogeneous ecosystem, as vendor industrial parts EOL all the time. Certainly room in the markets for better options with open protocols. =3

kevinherron · 2025-02-19T02:14:54 1739931294

> for a SCADA system

for Ignition?

SteveNuts · 2025-02-19T02:42:30 1739932950

Probably Wonderware

kvakerok · 2025-02-20T19:27:07 1740079627

atmosx · 2025-02-19T07:30:11 1739950211

Oh no!

Let’s be real: teams come to the infra team asking for a queue system. They give their requirements, and you—like a responsible engineer—suggest a more capable queue to handle their needs more efficiently. But no, they want Kafka. Kafka, Kafka, Kafka. Fine. You (meaning an entire team) set up Kafka clusters across three environments, define SLIs, enforce SLOs, make sure everything is production-grade.

Then you look at the actual traffic: 300kb/s in production. And right next to it? A RabbitMQ instance happily chugging along at 200kb/s.

You sit there, questioning every decision that led you to this moment. But infra isn’t the decision-maker. Sometimes, adding unnecessary complexity just makes everyone happier. And no, it’s not just resume-padding… probably.

InDubioProRubio · 2025-02-19T09:09:58 1739956198

Then all the guys who requested that stuff quit

dude187 · 2025-02-19T21:32:50 1740000770

Well duh! They got a kafkaesque promotion using their upgraded resume!

kyawzazaw · 2025-02-19T08:04:05 1739952245

We have way way way less than that in my team. But they don't support anything else.

FearNotDaniel · 2025-02-19T06:34:08 1739946848

That’s almost certainly true, but at least part of the problem (not just Kafka but RDD tech in general) is that project home pages, comments like this and “Learn X in 24 hours” books/courses rarely spell out how to clearly determine if you have an appropriate use case at an appropriate scale. “Use this because all the cool kids are using it” affects non-tech managers and investors just as much as developers with no architectural nous, and everyone with a SQL connection and an API can believe they have “big data” if they don’t have a clear definition of what big data actually is.

tstrimple · 2025-02-19T01:33:48 1739928828

Or, as mentioned in the article, you've already got Kafka in place handling a lot of other things but need a small queue as well and were hoping to avoid adding a new technology stack into the mix.

evantbyrne · 2025-02-19T02:29:07 1739932147

It really is a red flag dependency. Some orgs need it... Everyone else is just blowing out their development and infrastructure budgets.

bassp · 2025-02-19T03:42:56 1739936576

I use Kafka for a low-message-volume use case because it lets my downstream consumers replay messages… but yeah in most cases, it’s over kill

ofrzeta · 2025-02-19T05:44:06 1739943846

That was also a use case for me. However at some point I replaced Kafka with Redpanda.

drinker · 2025-02-20T02:08:17 1740017297

Isn't redpanda built for the same scale requirements as Kafka?

rockwotj · 2025-02-20T03:01:12 1740020472

Redpanda is much more lean and scales much better for low latency use cases. It does a bunch of kernel bypass and zero copy mechanisms to deliver low latency. Being in C++ means it can fit into much smaller footprints than Apache Kafka for a similar workload

drinker · 2025-02-20T04:09:18 1740024558

Those are all good points and pros for redpanda vs Kafka but my question stills stands. Isn't redpanda designed for high-volume scale similar to the use cases for Kafka rather than the low volume workloads talked about in the article?

rockwotj · 2025-02-20T05:42:06 1740030126

When the founder started it was designed to be two things:

* easy to use * more efficient and lower latency than the big resources needed for Kafka

The efficiency really matters at scale and low latency yes but the simplicity of deployment and use is also a huge win.

munksbeer · 2025-02-20T11:21:08 1740050468

In kafka, if you require the highest durability for messages, you configure multiple nodes on different hosts, and probably data centres, and you require acks=all. I'd say this is the thing that pushes latency up, rather than the code execution of kafka itself.

How does redpanda compare under those constraints?

rockwotj · 2025-02-21T01:33:10 1740101590

Oh if you care about durability on Kafka vs Redpanda, see https://www.redpanda.com/blog/why-fsync-is-needed-for-data-s..., acks=all does not fsync (by default before acknowledging the write), so it's still not safe. We use raft for the data path, a proven replication protocol (not the custom ISR protocol) and fsync by default for safety (although if you're good with relaxed durability like in Kafka you can enable that too: https://www.redpanda.com/blog/write-caching-performance-benc...).

As for Redpanda vs Kafka in multi AZ setups and latency, the big win in Redpanda is tail latencies are kept low (we have a variety of techniques to do this). Here's some numbers here: https://www.redpanda.com/blog/kafka-kraft-vs-redpanda-perfor...

Multi AZ latency is mostly single digit millisecond (ref: https://www.bitsand.cloud/posts/cross-az-latencies/) and the JVM can easily take just as long during GC, which can drive up those tail latencies.

enether · 2025-02-22T18:32:29 1740249149

It's pretty safe. Kafka replicates to 3 nodes (no fsync) before the request is completed. What are the odds of all 3 nodes (running in different data centers) failing at the same time?

rockwotj · 2025-02-23T04:49:54 1740286194

Generally if you care about safety then “pretty safe” doesn’t cut it.

enether · 2025-02-23T10:01:26 1740304886

It's just my polite way of saying it's safe enough for most use cases and that you're wrong.

The fsync thing is complete FUD by RedPanda. They later introduce write caching[1] and call it an innovation[2]. I notice you also work for them.

Nevertheless, those that are super concerned with safety usually run with an RF of 5 (e.g banks). And you can configure Kafka to fsync as often as you want[3]

1 - https://www.redpanda.com/blog/write-caching-performance-benc... 2 - https://www.linkedin.com/posts/timfox_oh-you-cant-make-this-... 3 - https://kafka.apache.org/documentation/#brokerconfigs_log.fl...

deniscoady · 2025-02-23T20:54:48 1740344088

Disclaimer: I currently work for Redpanda.

It's just my polite way of saying it's safe enough for most use cases and that you're wrong.

Low volume data can be some of the most valuable data on the planet. Think SEC reporting (EDGAR), law changes (Federal Register), court judgements (PACER), new cybersecurity vulnerabilities (CVEs), etc. Missing one record can be detrimental if its the one record that matters.

Does everyone need durability by default? Probably not, but Redpanda users get it for free because there is a product philosophy of default-safe behavior that aligns with user expectations - most folks don't even know how this stuff works, why not protect them when possible?

The fsync thing is complete FUD by RedPanda.

You want durability? Pay the `fsync()` cost. Otherwise recognize that acknowledgement and durability are decoupled and that the data is sitting in unsafe volatile memory for a bit.

They later introduce write caching[1] and call it an innovation[2].

There are legitimate cases where customers don't care about durability and want the fastest possible system. We heard from these folks and responded with a feature they can selectively opt-in for that behavior _knowing the risks_. Again the idea is to be safer by default, and allow folks to opt-in to more risky behaviors.

those that are super concerned with safety usually run with an RF of 5 (e.g banks)

Going above RF=3 does not guarantee "more nines" since you need more independent server racks, independent power supplies or UPSs, etc, otherwise you're just pigeonholing yourself. This greatly drives up costs. Disks and durability is just cheaper and simpler. Worst case you pull the drives and pull the data off them, not fun and not easy, but possible unlike in-memory copies.

And you can configure Kafka to fsync as often as you want[3]

Absolutely! But nobody changes the default which is the issue - expectations of new users are not aligned with actual behavior. Same thing happened during the early MongoDB days. Either there needs to be better documentation/education to have people understand what the durability guarantees actually are, or change the defaults.

enether · 2025-02-23T21:59:29 1740347969

I agree that data can be valuable and even one record loss can be catastrophic.

I agree that there needs to be better documentation.

I just don't agree that losing 3 replicas each living in a different DC at once is a realistic concern. The ones that would truly be concerned about this issue would do one of two things - run RF>3 (yes, it costs more) or set up some disaster recovery strategy (e.g run in multiple regions, yes that costs more.)

Because truth be told - losing 3 AZs at once is a disaster. And even if you durably persisted to disk - all 3 disks may have become corrupt anyway.

agallego · 2025-02-23T14:04:01 1740319441

It is not FUD. It is deterministic. Reproducible on your laptop. Out of all the banks I work with only a handful of use cases use rf=5. Defaults matter, because most people do not change them.

enether · 2025-02-23T22:00:17 1740348017

Defauls do matter, in principle. But I think this particular risk is overblown, see my other reply for my thoughts

cheema33 · 2025-02-19T15:35:19 1739979319

I needed to synchronize some tables between MS SQL Server and PostgreSQL. In the future we will need to add ClickHouse database to the mix. When I last looked, the recommended way to do this was to use Debezium w/Kafka. So that is why we use it. Data volume is low.

If anybody knows of a simpler way to accomplish this, please do let me know.

lijok · 2025-02-19T23:48:17 1740008897

We used a binlog reader library for Python, wrapped it in some 50 loc of rudimentary integration code and hosted it on some container somewhere.

Data volume was low though.

wink · 2025-02-19T12:21:12 1739967672

Don't disagree on the resume-padding but only taking into account message volume and not the other features is also not the best way to look at it.

Have I used (not necessarily decided on) Kafka in every single company/project for the last 8-9 years? Yes.

Was it the optimal choice for all of those? No.

Was it downright wrong or just added for weird reasons? Also no, not even a single time - it's just kinda ubiquitous.

eBombzor · 2025-02-19T05:18:20 1739942300

How are we defining small message volumes?

gottorf · 2025-02-19T03:12:30 1739934750

Resume-driven development. Common antipattern.