One common anti-pattern is that "event driven" can lead to a push architecture that works on paper, but is unreliable in practice.
The idea looks deceptively simple. If you need something to happen, you just broadcast the right event, someone else acts on it and everything is fine. You code it, it works in the demo, and you deploy to production.
And then you run across network failures. Missed events. Doubled events. Events that arrived while a database was down. Events that were missed due to a minor software bug. And so on. You layer on fixes, and the system develops more complex edge cases, and more subtle bugs.
Then you rewrite from a push based system to a pull based system, and all of your complex edge cases disappear. :-)
I worked with a team that were making quite a complex multi-part network manager some years ago. Development was going to be superfast because the lead developer had come up with this new framework (red flag! red flag!) based on publishing events and then subscribers picking them up and acting on them.
The problems began when the subsystem running user interaction started to need replies to specific things so it could tell what was going on with specific user interactions. Then message timeouts were needed so that pieces could react if their response didn't manifest in time.
What this framework unintentionally did, I cynically worked out after a while, was implement UDP multicast over TCP.
What I was watching happen, each time messaging problems came up, was the reimplementation of TCP on top of that. That's about when I bailed!
... damn. I'm a fan of message-driven systems (in theory, never actually worked on them in practice), but I will admit that "UDP multicast over TCP" sounds like a very appropriate description.
Well if it's any consolation I don't think it's inevitable, just something to watch out for. In this case I think the problem and solution domains were poorly mapped.
So what happens when your push based system has a network outage? A retry? Stall the system? Reboot? Queue the operation? What?
I think doing "event driven" without an event log for critical system functionality is probably the anti-pattern you are describing here. With my event log, the worst-case scenario is that I need to reprocess all the messages in my system to recover from all of the above.
That said, most of the things you mentioned can be mitigated by a decent event bus with a competing consumer.
I.e., isolate your critical components (the write model) into small, succinct pieces that use highly reliable message delivery techniques. Once committed, broadcast to other queues where delivery isn't that critical.
An event based system with a reliable event log becomes something that you can do a resync on. Which makes it into a pull based system in time of need.
Alternately you can have some sort of confirmation protocol. It is easy to go too far, but the kind of confirmation/resend logic that turns UDP into TCP has very much demonstrated its value in practice.
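A tiny sketch of what I mean by confirmation/resend (the names and the timeout are made up, not from any particular framework):

    import time
    import uuid

    # Sender keeps every message until it is acknowledged, and resends
    # anything unacknowledged after a timeout -- the same trick TCP uses.
    class ConfirmedSender:
        def __init__(self, transport, ack_timeout=5.0):
            self.transport = transport      # anything with send(msg_id, payload)
            self.ack_timeout = ack_timeout
            self.unacked = {}               # msg_id -> (payload, last_send_time)

        def send(self, payload):
            msg_id = str(uuid.uuid4())
            self.unacked[msg_id] = (payload, time.monotonic())
            self.transport.send(msg_id, payload)
            return msg_id

        def handle_ack(self, msg_id):
            self.unacked.pop(msg_id, None)  # receiver confirmed it; stop tracking

        def resend_overdue(self):
            # Call this periodically, e.g. from a timer.
            now = time.monotonic()
            for msg_id, (payload, sent_at) in list(self.unacked.items()):
                if now - sent_at > self.ack_timeout:
                    self.unacked[msg_id] = (payload, now)
                    self.transport.send(msg_id, payload)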
It's fine to make a push system, but you're most likely to still need pull mechanics anyway. If my component gets a push from you, but I can see that I must have missed some events, I need to request them. You still need a heartbeat mechanism, so you know you are getting all the events.
Of course, if the events become irrelevant after the fact you don't need this, but only very few systems are like that.
The first concrete thing I learnt is this: implement pull first; it works 100% of the time, but may be inefficient with regard to time. Then implement push; it works 99% of the time but is much faster. But always have both running.
I'm totally in agreement with this. Both processes should also be idempotent (you should be able to pull multiple times without side effects, and push and pull should be able to happen at the same time without side effects).
When everything is working well, the 'push' does all the work, and though the 'pull' runs every few minutes/seconds/whatever it never has anything to do.
This same thing applies to time-based events: your system should not assume that the process is always running, so if something needs to happen at exactly 9:00am (and it's not okay to just skip it if missed), it should be able to run anytime later with the same outcome as if it ran at 9:00am.
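A rough sketch of the shape I have in mind (the fetch/apply names are hypothetical; the details obviously differ per system):

    import time

    # Push for speed, pull for correctness, both funnelling into the same
    # idempotent apply().
    def apply(store, item):
        # Idempotent: applying the same item twice leaves the store unchanged.
        store[item["id"]] = item

    def on_push(store, item):
        # Fast path: a pushed event is applied the moment it arrives.
        apply(store, item)

    def reconcile(store, source):
        # Slow path: periodically pull everything we might have missed.
        for item in source.fetch_all():
            apply(store, item)

    def run_reconciler(store, source, interval_seconds=60):
        while True:
            reconcile(store, source)   # when push is healthy, this finds nothing to do
            time.sleep(interval_seconds)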
When I was first getting into this, it helped me to understand that events must follow one of these delivery semantics: at-most-once, at-least-once, or exactly-once -- with the trick that exactly-once is not strictly possible [1].
There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery
-Mathias Verraes [2]
> And then you run across network failures. Missed events. Doubled events. Events that arrived while a database was down. Events that were missed due to a minor software bug. And so on. You layer on fixes, and the system develops more complex edge cases, and more subtle bugs.
Sure...if all the ilities weren't accounted for in the original design.
Async or out-of-process event systems have increased points of failure. If those aren't accounted for, then yes -- problems occur.
> Sure...if all the ilities weren't accounted for in the original design.
Virtually nobody is able to account for everything in the original design. Convincing yourself that you got it right is easy. Actually getting it right is HARD. It is possible; ZooKeeper, for example, seems to have managed it. But your odds of success are very, very low.
See https://aphyr.com/tags/jepsen for many, many examples of competent people who thought they had it right, being proven wrong. Over and over again.
Failure is not only expected in distributed systems, it is mathematically impossible to avoid. The best that you can do is document your failure modes and what guarantees you will provide despite that.
Unfortunately, as aphyr proves over and over again, the guarantees we are given for how distributed software is supposed to work don't hold. Over and over again, across virtually every well-known piece of distributed software. And my experience is that in-house software is at least an order of magnitude worse.
However you think your software will work, it doesn't.
Sounds great. But... how do you get from "a push architecture", if this is the mental model you start with, because it's obvious and natural, to a "pull based system"?
Is there any recipe for this in general? Once you start with a model of a push architecture... don't you kind of have it as the only sane model?
...How do you ask for something that may or may not exist yet? Maybe some data where you don't want to care whether it's about the past (it already exists), about the present (it's in the process of being computed), or about the future (it doesn't exist yet, and not even the request for its computation has been fired) -- yet you just want to write the same code that says "do this with this kind of data, always, whenever, wherever", and you don't care that the data doesn't exist yet. Or how do you ask "how many times has X happened since the last time I asked?"
You'll end up implementing some kind of polling loop, and sooner or later, voilà, you've reimplemented an event loop, and now you have a badly, ad-hoc implemented event driven system anyway.
The only way to handle a "naturally push based system" is to accept that this is the natural way for it to be, and find a declarative way to express it as a rules based system instead of tangles of imperative event handlers. But this is really hard! So quite often you settle for the "push based system" or "event driven" system instead...
Have you ever queried a database? That's how. You just ask. If you don't get data, you deal with it.
Yes, there is a polling loop. Is this bad? Nope. You just have to understand the system you're building. Will you get data with mostly consistent timing? Polling works. Is it sporadic? Try event driven. One size does not fit all. Event driven should not be used for everything, and it's not in any way more natural than polling.
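And the polling loop really is as boring as it sounds; a sketch with made-up names:

    import time

    def poll_forever(fetch_new_rows, handle, interval_seconds=10):
        # Ask, handle whatever came back, sleep, repeat.
        last_seen_id = 0
        while True:
            rows = fetch_new_rows(after_id=last_seen_id)   # e.g. SELECT ... WHERE id > ?
            for row in rows:
                handle(row)
                last_seen_id = row["id"]
            time.sleep(interval_seconds)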
Definitely varies by the system, but, in my small-business internal-software experience, 99% of the time polling is good enough. Something feels innately ugly about polling, but it's simple, has failure modes that are easy to understand, and gets the job done.
In a pure push, how do you ask for data to be resent?
There's trade-offs either way. No way is "best"; it all depends on what you're doing. I communicate with equipment over TCP, that I have to poll. There's _no_ mechanism for push. You just work with what you have and do the very best you can.
Our IT architects love event driven stuff but it always ends in hellish systems where you have to add retries on the publisher side and pull on the client side.
I've used events successfully for an important fraud detection system. There are a set of known events. A fraud model could hook into the event chain at any point. It would then produce an alert event with its findings. Downstream another handler listened for it. Coupled with RabbitMQ, the system is a pleasure to expand. All the fraud models are micro services. Just a start up script and boom, hooked in.
We planned for duplication of messages. Operations are idempotent. If the message already ran, we could rerun the calculation safely. With Rabbit the only duplicate messages were the ones humans sent to rerun a model that broke.
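To give a feel for how little wiring each model needs, here's roughly what such a start-up script looks like with the pika client (the exchange, queue and routing-key names, and the toy scoring logic, are illustrative, not our actual ones):

    import json
    import pika  # RabbitMQ client library

    def score_transaction(txn):
        # Stand-in for the real fraud model.
        return 1.0 if txn.get("amount", 0) > 10_000 else 0.0

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="events", exchange_type="topic", durable=True)
    channel.queue_declare(queue="velocity-model", durable=True)
    channel.queue_bind(exchange="events", queue="velocity-model", routing_key="transaction.*")

    def on_transaction(ch, method, properties, body):
        txn = json.loads(body)
        score = score_transaction(txn)
        if score > 0.9:
            # Publish an alert event for downstream handlers to pick up.
            ch.basic_publish(exchange="events", routing_key="alert.created",
                             body=json.dumps({"txn_id": txn["id"], "score": score}))
        ch.basic_ack(delivery_tag=method.delivery_tag)   # ack only after we're done

    channel.basic_consume(queue="velocity-model", on_message_callback=on_transaction)
    channel.start_consuming()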
This is important. There is no "exactly once" system; you can have at-most-once, where things get occasionally lost, or at-least-once, where you resend things that you can't be sure have been received. The latter loses less data but needs methods for detecting or otherwise ignoring duplicates.
NFS goes for "idempotent operations", SMTP goes for message-IDs that can be used for deduplication.
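The dedup half of that can be as small as this sketch (assuming every message carries an ID, in the spirit of SMTP's Message-ID; in real life the seen set lives somewhere durable):

    def deduplicating(handle, seen_ids=None):
        # Wraps an at-least-once handler so repeats of the same message ID
        # are ignored instead of reprocessed.
        seen_ids = seen_ids if seen_ids is not None else set()

        def on_message(message):
            if message["id"] in seen_ids:
                return                 # duplicate: already processed
            handle(message)
            seen_ids.add(message["id"])

        return on_message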
Sounds like a system I built, but what I didn't account for is how often the RabbitMQ cluster would get tanked by my coworkers. The system fails when messages start getting dropped on the floor due to server reboots. Still haven't come up with a great solution, but what we have is "good enough" for that project's requirements. It was a great learning experience though, an uncomfortable reminder of the difference between reliability and durability, and it definitely helped inform subsequent architectural decisions.
I've written a system like that and it was also composed of microservices. I had one which was running on my computer for two months without a restart and it did not fail once. You can build robust systems on an AMQP backed MQ if you take care.
Event-Driven software comes up on many of my projects. What I find works well is using command-query separation (CQS, CQRS) to start because CQS is a lever into two complementary ideas: backend implementation and continuous improvement.
The backend implementation ideas involve REST, caching, replication, eventual consistency, CAP, and especially PACELC.
The continuous improvement ideas involve upgrading existing software projects from imperative styles to functional styles, phasing in publish-subscribe modules, growing codebases to be more event-driven, and planning for evolutionary architecture.
In practice, I see these areas need management help, such as funding and time for training, and help for teams creating infrastructure as code (IaC) to deploy many event-oriented modules as well as event monitoring and management.
I had not heard of PACELC and was excited to hear something new, only to see it as a very natural extension of CAP. It reminds me of how, back in math class, you would learn a new theorem that differed from another only because it singled out a particular case, like 1 or 0 :)
I hadn't heard of either, but they both make perfect sense - so I for one say thanks for that great post. :-) I did study up on CQRS/ES last year and found it very exciting.
My first contact with event driven programming was coding for the original Windows C API, back in 1990. Nowadays we call that API Win32, but then it may have been Win16. Switching from the traditional imperative style to callbacks was a big deal: don't call us, we'll call you! [1] and [2] show the difference in coding style. Later in the 90s, C++ wrappers like MFC and XVT simplified things, but the paradigm was still an inversion of control. And that's what event driven means to me.
Unless you do async operations, UI framework programming with callbacks doesn't really approach the full horror that can exist in some styles of event-based programming.
UI events are typically of two kinds: user performed an action and you need to update the model / execute a command, or the view is asking for information and you need to translate it from the model. Both are fairly encapsulated; the larger operation doesn't extend beyond the scope of the callback, and if it does (e.g. showing a modal dialog), you can use a nested event loop to take care of event dispatching while showing the modal.
The events in a UI are also naturally scoped to the screen being shown. When the screen goes away, you can rely on those events not being dispatched any more. Events get to party on a shared state (the model, perhaps attributes of the view) but scoped within the view being displayed. It's all pretty flat. You also don't need to worry much about concurrency, unless doing async.
I'm not sure... sounds like CQRS (though it's a term I don't think I've ever seen before...) is independent. Like, you can have one data structure for read/write, and you just don't get public access to the write interface directly, only through functions that add to the event list and then do the event. Along the lines of something vaguely like this, perhaps:
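    # (A rough sketch, with made-up names: the only write path is through
    #  methods that append an event to the list and then apply it.)
    class Account:
        def __init__(self):
            self.balance = 0      # readable directly
            self.events = []      # the event list

        def _apply(self, event):
            if event["type"] == "deposited":
                self.balance += event["amount"]
            elif event["type"] == "withdrawn":
                self.balance -= event["amount"]

        def deposit(self, amount):
            event = {"type": "deposited", "amount": amount}
            self.events.append(event)   # add to the event list...
            self._apply(event)          # ...then do the event

        def withdraw(self, amount):
            event = {"type": "withdrawn", "amount": amount}
            self.events.append(event)
            self._apply(event)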
I think that most of the time event-sourcing will impact your model/business-logic layer, because you often want to capture the intent of changes, rather than calculating a naive combined delta of all changes that occurred in memory.
If so, that means that instead of a CRUD interface like:
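    # (Illustrative names only.) A generic CRUD-style write: all we ever
    # record is "the order changed", not why.
    orders = {}

    def update_order(order_id, new_fields):
        orders.setdefault(order_id, {"items": []}).update(new_fields)

...you would instead expose operations that name the intent and record it as an event, something like:

    event_log = []

    def apply_event(event):
        order = orders.setdefault(event["order_id"], {"items": []})
        if event["type"] == "ItemAddedToOrder":
            order["items"].append({"sku": event["sku"], "quantity": event["quantity"]})
        elif event["type"] == "ShippingAddressChanged":
            order["address"] = event["address"]

    def add_item_to_order(order_id, sku, quantity):
        event = {"type": "ItemAddedToOrder", "order_id": order_id,
                 "sku": sku, "quantity": quantity}
        event_log.append(event)     # the intent, captured explicitly
        apply_event(event)

    def change_shipping_address(order_id, address):
        event = {"type": "ShippingAddressChanged",
                 "order_id": order_id, "address": address}
        event_log.append(event)
        apply_event(event)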
That way you capture each change-event at the right granularity with the necessary data. Finally, CQRS comes into play because you want to do something with that rich event data, such as eagerly populating a table:
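    import sqlite3

    # (Sketch; table name and shape are illustrative.) A read-side projection
    # consuming the same events the write side emitted above.
    read_db = sqlite3.connect(":memory:")
    read_db.execute("CREATE TABLE order_items (order_id TEXT, sku TEXT, quantity INTEGER)")

    def project(event):
        if event["type"] == "ItemAddedToOrder":
            read_db.execute("INSERT INTO order_items VALUES (?, ?, ?)",
                            (event["order_id"], event["sku"], event["quantity"]))
            read_db.commit()

    for event in event_log:     # or subscribe and project as each event arrives
        project(event)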
Isn't it the eventual consistency model that implies CQRS? If you have synchronous event-sourcing you can have a single model/service for reads and writes.
Blatant plug: Urbit (urbit.org) is a general-purpose OS built on the event-sourced model. Urbit's state is a small, frozen function of its event log.
I don't know of any other generalized event-sourced OSes. But almost every principle Fowler describes in special-purpose event-sourced designs also seems to apply in the general-purpose case.
For example, the connection between CQRS and event-sourcing seems very deep and natural. The connection to patches in a revision-control system is also completely apropos.
When Fowler starts pointing at advanced problems like "we have to figure out how to deal with changes in the schema of events over time," you start to see a clear path from special-purpose event-sourced apps to general-purpose event-sourced system software.
Broadly speaking, one category of answer is: abstract the semantics of your app, until it becomes a general-purpose interpreter whose source code is delivered and updated in the event stream. Then the new schema is just a source patch. And your app is not just an app, but actually sort of an OS.
> Blatant plug: Urbit (urbit.org) is a general-purpose OS built on the event-sourced model.
I've looked through its documentation, and while I couldn't make heads or tails of it, Urbit seems to be everything but an operating system. Especially since it needs a Unix system to run.
Urbit is an OS in that it defines the complete lifecycle of a general-purpose computer. It's not an OS in that it's the lowest layer above bare metal.
Sorry you had a bad experience with the docs. Urbit is actually much stupider than it looks.
Unfortunately, premature documentation is the second root of all evil, so some of the vagueness you sense is due to immature code. It's always a mistake to document a system into existence -- that is the path of vaporware. Better to run and have weak docs, than have good docs but not run.
I agree with you on the documentation thing, but what we see (and I experience also) is that most of the time, when things get serious enough (production), there isn't any time available to document the core of the software properly, let alone document it fully... "lack of time" in this context meaning either (1) management won't let you stop implementing stuff they need "for yesterday" (everything is critical priority) and document for 10-20 hours, or (2) your self-run project needs more features and it's always cooler to study/code than to document puzzles you have already solved.
I'm not criticizing; I am admitting that I am part of the problem and lost, and I'd like to know how people handle this in their lives. I want to document more, but there's always this rush to implement, implement, implement. Then, during some catastrophic breakdown of something, management wants everything solved quickly and expects you to remember instantly everything you did 2+ years ago :-)
PS: ... and they'll get mad if you say "you didn't let me document this damn thing"
It's very easy to create an event-driven program that is effectively a larger program implemented in terms of "come-from". The decoupling provided by eventing is often illusory; if there's a higher-level semantic that's supposed to be implemented in terms of the outcome of coordinated event handlers, then the dependencies still exist - they're just implicit.
If event-handlers are genuinely decoupled from one another, and there isn't a higher-level state machine that's being driven by events, then it can be an excellent way to structure logic.
Events-as-deltas that are durably stored and can be reliably replayed is another fine way to go for a system with checkpoints, auditing, history, rollback, time-series reporting and similar requirements. But like the article says, code will be much simpler if it deals with an eternal present and the calculation and application of the deltas is centralized in code that doesn't change often.
There's an isomorphism between building a bigger program out of event handlers, and distributed programming. Distributed programming is known to be difficult, and doing it at the message send level is IMO too low a level - one should build more expressive and composable primitives and work at a higher level. That's if it's warranted at all.
"The decoupling provided by eventing is often illusory; if there's a higher-level semantic that's supposed to be implemented in terms of the outcome of coordinated event handlers, then the dependencies still exist - they're just implicit." - I've experienced the same myself and converted my "events" into coupled calls. More expressive.
I find the topic of event sourcing / CQRS a little frustrating. Somewhat tangential, but I've had slightly similar experiences with rxjs as well.
For both, whenever I read about em or I hear someone discussing em, it makes so much sense in my head! But later when I try to execute on the ideas, I can't make it "come together". It's like some puzzle piece refuses to click into place.
In the cases when I've tried hacking something together, I've always ended up with something that seemed more brittle than if I'd just gone with a safe relational database. With that said, I'm also completely open to the possibility that I've just been trying to apply the pattern in cases where it's not a good fit.
I'd seriously love to poke around in a real-world application showcasing CQRS. I don't expect something perfect, but getting a chance to look at what tradeoffs and concessions a bunch of engineers made would surely be incredibly insightful.
> In the cases when I've tried hacking something together, I've always ended up with something that seemed more brittle than if I'd just gone with a safe relational database. With that said, I'm also completely open to the possibility that I've just been trying to apply the pattern in cases where it's not a good fit.
There is a 95% chance that event sourcing and CQRS were entirely the wrong patterns for your use case. If you're not dealing with a very complex domain (like, say, automatically orchestrating a global logistics or drug discovery pipeline with audit logs for regulatory compliance), that chance becomes 99.9%. If you're working on something that only you or a few people will ever work on, it's 110%. Just the tooling for a reliable event sourcing system would be a herculean task for a single developer and that's without even writing a single line of business logic!
I don't know of any open source real world examples of CQRS off the top of my head and the only successful ES/CQRS systems I've seen in the wild were projects that involved hundreds of domain experts and programmers, dealing with extremely complicated fields like biotechnology and electrical engineering where nonprogrammers spent about as much time looking at code as the programmers. Any good ES/CQRS project will contain a lot of the business's secret sauce so it's not something Google or Facebook would open source.
I experienced something very similar with rxjs (and before that, just Rx in .NET). I've studied Rx for a couple of years, and I'd been developing event-sourcing ideas since before it became a big thing, back around 2000, and I've done my fair share of head scratching in these areas. (I arrived at the event-sourcing stuff via a very abstract, conceptual route, largely investigating physics/information theory...) It has since morphed into its present-day incarnation in ibGib (github.com/ibgib/ibgib and www.ibgib.com).
I don't know how real-world it is, since I'm still pretty much the only one who is using it, but you may enjoy checking it out for a slightly different view on ES.
It doesn't strictly fit the definition of CQRS at the moment. This is because I see CQRS as an implementation detail orthogonal to ES, but one which dovetails nicely, as they both work well with distributed systems. ES has immutable events (monotonically increasing), and CQRS implies a distributed system and allows for more efficient retrieval, leveraging ES's immutability.
But ibGib does provide what I see as an augmentation, or maybe an evolution, of ES. It has shifted over the years away from ontological "events" and is now more biological in approach. All of the data is in terms of ibGib, which is a flat database structure comprising `ib`, `data`, `rel8ns` and a `gib` which is a hash of the other three fields. The ES "event" analog -- specifically, what is replayed to hydrate domain aggregates -- can be seen in a specific `rel8n` called `dna`. This contains the code necessary to "rebuild" that immutable ibGib frame. This came largely from my dislike of needing the entire river of the source of truth as the foundation. With the DNA structure, it actually keeps a dependency graph, which ends up being a projection of the "entire" source of truth. This allows for nodes to "replicate" ("reproduce" would be a better term) as projections of entire graphs. These kinds of deep aspects are really neat!
But anyway, I digress. The server side is written in Elixir, which runs on top of Erlang's vm (the BEAM(!)). And you can visually interact with the structure, as the web app runs on d3. I have some demos on YouTube on the website if you just want to look at it, though I don't go into the DNA/ES aspect of it.
Anyway, I had to say something, since ES is a topic near and dear and so closely related to ibGib's current architecture. My design decisions could very well help you & others understand ES more deeply. Plus I've been enjoying getting to a point where ibGib is actually usable (and useful).
NOTE: I'm somewhat arbitrarily responding to you in this thread, as I used to play halo with an AceOfHearts.
I'd like to see a simple e-commerce cart written using all of the patterns. That's really the minimal "hello world" for comparing the patterns he discusses.
A good summary. However, too many choose to view it only as Event-Sourcing/CQRS, a silver bullet applicable to a minority of applications. It is one particularly attractive in Microsoft projects, since it is the only alternative enterprise architecture "practice" they propose to data modeling. Object domain modeling continues to be anathema, because, I think, of a wizard- and code-generation-driven historical perspective.
In commercial systems, transactions as actions and events, recorded for business or legal reasons, are central. That includes significant "events" like purchases, sales, reservations etc. In that sense, they all are event driven.
Event-Driven as an architecture that is based on recording changes to a baseline state, should be applied only where really suited.
I would highly recommend spending an evening reading through Martin Fowler's blog. He has a gift for breaking down concepts into very pragmatic and digestible explanations.