Extremely high-quality work here - I spend (what I think is) a significant amount of time trying to understand systems, in general and in detail, and this feels on first reading like something I'll be referring to repeatedly in future.
Already it suggests a surprising idea, which may or may not be fleshed out elsewhere, that a good way to understand an existing system is to try and solve the same problem without reference to how the existing system works:
> You will often expend more cycles understanding the existing design than you would solving the problem from first principles. ...
> Even good solutions can bias your thinking towards a particular part of the design space. ... A great time to look at other systems is after the Design phase, to see if you can map those solutions to your space. Even better, you can often reverse-engineer the details of solutions simply by understanding where they fit in your design space.
If it doesn't already exist somewhere else, we can call it Balakrishnan's Law - the best way to understand a problem is to solve it!
True, this is why I sometimes prefer not to look too closely at the state of the art when addressing a new system architectural problem.
By trying to solve it from first principles, and making some progress but most likely failing to solve it completely, I get a good sense of what the design forces and challenges are. Then, when I go back to study the state of the art, I'm much better placed to understand what I'm studying.
Another good thing about this approach is that I minimize pre-biasing my thinking with what is already out there. Then, when studying existing solutions afterward, I can more clearly see gaps in them that I can direct my efforts toward solving.
> Late-bind on designs. The goal of the design process is not to generate a single point solution, but to instead characterize the design space for a given problem: a single point should then fall naturally out of that space given the problem constraints.
There is also value in early binding combined with a willingness to iterate. There is a lot of knowledge to be gained by trying to do things rather than staying at the design level. I'm making this comment not to contradict the author but to stress that exploring the design space can also mean doing things for real. The risks of wrongly applied late binding are superficial designs and design paralysis.
- spend little time making a couple design decisions -- lots of handwaving and "we'll figure it out"
- bind these decisions, then write a tiny app to see what it looks like. What are the technical consequences? What are the business consequences (including risks)? How does this affect development?
- repeat 2-4x. The goal is to maximize speed of feedback, at the cost of minimal scope and low quality. That's fine.
- iterate more, expanding the scope and quality as needed
- you're done!
Use high-level diagrams to help focus the effort. This dramatically helps discussions with stakeholders and developers.
What you are describing is (to me) the pure essence of what it means to be an artist. There are a lot of days where I feel like software and systems engineering is more about creativity than it is math and science.
Try things, take risks, etc. The really amazing thing with software is that the cost of iteration is basically zero (+ your time). You don't even have to go buy new paints or brushes periodically. You can reset your digital canvas a billion times per day if you desire. You can even cheat and set waypoints in time that allow you to instantly teleport to any arbitrary moment with perfect recall. You can create infinite copies of your work at various stages. You can trivially blend your works together. There is no other media on earth that comes close to possessing these same attributes.
I find Developers like to have a specific Goal, then explore the space towards it. An Artist does the opposite: they make random marks and explore the marks that they "like".
Art iteration is also almost free. In fact, the "cost" of physical media (pencils and paper) helps the human brain understand and integrate what it's learning, whereas with digital it's harder for ideas to stick.
Developers could gain by doing the same thing as artists: do random things to gain knowledge, then create a Goal and work towards it.
This article is excellent. I did not expect that. I'm what you might call an experienced software engineer (late 40s), who has designed a number of large systems. I found myself nodding enthusiastically as I was reading.
Over the years I’ve found myself more and more interested in designing systems than in paying for existing system design mistakes - I want to help people learn to avoid them. (Coming at this from the software engineering side.)
To me, it’s clear that designing systems is the way to maximum ongoing profit. Yet no-one I’ve come across in the world of business (still clinging on to hope) seems interested in an intentional, on-going system design process - especially when it comes to the software part, and they’re often not very good at the non-software part either.
I find myself a little bit stuck doing hands-on software engineering for companies who’ve gotten themselves into a system design hole. Companies are willing to throw tremendous resources at paying for system design mistakes, but not at avoiding them, or correcting them.
I don’t really know how to profitably find my way upward, except for becoming a manager or a consultant.
Thanks for this article, I am a beginner in this space.
I think systems suffer from interaction-permutation complexity, and the approach of looking for a fundamental building block fails, because There Is No Primal Particle™. So we try to introduce a fundamental particle such as "everything is a function", "everything is a list", "everything is a type", "everything is a file". It doesn't work. We do it to keep the system in our heads, but as soon as the design hits a real interaction, we have to get detailed again.
The C++ spec is enormous and very detailed. So is Kubernetes.
How does multithreading interact with garbage collection? How does async Rust interact with the borrow checker? How does your type system interact with - everything else? How does DNS going down interact with the rest of your cluster? How does autoscaling fail if your Docker repository is down? Can you rebuild a machine if your debian packageserver is down? How does POSIX interact with security?
What's the building block between things itself?
I am currently trying to design tooling around state-space exploration and generating new types from interaction-combination information, by modelling interactions directly. And the fixpoint of interactions between interactions. (I am inspired by the little I understand of TLA+.)
My dream: the computer tells me which interactions I need to handle in order to cover all the cases I throw at it.
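The combinatorics behind the parent comment's point can be made concrete. A toy sketch (the component names below are hypothetical stand-ins, chosen to echo the examples above): even a handful of subsystems produces far more pairwise and three-way interactions than components, which is why no single "primal particle" abstraction survives contact with a real design.

```python
from itertools import combinations

# Hypothetical subsystems; the point is the count, not the names.
components = ["threads", "gc", "dns", "autoscaler",
              "registry", "posix", "security"]

# Number of k-way interactions you might have to reason about: C(n, k).
for k in (2, 3):
    interactions = list(combinations(components, k))
    print(f"{k}-way interactions among {len(components)} "
          f"components: {len(interactions)}")
# 2-way: 21, 3-way: 35 -- already 56 interaction surfaces for 7 parts.
```

This is also roughly what TLA+-style state-space exploration confronts: the states worth checking grow with the interactions, not with the component count.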
> Always talk about a second application. For each abstraction, the “app” is the layer above it. For example, a filesystem is an app for a block device; TCP is an app for IP. You should be able to describe the functionality of a layer without ever referring to the specifics of the app (e.g., you don’t need to know what a file is when talking about an SSD’s internals).
This is so true. It’s very common to see infrastructure being built in terms of the application that will run on it instead of designing the application to run on the infrastructure. A lot of brittle, unmodifiable messes are the result of this.
Part of the challenge is knowing how to slice the layers and when. Premature optimization is often crossing such layers too soon. Yet there are times when it becomes necessary for performance, such as Oracle bypassing FS to write to block devices or kernel patches for specialized networking cases.
I’ve had a lot of success building good abstractions with a version of this “second app” idea. The delta between designing for 1 app and designing for 2 apps is vast, but once you’ve designed for 2, you are generally very close to being able to support N apps. By the time you get to 3, you understand the problem domain very well and have a lot of confidence in your interfaces.
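A minimal sketch of the "second app" idea, with invented class names (this is not a real storage API): a block-device layer described only in terms of numbered blocks, with two different "apps" above it. Neither app requires the lower layer to know what a "file" or a "log record" is, which is exactly the test the quoted passage proposes.

```python
class BlockDevice:
    """The lower layer: fixed-size blocks addressed by number, nothing more."""
    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        assert len(data) <= self.block_size
        # Pad to block size; the device has no notion of record boundaries.
        self.blocks[n] = data.ljust(self.block_size, b"\x00")


class TinyKV:
    """App #1: a key-value store, one key per block (toy mapping)."""
    def __init__(self, dev):
        self.dev, self.index = dev, {}

    def put(self, key, value):
        n = self.index.setdefault(key, len(self.index))
        self.dev.write(n, value)

    def get(self, key):
        return self.dev.read(self.index[key]).rstrip(b"\x00")


class AppendLog:
    """App #2: an append-only log. Same device interface, different use."""
    def __init__(self, dev, start=100):
        self.dev, self.next = dev, start

    def append(self, record):
        self.dev.write(self.next, record)
        self.next += 1


dev = BlockDevice(256)
kv = TinyKV(dev)
kv.put(b"name", b"ada")
print(kv.get(b"name"))  # b'ada'
```

If `BlockDevice` had been designed only for `TinyKV`, it might have grown key-shaped methods; having `AppendLog` as the second app keeps the interface at the right level, and a third app would likely fit with no changes at all.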
I'm relatively new in this, and recently I've been reading and hearing a lot about how distributed systems are overused and a monolith can usually do a better job, etc. I'm working at a company serving millions of customers where distributed systems are utilized; most of my experience as a software engineer was built around this, to the point that I've never been around a codebase of 10-20k+ lines. I feel like I lack some modular-monolith coding skills.
My question is what kind of sources can I read to improve my understanding of monolith vs microservices, when to use each, and the tradeoff of preferring one over the other?
> hearing a lot about how distributed systems are overused and usually a monolith can do a better job etc
> improve my understanding of monolith vs microservices, when to use each, and the tradeoff of preferring one over the other
A lot of this will depend on where you're working. Where I'm at, the preference is for single tasks up to somewhere in the neighborhood of 200 gigs of RAM and commensurate CPU. Our individual servers have just stupid amounts of RAM and CPU on them, and our deployment stack has..... nontrivial amounts of overhead for.... reasons.
But if you're e.g. deploying on AWS, the price optimization point is gonna be different. And if you're deploying while working at Amazon it'll be yet a different tradeoff (cynically, having more to do with pager duty boundaries)
> New designs should be described in terms of the design space, so you can immediately convey their relative position compared to other point solutions. Expect a lot of statements of the form: “all solutions must do X”; “solution Y is just X with one change"
Except that as a developer, when I hear "solution Y is just X with one change", I know that this is BS and that it won't be just that one change, which will be annoying because I'll have to refactor later.
I am interpreting it as an n-dimensional graph (let's say 2D for simplicity), and the axes may be something like "speed" and "scale", or "cost" and "risk", or whatever your parameters happen to be. The solution is somewhere in the space defined by those axes, but you don't know where it is yet, and you don't want to lock yourself in to a single point too early. The axes themselves could be anything. You are choosing what to care about. Ideally, it should be informed by understanding your users, your stakeholders, your domain, your market, resources, available technology, divine inspiration, etc., but it's an important decision that is probably completely contextual and is part of the design process.
I could be wrong, this is, as noted, just an interpretation.
Interesting use of Raymond Carver’s short story title “What we talk about when we talk about love”, which was also the background play in Birdman. If you haven’t read any Carver, treat yoself.
I understand your comment is tongue-in-cheek, but there's indeed a luddite-like movement that hides itself behind microservices cliches but under the surface they deny the very reason of existence of distributed systems in particular and the system design field in general. To them, the work of putting together a web app is a solved problem consisting of a single process doing everything under the sun, and the only acceptable hint of a software architecture is breaking the app in modules.
With a lot of systems, the business system design is complex enough in the first place that if you add a distributed system design to it as well you just end up with a colossal mess. And worse still you can't even see a lot of the complexity easily as it's hidden in how the distributed system is configured.
A normal system design, derisively called a monolith by some, is much clearer and more explicit. It's less code. It's more reliable, less brittle. It comes with fewer footguns.
Distributed systems are not a better technology, like weaving looms were, they are simply an alternative design that is presently overused by incompetent software architects.
So it's not a luddite movement, it's an anti-complexity-for-the-sake-of-complexity movement. Try to not call us luddites again. Personally I view people that advocate for such designs by default as incompetent and that they should never be let anywhere near system design as they clearly don't understand the costs.
Simpler is almost always better, and there needs to be extremely good reasons to switch to incredibly costly distributed design patterns.
> With a lot of systems, the business system design is complex enough in the first place that if you add a distributed system design to it as well you just end up with a colossal mess.
The goals of system design are a) a system actually works, b) the system is not a colossal mess.
In a similar manner, the challenge of any engineering field is to not allow things to become overly complex.
> A normal system design, derisively called a monolith by some, is much clearer and explicit. It's less code. It's more reliable, less brittle. It comes with less footguns.
You're presuming that your average monolith is the result of a refined design. It is not. The main gripe that the luddite-like movement directs at system design in general is basically the reality that software projects actually need planning and a working software architecture, whereas in monoliths they can just pile stuff in without thought or criteria.
> Doesn't seem like you're actually disagreeing then, if systems design can apply just as well to a 'monolith' and it's often the better design.
"Can" and "do" are two different words.
"Monolith" is often a cop-out to hide amorphous big-ball-of-mud projects. In fact, the luddite-like movement's aversion to system design is rooted in the belief that it's not necessary, and their irrational aversion to microservices, distributed systems, and system design lies in the fact that, unlike monoliths, these are far less tolerant of hacking together big balls of mud and instead demand some degree of discipline to design a system and comply with the design.
> unlike monoliths, they are far less tolerant of hacking together big balls of mud and demand instead some degree of discipline to design a system and comply with the system design.
A demand that is often poorly met, leaving one with a distributed ball of mud that is more difficult to reason about and refactor.