In my experience, Google's deterministic subsetting usually ensures backends have even connection counts. There's one significant corner case though: a large backend service with a large number of small clients, such that client_task_count * subset_size < backend_task_count. Then it basically degrades to random selection. This can happen to infrastructure that supports all of Google's products, many of which are relatively tiny.
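For reference, the algorithm is roughly the one published in the SRE book: shuffle the backend list with a per-round seed, then hand each client one contiguous slice. A minimal Python sketch of that idea, not Google's actual code:

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    # Shuffle the backend list with a seed shared by a "round" of clients,
    # then give each client in the round its own contiguous slice.
    subset_count = len(backends) // subset_size   # slices per shuffle round
    round_id = client_id // subset_count          # which round this client belongs to
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)     # same seed => same order within a round
    subset_id = client_id % subset_count          # which slice within the round
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]
```

The corner case shows up when client_task_count < subset_count (i.e. client_task_count * subset_size < backend_task_count): the clients don't even fill out one shuffle round, so which backends get any connections at all is decided by the random shuffle, and the counts come out uneven.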
Seems like their "Continuous Ring Coordinates" section onward is designed to address this same problem. I got a bit lost following it though; I might need to read through it sometime when my kids aren't also vying for my attention...
Context: Earlier this year I explored using deterministic aperture for balancing writes/appends based on storage usage and implemented a visualization to help explain the algorithm to my colleagues.
Isn't the squid-like graph in the middle of the article caused simply by using tiny subsets over tiny populations? Would you really bother with subsetting if each participant had fewer than, say, 100 connections? Put another way, is the reduction in connections from 280k to 25k at the end of the article a thing worth achieving? Any single Linux process can deal with 280k connections, so having that be the aggregate number of connections in a distributed service architecture strikes me as a non-problem.
What if the backends announced their load via multicast? Then every client could know the load of every backend, and could continue to use P2C or something like that, without needing to maintain a socket for every backend.
There's still a problem to solve, because you don't want to be opening and closing connections to backends all the time - you want to reuse a small number of connections. So maybe you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics.
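A rough sketch of that loop (the load feed, pool size, and refresh policy are just placeholders):

```python
import random
import time

class ClientBalancer:
    """Sketch of the idea above: keep a small pool of open connections,
    spread requests with power-of-two-choices on reported load, and
    occasionally swap the most loaded pool member for a lightly loaded
    backend outside the pool. `load_of` stands in for whatever feed
    supplies per-backend load reports."""

    def __init__(self, pool, load_of, refresh_secs=30.0):
        self.pool = list(pool)
        self.load_of = load_of
        self.refresh_secs = refresh_secs
        self._last_refresh = time.monotonic()

    def pick(self):
        if len(self.pool) == 1:
            return self.pool[0]
        a, b = random.sample(self.pool, 2)          # power of two choices
        return a if self.load_of(a) <= self.load_of(b) else b

    def maybe_refresh(self, all_backends):
        now = time.monotonic()
        if now - self._last_refresh < self.refresh_secs:
            return
        self._last_refresh = now
        outside = [b for b in all_backends if b not in self.pool]
        if not outside:
            return
        lightest = min(outside, key=self.load_of)
        heaviest = max(self.pool, key=self.load_of)
        if self.load_of(lightest) < self.load_of(heaviest):
            self.pool.remove(heaviest)              # real code would close/open sockets here
            self.pool.append(lightest)
```

The interesting knobs are the refresh interval and how many members you swap per refresh, which is exactly the churn question that comes up further down the thread.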
This is a bit like what Google does, though without multicast. Inactive RPC channels are downgraded from TCP to UDP and the health/load packets are sent less often. A client can maintain a smaller active subset of channels but still have complete load data for all its peers.
Google doesn't do the UDP stuff anymore. IIRC they never did it for channels not in the subset, and I don't think it'd make sense to, given the goal of having a stable subset choice. And they haven't been able to actually send RPCs over a channel in UDP mode since LOAS, IIRC, so the UDP stuff was useless and eventually removed. I'm not sure it was ever that useful anyway...
They do still deactivate inactive channels, just entirely rather than downgrading them to UDP. Annoyingly, they only compute this inactivity for channels to individual tasks, not the greater load-balancing channels, and freshly started backend tasks always come up as active. So if you have a lot of inactive clients, when one of your tasks restarts, the inactive clients all rush to connect to it, and the task sees noticeably higher health-checking load until the clients hit the inactivity timeout and disconnect again.
To more directly answer twic's question: I can think of a few reasons multicasting all the load reports probably doesn't make sense:
* That sounds like an awful lot of multicast groups to manage and/or a fair bit of multicast traffic. Let's say you have a cluster of 10,000 machines running 10,000 services averaging 100 tasks per service (and thus also averaging 100 tasks per machine). (Some services have far more than 100 tasks; some are tiny.) Each service might be a client of several other services and a backend to several other services. Do you have a multicast group for each service, and have each client task leave and join it when interested? I'm not a multicast expert but I don't think switches can handle that. More realistically, you'd have your own application-level distributor to manage that, which is a new point of failure. Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine. That might be workable (not a cluster-wide SPOF, each machine only handles 10,000 inbound messages per load reporting period, and each aggregator's total subscription count is at most some reasonable multiple of its 100 tasks) but adds moving pieces and state that I'd avoid without a good reason. edit: ...also, you'd need to do something different for inter-cluster traffic...
* They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that take O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it in a lockless way, and ideally not keep it in RAM at all.
* The biggest reason: session establishment is expensive in terms of latency (particularly for inter-cluster stuff where you don't want multiple round trips) and CPU (public-key crypto, and I don't think an equivalent of TLS session resumption to avoid this would help too often). That's why they talk about "minimal disruption" being valuable. So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
> Do you have a multicast group for each service, and have each client task leave and join it when interested?
That's what I was thinking, yes.
> I'm not a multicast expert but I don't think switches can handle that.
As in it's impossible, or that it's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
The total number of load messages sent is smaller in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
> Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine.
Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
> They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that take O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it in a lockless way, and ideally not keep it in RAM at all.
I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access. Particularly relative to the cost of doing an RPC call.
> So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
> As in it's impossible, or that it's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
Millions of memberships in total. That seems far enough out of the norm that I'm skeptical it'd be well-supported on standard switches. I could be wrong though.
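Back-of-the-envelope with the numbers upthread (the per-task fan-out is an assumption):

```python
machines = 10_000
tasks_per_machine = 100
backend_services_per_task = 5   # assumed: how many services an average task calls

client_tasks = machines * tasks_per_machine              # 1,000,000 tasks
memberships = client_tasks * backend_services_per_task   # 5,000,000 group memberships
print(f"{memberships:,}")
```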
> Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
Yeah, that's the idea. I don't know either, but I'd expect bundling to make them significantly cheaper due to fewer wake-ups, better cache effectiveness, etc.
> The total number of load messages sent is smaller in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
In the model in the article, the load reports can ride along with an RPC reply (when active) or health check reply (when not), so I don't think they cost much to send or receive.
> I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access.
It might be more about lock contention than total CPU usage.
> As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
Sorry, somehow I missed the second paragraph of your original message. But it seems like you wouldn't want those periodic updates to be too slow (not correcting problems quickly enough) or too fast (too much churn, maybe also a "thundering herd" causing oscillations in a backend task's channel count). I'm not sure whether a sweet spot exists without some prototyping. Maybe it is workable. But Google and Twitter have both described approaches that give a good-enough subset choice right from the beginning, so why mess around with the periodic updates and the extra moving parts and state (IGMP memberships and/or the aggregator thing)?
Control theory is mostly built around physical devices and analog variables, so it often doesn't fit software systems neatly.
You're right nonetheless, though. Large networks contain a lot of caching of different kinds and other forms of data replication.
Caches warm up by transferring data, and the bandwidth available to do so is never infinite, especially when flipping traffic between whole datacenters.
Most load balancing systems are simply unaware of this.
I often wonder this as well: is it a matter of unfamiliarity with control-theory stuff? Surely not, as it's fairly well known, or at least it seems like something you'd encounter fairly quickly upon researching the problem.
It seems to me that control theory would be fantastic for setting and automatically adjusting some of these parameters. Distributing the same PID params and model to each client would ensure consistent/predictable behaviour as well.
I am going to massively oversimplify, but here goes. The really big idea from control theory is negative feedback. This basically boils down to measuring the output of your system and making the input to the system some function of the difference between the reference (setpoint) signal and the measured output.
The parent post refers to PID control, which names the three most commonly used ways of acting on that difference (or error) signal.
P is for proportional - you just multiply the error signal by a constant. This encourages the system to track the setpoint, though on its own it can leave some lag and a residual offset.
I is for integral - you integrate the error signal over time. Persistent error keeps accumulating, which is what eventually eliminates that offset and closes the gap between input and output.
D is for derivative - you feed back the derivative of the error signal. This damps down swings in the system, especially those that come from being too aggressive with the first two knobs.
Good controller design often comes down to picking the right weights for each of the three feedback terms you feed into the system. So in this example, it might be that you distribute your requests to servers based on how over- or under-loaded the servers are...
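To make the three terms concrete, here's a textbook discrete-time PID step in Python; the gains and how you'd map its output onto request weights are left open:

```python
class PID:
    """u = Kp*e + Ki*integral(e) + Kd*de/dt, with e = setpoint - measured."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measured, dt):
        error = setpoint - measured                  # negative feedback: act on the difference
        self.integral += error * dt                  # accumulate past error
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error                      # remember for the next D term
        return (self.kp * error                      # P: react to the present error
                + self.ki * self.integral            # I: remove steady-state offset
                + self.kd * derivative)              # D: damp rapid swings
```

In the load-balancing reading of this, `measured` might be a server's reported load and the output a weight nudging how much traffic you send it.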
It's unclear to me how a PID controller applies to the problem described in the article: having each client choose a subset of backend tasks in a way that will be stable for a long time. PID controllers deal with a scalar setpoint, scalar measured variable, and scalar control output and adjust frequently. How is that helpful at all in choosing this subset? It's certainly not a straightforward application of standard literature that the authors completely ignored, as suggested by siscia's grandparent comment.
> So in this example, it might be that you distribute your requests to servers based on how over- or under-loaded the servers are.
They had a working approach to distributing requests to servers, described at the beginning of the article. It's sessions (aka connections or channels in Google's literature) they're focusing on.
Responding also to siscia's grandchild comment:
> Or you can spin up more machines. The integral and derivative parts will tell you when to spin up more machines or when to tear down the ones you already have!
You're describing something like Google's autopilot, which I just linked to in another comment. A PID controller might make sense there, but it's only tangentially related to the problem described in this article.
I personally was thinking of applying a controller to each client machine and having it control the number of connections to the backend.
You can control on the P90 latency and increase or decrease the number of connections to backend machines accordingly.
Similarly, the backend can decide to drop connections if it sees that reply latency is too high, or CPU usage is too high, or whatever other metric makes sense.
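A rough sketch of one such control step on the client side (the latency target, gains, and bounds are made-up placeholders):

```python
def adjust_connection_count(current, observed_p90_ms, target_p90_ms, state,
                            kp=0.02, ki=0.005, min_conns=2, max_conns=64):
    """One step of a simple PI controller: p90 latency above the target
    nudges the client toward opening more backend connections, below it
    toward closing some. All constants here are illustrative."""
    error = observed_p90_ms - target_p90_ms              # positive when we're too slow
    state["integral"] = state.get("integral", 0.0) + error
    delta = kp * error + ki * state["integral"]
    new_count = int(round(current + delta))
    return max(min_conns, min(max_conns, new_count))

# e.g. called once per reporting interval:
# conns = adjust_connection_count(conns, observed_p90_ms=120.0,
#                                 target_p90_ms=80.0, state=ctrl_state)
```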
I can't tell whether a similar system would ever reach a stable-enough state, though.
Are the integral and derivative terms important for server load balancing? If you can't handle a sudden increase in load, then you need to implement throttling rather than load balancing.
Unfortunately this wouldn't be practical when the client task count changes frequently. [1] You wouldn't want all the client tasks to recalculate their subset every time, so anything that requires each client task to know the total client task count is a dead end. Likewise, when the backend task count changes, client tasks shouldn't change whether the backend tasks that existed before and still exist are in their chosen subset.