Google doesn't do the UDP stuff anymore. IIRC, they never did it for channels not in the subset, and I don't think it'd make sense to, given the goal of having a stable subset choice. And IIRC they haven't been able to actually send RPCs over a channel in UDP mode since LOAS, so the UDP stuff was useless and eventually removed. I'm not sure it was ever that useful anyway...
They do still deactivate inactive channels, just entirely rather than downgrading them to UDP. Annoyingly, they only compute this inactivity for channels to individual tasks, not the greater load-balancing channels, and freshly started backend tasks always come up as active. So if you have a lot of inactive clients, when a task restarts, the inactive clients all rush to connect to it, and it sees noticeably higher health-checking load until the clients hit the inactivity timeout and disconnect again.
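For concreteness, here's a rough Go sketch of the kind of per-channel idle tracking I mean. The type names, fields, and timeout are all invented for illustration, not the real client library:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// subChannel is a made-up stand-in for a client's connection to one backend task.
type subChannel struct {
	mu       sync.Mutex
	lastUsed time.Time
}

// newSubChannel models connecting to a task: note it starts out counted as
// "active", because lastUsed is initialized to the connect time.
func newSubChannel() *subChannel {
	return &subChannel{lastUsed: time.Now()}
}

// markUsed is called on every RPC sent over this channel.
func (c *subChannel) markUsed() {
	c.mu.Lock()
	c.lastUsed = time.Now()
	c.mu.Unlock()
}

// reapIfIdle disconnects if no RPC has used the channel within idleTimeout.
// Because a fresh channel counts as active for a full timeout period, a
// restarted backend sees a burst of connections (and health-check load) from
// idle clients before they time out and drop off again.
func (c *subChannel) reapIfIdle(idleTimeout time.Duration, disconnect func()) {
	c.mu.Lock()
	idle := time.Since(c.lastUsed) > idleTimeout
	c.mu.Unlock()
	if idle {
		disconnect()
	}
}

func main() {
	c := newSubChannel()
	c.markUsed()
	time.Sleep(50 * time.Millisecond)
	c.reapIfIdle(10*time.Millisecond, func() { fmt.Println("idle: disconnecting") })
}
```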
To more directly answer twic's question: I can think of a few reasons multicasting all the load reports probably doesn't make sense:
* That sounds like an awful lot of multicast groups to manage and/or a fair bit of multicast traffic. Let's say you have a cluster of 10,000 machines running 10,000 services averaging 100 tasks per service (and thus also averaging 100 tasks per machine). (Some services have far more than 100 tasks; some are tiny.) Each service might be a client of several other services and a backend to several other services. Do you have a multicast group for each service, and have each client task join and leave it when interested? I'm not a multicast expert but I don't think switches can handle that. More realistically you'd have your own application-level distributor to manage that, which is a new point of failure. Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine. That might be workable (not a cluster-wide SPOF, each machine only handles 10,000 inbound messages per load reporting period, and each aggregator's total subscription count is at most some reasonable multiple of its 100 tasks) but adds moving pieces and state that I'd avoid without a good reason. edit: ...also, you'd need to do something different for inter-cluster traffic...
* They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that are O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it locklessly, and ideally not keep it in RAM at all. (There's a small sketch of what I mean after this list.)
* The biggest reason: session establishment is expensive in terms of latency (particularly for inter-cluster stuff where you don't want multiple round trips) and CPU (public-key crypto, and I don't think an equivalent of TLS session resumption to avoid this would help too often). That's why they talk about "minimal disruption" being valuable. So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
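Here's a minimal sketch of the sort of picker I mean by that second bullet; this is entirely my own Go illustration, not Google's code. The point is that the hot path touches only two randomly chosen subset entries and an atomic counter on each, never an O(backend_tasks) structure or a lock shared with slower bookkeeping:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// backend is one task in this client's subset. The only state the pick path
// touches is an atomic outstanding-request counter, so picks never take a
// subset-wide lock and never scan all backends.
type backend struct {
	addr        string
	outstanding atomic.Int64
}

// pickP2C chooses two distinct backends at random and returns the one with
// fewer outstanding requests ("power of two choices"). Assumes len(subset) >= 2.
func pickP2C(subset []*backend) *backend {
	i := rand.Intn(len(subset))
	j := rand.Intn(len(subset) - 1)
	if j >= i {
		j++
	}
	a, b := subset[i], subset[j]
	if b.outstanding.Load() < a.outstanding.Load() {
		a = b
	}
	a.outstanding.Add(1) // the caller does Add(-1) when the RPC completes
	return a
}

func main() {
	subset := []*backend{{addr: "10.0.0.1:443"}, {addr: "10.0.0.2:443"}, {addr: "10.0.0.3:443"}}
	for k := 0; k < 5; k++ {
		fmt.Println("picked", pickP2C(subset).addr)
	}
}
```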
> Do you have a multicast group for each service, and have each client task join and leave it when interested?
That's what I was thinking, yes.
> I'm not a multicast expert but I don't think switches can handle that.
As in it's impossible, or that's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
The total number of load messages sent is lower in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
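A quick back-of-envelope for that, with made-up numbers (the interested-client and per-backend connection counts are pure guesses on my part):

```go
package main

import "fmt"

func main() {
	// All numbers are illustrative guesses, not measurements.
	const (
		backendTasks      = 100  // tasks in the backend service
		interestedClients = 1000 // client tasks that talk to this service at all
		connectedClients  = 10   // clients that have any given backend in their subset
	)

	// Article's model: each backend reports only to the clients connected to it.
	unicastSent := backendTasks * connectedClients
	unicastDelivered := unicastSent // unicast: delivered == sent

	// Multicast model: one send per backend per period, received by every
	// interested client whether or not that backend is in its subset.
	multicastSent := backendTasks
	multicastDelivered := backendTasks * interestedClients

	fmt.Println("unicast   sent:", unicastSent, " delivered:", unicastDelivered)
	fmt.Println("multicast sent:", multicastSent, " delivered:", multicastDelivered)
	fmt.Println("delivered ratio:", multicastDelivered/unicastDelivered) // 100x with these guesses
}
```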
> Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine.
Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
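Roughly what I'm picturing for the bundling (all the names and fields are invented): each machine packs its local tasks' reports into one datagram per reporting period and broadcasts that, so peers receive one packet per machine rather than one per task, but every machine still gets every service's reports whether its local clients care or not:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskLoad is one backend task's report; the field names are invented.
type taskLoad struct {
	Service string  `json:"service"`
	Task    int     `json:"task"`
	Util    float64 `json:"util"` // e.g. CPU utilization in [0,1]
}

// bundle is what a machine-level aggregator would broadcast once per period:
// every local task's report in a single datagram.
type bundle struct {
	Machine string     `json:"machine"`
	Loads   []taskLoad `json:"loads"`
}

func main() {
	b := bundle{
		Machine: "rack1-machine42",
		Loads: []taskLoad{
			{Service: "search-frontend", Task: 3, Util: 0.41},
			{Service: "ads-mixer", Task: 17, Util: 0.07},
		},
	}
	msg, _ := json.Marshal(b)
	fmt.Printf("one datagram per machine per period, %d bytes here\n", len(msg))
	// With 10,000 machines, each machine receives ~10,000 bundles per period,
	// regardless of which services its local clients actually use.
}
```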
> They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that are O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it locklessly, and ideally not keep it in RAM at all.
I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access, particularly relative to the cost of doing an RPC call.
> So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
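As a sketch of what I mean (my own illustration, with an assumed map of backend address to latest reported utilization): P2C handles per-request placement over the live connections, and a slow timer occasionally swaps the most loaded subset member for a lightly loaded outsider:

```go
package main

import (
	"fmt"
	"sort"
)

// load maps backend address to its latest reported utilization, fed by the
// multicast load reports (assumed to exist; this is just the consumer side).
type load map[string]float64

// refreshSubset drops the most loaded current member and adds the least
// loaded non-member, keeping the subset size constant. Run this on a slow
// timer, not per request.
func refreshSubset(subset []string, all load) []string {
	inSubset := map[string]bool{}
	for _, b := range subset {
		inSubset[b] = true
	}
	sort.Slice(subset, func(i, j int) bool { return all[subset[i]] < all[subset[j]] })

	best, bestLoad := "", 2.0
	for addr, l := range all {
		if !inSubset[addr] && l < bestLoad {
			best, bestLoad = addr, l
		}
	}
	// Only swap if the outsider is meaningfully less loaded than our worst member.
	worst := subset[len(subset)-1]
	if best != "" && bestLoad+0.1 < all[worst] {
		subset[len(subset)-1] = best // real code would also open/close connections
	}
	return subset
}

func main() {
	all := load{"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
	fmt.Println(refreshSubset([]string{"a", "c"}, all)) // [c d]
}
```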
> As in it's impossible, or that's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
Millions of memberships in total: 10,000 machines x 100 client tasks per machine x a handful of groups each works out to a few million. That seems far enough out of the norm that I'm skeptical it'd be well supported on standard switches. I could be wrong, though.
> Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
Yeah, that's the idea. I don't know either, but I'd expect bundling would make them significantly cheaper due to fewer wake-ups, better cache effectiveness, etc.
> The total number of load messages sent is lower in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
In the model in the article, the load reports can ride along with an RPC reply (when active) or health check reply (when not), so I don't think they cost much to send or receive.
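To illustrate what "ride along" means, a toy Go sketch (my own invention, not the actual Stubby wire format; I believe gRPC's ORCA-style per-response backend metrics are the closest public analogue): the backend attaches its current utilization to a reply it was going to send anyway, so there's no extra packet or connection:

```go
package main

import "fmt"

// reply is a stand-in for an RPC response or health-check response; the
// field names are invented for illustration.
type reply struct {
	Payload []byte
	// Piggybacked load report: no separate message needed, because the
	// client is already receiving this reply.
	BackendUtil float64
}

func handle(req []byte, currentUtil func() float64) reply {
	return reply{
		Payload:     append([]byte("response to: "), req...),
		BackendUtil: currentUtil(),
	}
}

func main() {
	r := handle([]byte("ping"), func() float64 { return 0.37 })
	fmt.Printf("%s (backend util %.2f)\n", r.Payload, r.BackendUtil)
}
```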
> I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access.
It might be more about lock contention than total CPU usage.
> As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
Sorry, somehow I missed the second paragraph of your original message. But it seems like you wouldn't want those periodic updates to be too slow (not correcting problems quickly enough) or too fast (too much churn, and maybe a "thundering herd" causing oscillations in a backend task's channel count). I'm not sure whether a sweet spot exists without some prototyping; maybe it is workable. But Google and Twitter have both described approaches that give a good-enough subset choice right from the beginning, so why mess around with the periodic updates and the extra moving parts and state (IGMP memberships and/or the aggregator thing)?