Google doesn't do the UDP stuff anymore. IIRC, they never did it for channels not in the subset, and I don't think it'd make sense to, given the goal of having a stable subset choice. And IIRC they haven't been able to actually send RPCs over a channel in UDP mode since LOAS, so the UDP stuff was useless and eventually removed. I'm not sure it was ever that useful anyway...
They do still deactivate inactive channels, just entirely rather than downgrading them to UDP. Annoyingly, they only compute this inactivity for channels to individual tasks, not the greater load-balancing channels, and freshly started backend tasks always come up as active. So if you have a lot of inactive clients, when a task restarts, the inactive clients all rush to connect to it, and it sees noticeably higher health-checking load until the clients hit the inactivity timeout and disconnect again.
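For concreteness, here's a rough Go sketch of the kind of per-channel idle tracking I mean. The type names, fields, and timeout are all invented for illustration, not the real client library:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// subChannel is a made-up stand-in for a client's connection to one backend task.
type subChannel struct {
	mu       sync.Mutex
	lastUsed time.Time
}

// newSubChannel models connecting to a task: note it starts out counted as
// "active", because lastUsed is initialized to the connect time.
func newSubChannel() *subChannel {
	return &subChannel{lastUsed: time.Now()}
}

// markUsed is called on every RPC sent over this channel.
func (c *subChannel) markUsed() {
	c.mu.Lock()
	c.lastUsed = time.Now()
	c.mu.Unlock()
}

// reapIfIdle disconnects if no RPC has used the channel within idleTimeout.
// Because a fresh channel counts as active for a full timeout period, a
// restarted backend sees a burst of connections (and health-check load) from
// idle clients before they time out and drop off again.
func (c *subChannel) reapIfIdle(idleTimeout time.Duration, disconnect func()) {
	c.mu.Lock()
	idle := time.Since(c.lastUsed) > idleTimeout
	c.mu.Unlock()
	if idle {
		disconnect()
	}
}

func main() {
	c := newSubChannel()
	c.markUsed()
	time.Sleep(50 * time.Millisecond)
	c.reapIfIdle(10*time.Millisecond, func() { fmt.Println("idle: disconnecting") })
}
```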
To more directly answer twic's question: I can think of a few reasons multicasting all the load reports probably doesn't make sense:
* That sounds like an awful lot of multicast groups to manage and/or a fair bit of multicast traffic. Let's say you have a cluster of 10,000 machines running 10,000 services averaging 100 tasks per service (and thus also averaging 100 tasks per machine). (Some services have far more than 100 tasks; some are tiny.) Each service might be a client of several other services and a backend to several other services. Do you have a multicast group for each service, and have each client task join and leave it when interested? I'm not a multicast expert but I don't think switches can handle that. More realistically you'd have your own application-level distributor to manage that, which is a new point of failure. Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine. That might be workable (not a cluster-wide SPOF, each machine only handles 10,000 inbound messages per load reporting period, and each aggregator's total subscription count is at most some reasonable multiple of its 100 tasks) but adds moving pieces and state that I'd avoid without a good reason. edit: ...also, you'd need to do something different for inter-cluster traffic...
* They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that are O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it locklessly, and ideally not keep it in RAM at all. (There's a small sketch of what I mean after this list.)
* The biggest reason: session establishment is expensive in terms of latency (particularly for inter-cluster stuff where you don't want multiple round trips) and CPU (public-key crypto, and I don't think an equivalent of TLS session resumption to avoid this would help too often). That's why they talk about "minimal disruption" being valuable. So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
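Here's a minimal sketch of the sort of picker I mean by that second bullet; this is entirely my own Go illustration, not Google's code. The point is that the hot path touches only two randomly chosen subset entries and an atomic counter on each, never an O(backend_tasks) structure or a lock shared with slower bookkeeping:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// backend is one task in this client's subset. The only state the pick path
// touches is an atomic outstanding-request counter, so picks never take a
// subset-wide lock and never scan all backends.
type backend struct {
	addr        string
	outstanding atomic.Int64
}

// pickP2C chooses two distinct backends at random and returns the one with
// fewer outstanding requests ("power of two choices"). Assumes len(subset) >= 2.
func pickP2C(subset []*backend) *backend {
	i := rand.Intn(len(subset))
	j := rand.Intn(len(subset) - 1)
	if j >= i {
		j++
	}
	a, b := subset[i], subset[j]
	if b.outstanding.Load() < a.outstanding.Load() {
		a = b
	}
	a.outstanding.Add(1) // the caller does Add(-1) when the RPC completes
	return a
}

func main() {
	subset := []*backend{{addr: "10.0.0.1:443"}, {addr: "10.0.0.2:443"}, {addr: "10.0.0.3:443"}}
	for k := 0; k < 5; k++ {
		fmt.Println("picked", pickP2C(subset).addr)
	}
}
```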
> Do you have a multicast group for each service, and have each client task join and leave it when interested?
That's what I was thinking, yes.
> I'm not a multicast expert but I don't think switches can handle that.
As in it's impossible, or that's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
The total number of load messages sent is lower in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
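A quick back-of-envelope for that, with made-up numbers (the interested-client and per-backend connection counts are pure guesses on my part):

```go
package main

import "fmt"

func main() {
	// All numbers are illustrative guesses, not measurements.
	const (
		backendTasks      = 100  // tasks in the backend service
		interestedClients = 1000 // client tasks that talk to this service at all
		connectedClients  = 10   // clients that have any given backend in their subset
	)

	// Article's model: each backend reports only to the clients connected to it.
	unicastSent := backendTasks * connectedClients
	unicastDelivered := unicastSent // unicast: delivered == sent

	// Multicast model: one send per backend per period, received by every
	// interested client whether or not that backend is in its subset.
	multicastSent := backendTasks
	multicastDelivered := backendTasks * interestedClients

	fmt.Println("unicast   sent:", unicastSent, " delivered:", unicastDelivered)
	fmt.Println("multicast sent:", multicastSent, " delivered:", multicastDelivered)
	fmt.Println("delivered ratio:", multicastDelivered/unicastDelivered) // 100x with these guesses
}
```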
> Maybe you'd have all the backend tasks on a machine report their load to a machine-level aggregator (rather than a cluster-wide one), which broadcasts it to all the other machines in the cluster, and then fans out from there to all the interested clients on that machine.
Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
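Roughly what I'm picturing for the bundling (all the names and fields are invented): each machine packs its local tasks' reports into one datagram per reporting period and broadcasts that, so peers receive one packet per machine rather than one per task, but every machine still gets every service's reports whether its local clients care or not:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskLoad is one backend task's report; the field names are invented.
type taskLoad struct {
	Service string  `json:"service"`
	Task    int     `json:"task"`
	Util    float64 `json:"util"` // e.g. CPU utilization in [0,1]
}

// bundle is what a machine-level aggregator would broadcast once per period:
// every local task's report in a single datagram.
type bundle struct {
	Machine string     `json:"machine"`
	Loads   []taskLoad `json:"loads"`
}

func main() {
	b := bundle{
		Machine: "rack1-machine42",
		Loads: []taskLoad{
			{Service: "search-frontend", Task: 3, Util: 0.41},
			{Service: "ads-mixer", Task: 17, Util: 0.07},
		},
	}
	msg, _ := json.Marshal(b)
	fmt.Printf("one datagram per machine per period, %d bytes here\n", len(msg))
	// With 10,000 machines, each machine receives ~10,000 bundles per period,
	// regardless of which services its local clients actually use.
}
```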
> They mention using power of two choices to decrease the amount of state ("contentious data structures") a client has to deal with. I think mostly they mean not doing O(backend_tasks) operations on the RPC hot path or contending on a lock ever taken by operations that are O(backend_tasks), but even for less frequent load updates I suspect they'd rather not touch this state at all, or at least only touch it locklessly, and ideally not keep it in RAM at all.
I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access, particularly relative to the cost of doing an RPC call.
> So if you had perfect information for the load of all the servers, not just the ones in your current subset, what would you do with it anyway?
As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
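As a sketch of what I mean (my own illustration, with an assumed map of backend address to latest reported utilization): P2C handles per-request placement over the live connections, and a slow timer occasionally swaps the most loaded subset member for a lightly loaded outsider:

```go
package main

import (
	"fmt"
	"sort"
)

// load maps backend address to its latest reported utilization, fed by the
// multicast load reports (assumed to exist; this is just the consumer side).
type load map[string]float64

// refreshSubset drops the most loaded current member and adds the least
// loaded non-member, keeping the subset size constant. Run this on a slow
// timer, not per request.
func refreshSubset(subset []string, all load) []string {
	inSubset := map[string]bool{}
	for _, b := range subset {
		inSubset[b] = true
	}
	sort.Slice(subset, func(i, j int) bool { return all[subset[i]] < all[subset[j]] })

	best, bestLoad := "", 2.0
	for addr, l := range all {
		if !inSubset[addr] && l < bestLoad {
			best, bestLoad = addr, l
		}
	}
	// Only swap if the outsider is meaningfully less loaded than our worst member.
	worst := subset[len(subset)-1]
	if best != "" && bestLoad+0.1 < all[worst] {
		subset[len(subset)-1] = best // real code would also open/close connections
	}
	return subset
}

func main() {
	all := load{"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
	fmt.Println(refreshSubset([]string{"a", "c"}, all)) // [c d]
}
```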
> As in it's impossible, or that's too much load? That would be 10 000 groups, each with 100 senders and some number of hundreds of receivers, but with membership changing relatively slowly. There would be lots of packets to move, but not a lot of IGMP to snoop, I think.
Millions of memberships in total: 10,000 machines x 100 client tasks per machine x a handful of groups each works out to a few million. That seems far enough out of the norm that I'm skeptical it'd be well supported on standard switches. I could be wrong, though.
> Service mesh style? I think the main effect is to bundle the messages into bigger packets, using a single group, since you still need to deliver the same set of messages to every machine. You actually end up delivering more messages, since there isn't any selectivity. I honestly don't know how this pans out!
Yeah, that's the idea. I don't know either, but I'd expect bundling would make them significantly cheaper due to fewer wake-ups, better cache effectiveness, etc.
> The total number of load messages sent is lower in this model than in the model in the article, I think, because each backend sends a single multicast message instead of N unicast TCP messages. The total number of messages delivered will be much higher - maybe about a hundred times higher?
In the model in the article, the load reports can ride along with an RPC reply (when active) or health check reply (when not), so I don't think they cost much to send or receive.
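To illustrate what "ride along" means, a toy Go sketch (my own invention, not the actual Stubby wire format; I believe gRPC's ORCA-style per-response backend metrics are the closest public analogue): the backend attaches its current utilization to a reply it was going to send anyway, so there's no extra packet or connection:

```go
package main

import "fmt"

// reply is a stand-in for an RPC response or health-check response; the
// field names are invented for illustration.
type reply struct {
	Payload []byte
	// Piggybacked load report: no separate message needed, because the
	// client is already receiving this reply.
	BackendUtil float64
}

func handle(req []byte, currentUtil func() float64) reply {
	return reply{
		Payload:     append([]byte("response to: "), req...),
		BackendUtil: currentUtil(),
	}
}

func main() {
	r := handle([]byte("ping"), func() float64 { return 0.37 })
	fmt.Printf("%s (backend util %.2f)\n", r.Payload, r.BackendUtil)
}
```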
> I was a bit confused by this, because I don't think it's a lot of state (one number per backend), so it doesn't seem like it would be hard to maintain and access.
It might be more about lock contention than total CPU usage.
> As I said, you keep a set of connections open, distribute requests between them using P2C or a weighted random choice, but also periodically update the set according to load statistics. This means you're not stuck with a static set of backends.
Sorry, somehow I missed the second paragraph of your original message. But it seems like you wouldn't want those periodic updates to be too slow (not correcting problems quickly enough) or too fast (too much churn, and maybe a "thundering herd" causing oscillations in a backend task's channel count). I'm not sure whether a sweet spot exists without some prototyping; maybe it is workable. But Google and Twitter have both described approaches that give a good-enough subset choice right from the beginning, so why mess around with the periodic updates and the extra moving parts and state (IGMP memberships and/or the aggregator thing)?