
QUIC costs something like 2x to 4x as much CPU time per byte as TCP when serving large files or streams. This is because the anti-middlebox protections also mean that the modern network hardware and software offloads which greatly reduce CPU time cannot work with QUIC. Combined with the fact that QUIC is implemented in userspace, that's just deadly for performance. I'm talking about TSO, LRO (aka GRO), kTLS, and kTLS + hardware encryption.

Let's compare a 100MB file served via TCP to the same file served via QUIC.

  TCP:
  - web server sends 2MB at a time, 50 times, via async sendfile (50 syscalls & kqueue notifications)
  - kernel reads data from disk and encrypts it.  The data is read once and written once by kTLS in the kernel.
  - TCP sends data to the NIC in largish chunks, 1.5k to 64k at a time; let's say an average of 16k.  So the network stack runs 6,250 times to transmit.
  - The client acks every other packet, so that's 33,333 acks.  Let's say they are collapsed 2:1 by LRO, so the TCP stack runs 16,666 times to process acks.

  QUIC:
  - web server mmaps or read()s the file, encrypts it in userspace, and sends it 1,500 bytes at a time (1 extra memory copy & 66,666 system calls)
  - UDP stack runs 66,666 times to send data
  - UDP stack runs 33,333 times to receive QUIC acks (no idea what the aggregation is; let's say 2:1)
  - kernel wakes up web server to process QUIC acks 33,333 times.
So for QUIC we have:

  - 4x as many network stack traversals due to the lack of TSO/LRO.
  - 1000x as many system calls, due to doing all the packet handling in userspace
  - at least one more data copy (kernel -> user) due to data handling in userspace.
Some of these can be solved, by either moving QUIC into the kernel, or by using a DPDK-like userspace networking solution. However, the lack of TSO/LRO even by itself is a killer for performance.

Disclaimer: I work on CDN performance. We've served 90Gb/s with a 12-core Xeon-D. To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUs. I guess that Google can afford this.




In addition to those downsides, the QUIC spec points out that middleboxes tend to time out UDP streams pretty aggressively, so it recommends a ping timer of 10 seconds.

Additionally, since QUIC streams allow for client IP mobility, that creates an additional challenge for IP-level load balancing as well as handling at the host level. In a well-configured host, TCP packets for a given stream will always arrive at the same NIC queue, on the same CPU, allowing the TCP data structures to be local to that CPU and avoiding cross-CPU locks. In QUIC, the next packet can come from a new IP, which could be ECMP-routed to a different host, or arrive on a different NIC queue and a different CPU. Perhaps your ECMP router and NIC can be taught to look for the QUIC connection IDs, but that doesn't seem at all certain.


That's not really a fair comparison. In the case where the IP changes for QUIC, TCP would have to completely re-establish the connection. A cross-core memory access is tiny in comparison.


Thanks for sharing the insight... This being HN, it doesn't necessarily read to me like a disadvantage for QUIC the protocol as much as an opportunity for someone to come up with a way to do hardware-assisted QUIC in the networking interface...


My first thought as well. So much so that I fully expect we are going to see multiple companies pop up in the coming years that will take a shot at making said hardware.


> and sends it 1500b at a time

sendmmsg (or the upcoming io_uring) lets you send multiple UDP packets with a single syscall.


While this is useful, I don't think it would completely resolve the noted "tons of send syscalls" issue. QUIC performs its own flow control, and I don't think it can just send all the packets composing a file at once (all the time, at least).


If your server handles many connections simultaneously, you can still bundle a lot of packets in a single sendmmsg syscall; it can dispatch to a different destination address for each packet.


But I think each of these UDP packets will still travel separately from the syscall layer to the NIC (e.g., no TSO). So you're still a factor of 40 or so behind TCP + TSO.


I think in general I agree. However, the overhead numbers are exaggerated, and we should be fair about that. E.g., it was already mentioned that multiple UDP packets can be transmitted via a single syscall, and reasonable implementations can make use of that. I haven't read the QUIC spec (yet), so I don't know how much data can be aggregated without waiting for ACKs or interleaving other data - but if it's anything comparable to HTTP/2 then it should be configurable and support >= 64kB chunks.

I also don't think a QUIC server would read the whole file into user-space at once - that's just a giant memory waste. Rather it would be streamed and chunks would get encrypted. That process requires of course an extra copy (likely even two for the unencrypted and encrypted version), but that's the same for all user-space file serving and encryption options and nothing new due to QUIC. For KTLS it would need to get investigated whether the kernel solution doesn't also perform a copy somewhere (I honestly don't know).


Of course it is not going to read the entire file at once.

Having written the FreeBSD kernel TLS, I can assure you that there is no extra copy. Data is brought into the kernel via DMA from storage into a page in the VM page cache. When the IO is done, it is then encrypted into a connection-private page. That page is then sent and DMA'ed to the network adapter. So we have in the kernel TLS case:

  - memory DMA to kernel mem from storage.
  - memory READ from kernel mem to read plaintext for crypto
  - memory write to another chunk of kernel mem to write encrypted data
  - memory DMA from kernel mem to NIC
In the case where the NIC supports inline TLS offload, the middle 2 steps are skipped, and it devolves to essentially the unencrypted case.

For QUIC you have:

  - memory DMA to kernel mem from storage
  - memory read from kernel mem via mmap
  - memory write to userspace mem to write encrypted data
  - memory read from userspace mem to copy to kernel
  - memory write to kernel mem
  - memory DMA from kernel mem to NIC
So you go from 3 "copies" to 4 "copies", which increases memory bandwidth demands by 33%.

Right now, we can just barely serve 100g from a Xeon-D because Intel limited the memory bandwidth to DDR4-2400. At an effective bandwidth limit of 60GB/sec, that's on the edge of being able to handle the kernel TLS data path. So even if everything else about QUIC was free, this extra memory copy from userspace would cut bandwidth by a third.


Good to know. Thanks for the explanation and all the insights!


What does the TLS vs QUIC look like though?


There's no reason the offloads can't work with QUIC. Linux already has UDP GSO (https://lwn.net/Articles/752184/). There's no technical reason I can think of that kTLS cannot be implemented for UDP on Linux, it's just not there today.

There are also more general efforts underway on Linux to reduce the system call and copying overhead of processing packets in userspace. TPACKET_V3 is an easy way to vastly increase the scalability of UDP recv processing with minimal application changes. AF_XDP is much more extreme, but it is going to be more implementable than the older DPDK-style semantics. It effectively puts packet buffer management into userspace with the transport. But once you're doing that, you've recaptured much of the advantage that TCP gets from running in the kernel.


Two questions: can't large files continue to be served on HTTP2? and won't https://www.dpdk.org/ allow user-space network stacks to do segmentation, etc...? (Maybe it's too immature?)


Regarding the first question:

Even HTTP/2 involves some of the issues the parent mentions. HTTP/2 is not really helpful for large files, and might well perform worse than HTTP/1.1 due to the additional insertion and parsing of frame headers and flow-control handling. HTTP/2 helps small files most, by avoiding the overhead of connection establishment for those.


How much of this applies to 1~100KB responses?


1k, not so much since there is no aggregation that can happen there anyway.

100k is not that much different than 100MB, except the TCP window will not be as far open, so TSO will not be as effective.

Note that I work on a CDN that serves large media files, so I'm biased towards that workload.


Awesome analysis. This is the first time I've read about the downsides of QUIC. I'm curious whether implementing it in userspace was a conscious trade-off made knowing the performance downside, or whether Google/IETF wasn't aware of the problem at all?


Implementing new transport protocols in kernelspace has significant downsides for adoption. In fact, it has been a long time since anyone tried it.


Wouldn't pretty much all of that overhead compared to TCP vanish if QUIC was implemented in the kernel?


No, due to the lack of TSO/LRO. It's my understanding that QUIC is designed to encrypt packet metadata so that middleboxes cannot re-segment traffic. This same feature prevents NICs from doing TSO.


Ok thanks, that makes sense. For anyone else wondering what TSO is, see https://en.wikipedia.org/wiki/Large_send_offload

But again, couldn't there be NICs with QUIC offloading capabilities? Maybe this could even be done with firmware updates (I don't know how much of the TCP offloading is done in real hardware).


If the NIC is given the key for the connection, it can do the segmenting, encrypting and retransmissions.


This is just normal technological progress. CPU time is cheap and scalable, and the protocol will keep getting more optimized with better software and hardware. Similar issues were brought up with HTTP2 using TLS everywhere and messing with proxies but that's no longer a problem.

QUIC/HTTP3 as a protocol is a great improvement to actual internet performance for users which is what really matters.


Picking your comment as the newest instance but this is one of the dumbest memes I see in this thread.

Things don't automatically get better. It is hard work, it sucks, and it's not for everyone. It will take years to undo the damage of this transition. We will still be working on it in a decade. There are some very subtle gains, like the removal of head-of-line blocking. I'm not convinced that outweighs currently actualized improvements in TCP congestion control (BBR), and for any application I can think of, the places that really need something message-oriented seem better covered by WebRTC.

What you are really talking about is Full Employment Theorem.


Yes, progress obviously takes effort. What part is a "meme"? Leave that nonsense out of HN.

What "damage" are you talking about? The only issues are compatibility and increased resource utilization on the server-side, both of which will get better as usage increases. It's not a problem. We go through these cycles all the time with all kinds of technology and there's nothing special here.


It's thinking like that which leads to web page bloat. CPU resources aren't free, especially in an environmental capacity.


No it's not. Webpage bloat is a developer issue, not a technical problem.

QUIC is a new protocol designed to make user experiences better. There's a tradeoff in more server CPU, but that's cheaper, more scalable, and will only be short-term as things quickly improve. The actual comparison would be rendering engines and Javascript runtimes that have become more complicated to build and run but are faster and more functional in return.

Nobody would return back to the 2010 tech days just because some people decided to make fat websites.


> To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUS. I guess that Google can afford this.

Can you explain more about how the negatives you mention weigh up against the positives? There isn't a net benefit somewhere? If not, can something be changed to give a better balance like a hybrid solution?


Personally I believe that the majority of the positives cater to privacy. That being said, there are other positive aspects of IETF QUIC that will likely play into new functionality over time.

A good document outlining considerations can be found here: https://http3-explained.haxx.se/en/


tl;dr - QUIC doesn't have kernel or hardware support (yet).

These aren't intrinsic problems with QUIC, they're common to all new protocols.



