In ~2010, I was benchmarking Solarflare (Xilinx/AMD now) cards and their OpenOnload kernel-bypass network stack. The results showed that two well-tuned systems could communicate faster (lower latency) than two CPU sockets within the same server that had to wait for the kernel to get involved (standard network stack). It was really illuminating and I started re-architecting based on that result.
Backing out some of that history... in ~2008, we started using FPGAs to handle specific network loads (US equity market data). It was exotic and a lot of work, but it significantly benefited that use case, both because of DMA to user-land and its filtering capabilities.
At that time our network was all 1 Gigabit. Soon thereafter, exchanges started offering 10G handoffs, so we upgraded our entire infrastructure to 10G cut-through switches (Arista) and 10G NICs (Myricom). This performed much better than the 1G FPGA and dramatically improved our entire infrastructure.
We then ported our market data feed handlers to Myricom's user-space network stack, because loads were continually increasing and the trading world was becoming ever more competitive... and again we had a narrow solution (this time in software) to a challenging problem.
Then about a year later, Solarflare and its kernel-compatible OpenOnload arrived, and we could then apply the power of kernel bypass to our entire infrastructure.
After that, the industry returned to FPGAs again with 10G PHY and tons of space to put whole strategies... although I was never involved with that next generation of trading tech.
I personally stayed with OpenOnload for all sorts of workloads, growing to use it with containerization and web stacks (Redis, Nginx). Nowadays you can use OpenOnload with XDP; again a narrow technology grows to fit broad applicability.
I discovered something similar in the early to mid 2010s: processes on different Linux systems communicated over TCP faster than processes on the same Linux system. That is, going over the network was faster than on the same machine. The reason was simple: there is a global lock per system for the localhost pseudo-device.
Processes communicating on different systems had actual parallelism, because they each had their own network device. Processes communicating on the same system were essentially serialized, because they were competing with each other for the same pseudo-device. At the time, Linux kernel developers basically said "Yeah, don't do that" when people brought up the performance problem.
That makes sense; Linux has many highly performant IPC options that don't involve the network device. Just the time it takes to correctly construct an ethernet frame is not negligible.
OpenOnload supports user-space acceleration of pipes. For our custom applications, we initiate connections over unaccelerated UNIX sockets and then "upgrade" them to accelerated pipes. It's like setting up pairs of shared-memory queues, but all operated via POSIX calls including the ability to watch them with epoll.
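For anyone curious what that plumbing looks like, here's a minimal sketch of the generic POSIX side of it (nothing below is an Onload-specific API): create a pipe, hand one end to the peer over the already-connected UNIX socket with SCM_RIGHTS, and let the peer watch it with epoll like any other fd. For duplex you do this twice, one pipe per direction. Under Onload's LD_PRELOAD interposition, the pipe and epoll calls are what can get accelerated.

    /* Sketch: pass one end of a pipe to a peer over an (unaccelerated)
     * UNIX-domain socket, then watch the pipe with epoll. Plain POSIX;
     * any acceleration comes from Onload intercepting pipe()/epoll(). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Send file descriptor `fd` across a connected UNIX-domain socket. */
    static int send_fd(int sock, int fd) {
        char byte = 0;
        struct iovec iov = { &byte, 1 };
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct msghdr msg;
        memset(&msg, 0, sizeof msg);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof ctrl;
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd, sizeof(int));
        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    int main(void) {
        int p[2];
        if (pipe(p) != 0) { perror("pipe"); return 1; }
        /* ... connect unix_sock to the peer, then: send_fd(unix_sock, p[0]);
         * The peer receives the read end via recvmsg() and adds it to its
         * epoll set exactly like any other fd: */
        int ep = epoll_create1(0);
        struct epoll_event ev;
        ev.events = EPOLLIN;
        ev.data.fd = p[0];               /* on the peer this is the received fd */
        epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);
        write(p[1], "hi", 2);            /* this side just writes to p[1] */
        close(ep);
        return 0;
    }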
This is a really old example of this technique using boost::asio [1]. We adapted it from the one in our own reactor library to Onload-pipe-accelerate some OSS that used ASIO.
> Just the time it takes to correctly construct an ethernet frame is not negligible.
For that reason, Onload/TCPDirect (and other vendors too) have APIs that allow you to pre-allocate and prepare packets (headers and payload), then you can do some final tweaks -- like choosing a ticker and price in an order packet -- and then send it out with minimal latency.
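Roughly, the shape of those APIs is "build everything ahead of time, patch a few bytes, send." A hand-wavy sketch of that pattern (the wire layout, offsets, and nic_send_prepared() are placeholders, not any vendor's actual API):

    /* Sketch of the "prepare early, patch late" pattern those APIs enable.
     * Headers and static payload are built once at startup; the hot path
     * only patches the fields that change before handing off to the NIC. */
    #include <stdint.h>
    #include <string.h>

    struct order_msg {                  /* illustrative payload, not a real protocol */
        char     ticker[8];
        uint64_t price_nanos;
        uint32_t qty;
    };

    #define PAYLOAD_OFF 54              /* after Ethernet + IP + TCP headers (example) */
    static unsigned char pkt[256];      /* fully prepared packet, built at startup     */
    static struct order_msg tmpl;       /* payload template with static fields set     */

    void prepare_packet(void) {
        /* ... fill in headers and the static parts of tmpl here ... */
    }

    /* Hot path: patch only what changes, copy it in, and send. */
    void send_order(const char *ticker, uint64_t price_nanos, uint32_t qty) {
        memcpy(tmpl.ticker, ticker, sizeof tmpl.ticker);
        tmpl.price_nanos = price_nanos;
        tmpl.qty = qty;
        memcpy(pkt + PAYLOAD_OFF, &tmpl, sizeof tmpl);
        /* nic_send_prepared(pkt, PAYLOAD_OFF + sizeof tmpl);  // hypothetical vendor call */
    }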
In fact, if I recall correctly, it doesn't seem to. I remember writing a tool that analyzed pcap dumps and initially assuming there would be Ethernet frames at the bottom, only to run into the fact that this wasn't true on lo, which had its own frames, which I assume could vary by OS a bit.
There are many applications which are network transparent and would need extra code to know how to optimize when the other process happens to be on the same machine (e.g., using Sys V IPC or Unix domain sockets). There are good reasons to optimize the case of localhost.
I am always bemused when people imagine that communication over localhost involves packets hitting an ethernet adapter.
It has been many years since I tested this, but unix sockets seemed slower to me than loopback TCP, even.
If you need real speed you should use UDP/MC, or do what I have done with a lot of commercial middlewares that offer "shared memory" transports: messaging-paradigm APIs wrapping various configurable messaging middlewares to provide a fairly unified interface to the application's messaging concepts.
>It has been many years since I tested this, but unix sockets seemed slower to me than loopback TCP, even
They're essentially the same... except they're not and there's a boatload of work in the middle that doesn't need to be done for unix domain sockets.
I can imagine some edge-cases however, where differences in stacks and configuration (in/out buffers vs. shared buffers, buffer sizes, packet loss vs. reliable, locking, blocking vs. non-blocking writes) may lead to more optimal scheduling under concurrent workloads. As soon as you have more than one thread doing something, there's a boatload of moving parts to consider.
But if you have everything properly tuned, unix domain sockets should win every time. If your results said something else, you probably weren't benchmarking what you thought you were benchmarking.
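FWIW, this is roughly the kind of toy ping-pong I'd use to sanity-check that on a given box: the same 64-byte round trip over an AF_UNIX socket pair and over loopback TCP. It's a sketch, not a rigorous benchmark; results depend heavily on kernel version, CPU pinning, and buffer sizes, and partial reads are ignored for brevity.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 100000

    static double pingpong(int fd) {          /* parent side: ns per round trip */
        char buf[64] = {0};
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ITERS; i++) {
            write(fd, buf, sizeof buf);
            read(fd, buf, sizeof buf);
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / ITERS;
    }

    static void echo(int fd) {                /* child side: echo everything back */
        char buf[64];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) write(fd, buf, n);
    }

    static double run(int domain) {
        int sv[2];
        if (domain == AF_UNIX) {
            socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        } else {                              /* loopback TCP on an ephemeral port */
            int l = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in addr;
            socklen_t len = sizeof addr;
            memset(&addr, 0, sizeof addr);
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            bind(l, (struct sockaddr *)&addr, sizeof addr);
            listen(l, 1);
            getsockname(l, (struct sockaddr *)&addr, &len);
            sv[0] = socket(AF_INET, SOCK_STREAM, 0);
            connect(sv[0], (struct sockaddr *)&addr, sizeof addr);
            sv[1] = accept(l, NULL, NULL);
            int one = 1;
            setsockopt(sv[0], IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
            setsockopt(sv[1], IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
            close(l);
        }
        if (fork() == 0) { close(sv[0]); echo(sv[1]); _exit(0); }
        close(sv[1]);
        double ns = pingpong(sv[0]);
        close(sv[0]);                         /* echo child sees EOF and exits */
        wait(NULL);
        return ns;
    }

    int main(void) {
        printf("AF_UNIX : %.0f ns/rtt\n", run(AF_UNIX));
        printf("loopback: %.0f ns/rtt\n", run(AF_INET));
        return 0;
    }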
Shared memory of some kind will be fastest, though it sometimes requires re-architecting the application. Unfortunately, I can't find the Linux kernel mailing list response any more, but the kernel devs were not sympathetic to the use case of processes communicating over TCP on the same host.
I went down a similar path (Solarflare, Myricom, and Chelsio; Arista, BNT, and 10G Twinax) and found we could get lower latency internally by confining two OpenOnload-enabled processes that needed to communicate with each other to the same NUMA domain.
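The pinning itself can be as blunt as launching both processes under numactl --cpunodebind/--membind, or doing it in-process. A minimal in-process sketch, assuming (check your own topology via /sys/class/net/<if>/device/numa_node and /sys/devices/system/node/nodeN/cpulist) that the NIC and CPUs 0-7 share a node:

    /* Pin this process to a fixed set of cores that, on this particular box,
     * live on the same NUMA node as the NIC. The core list is an assumption;
     * use libnuma or numactl instead of hard-coding for anything real. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_nic_node(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 8; cpu++)   /* assumed: CPUs 0-7 are on the NIC's node */
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

Both communicating processes call this (or get launched with the equivalent numactl flags), so their shared traffic never crosses the inter-socket interconnect.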
We architected our applications around that as well. The firm continued on to the FPGA path though that was after my time there.
I do still pick up SolarFlare cards off of eBay for home lab purposes.
Yup, all the same techniques we used (minus Docker but with the addition of disabling Intel Turbo Boost and hyper threading).
A few years ago I met with a cloud computing company who was following similar techniques to reduce the noisy neighbor problem and increase performance consistency.
Frankly, it’s good to see that there still are bare metal performance specialists out there doing their thing.
That reminds me of when the OpenWrt and other open-source guys were complaining that the home gateways of the time did not have a big enough CPU to max out the uplink (10-100 Mbps at the time), and the vendors instead built in hardware accelerators. What they did not know was that the hw accelerator was merely an even smaller CPU running a proprietary network stack.
That doesn't sound quite right to me. The worst CPU bottlenecks I recall for consumer routers running OpenWRT were mostly when doing traffic shaping (to compensate for bufferbloat in ISP equipment), not a complete inability to forward traffic fast enough. And when line speed really wasn't achievable with software routing, the hardware deficiency that mattered most was probably memory bandwidth rather than CPU core performance.
The kind of offload most prevalent on consumer routers is NAT performed by the Ethernet switch ASIC, which I don't think can be fairly described as an even smaller CPU running a proprietary network stack—you might have a microcontroller-class CPU core in the switch chip, but there's a separation of control and data planes.
I've worked quite extensively with OpenOnload. There were significant numbers of "free wins" just from LD_PRELOAD, though anything non-trivial did require more configuration (mostly via env vars!). I've also written ef_vi applications for UDP/MC pub in a similar market space to the one you describe. Whilst a little quirky in places, you could write some incredibly fast stuff: on 2.x GHz Xeon CPUs you could send a useful datagram in 600-700 ns. For a lot of applications you could even leave that in your critical path and not care.
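For the "free win" side: the application code is just ordinary BSD sockets, e.g. a trivial multicast publisher like the sketch below (the group/port are made up), and the acceleration comes from launching it under the onload wrapper so LD_PRELOAD interposes on the socket calls. The ef_vi path is a separate, lower-level API and isn't shown here.

    /* Plain BSD-sockets multicast publisher -- nothing Onload-specific in
     * the code itself. Running it as `onload ./pub` can accelerate these
     * same calls transparently. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in grp;
        memset(&grp, 0, sizeof grp);
        grp.sin_family = AF_INET;
        grp.sin_port = htons(30001);                      /* example port  */
        inet_pton(AF_INET, "239.1.1.1", &grp.sin_addr);   /* example group */

        const char msg[] = "tick";
        for (int i = 0; i < 10; i++) {
            sendto(fd, msg, sizeof msg, 0, (struct sockaddr *)&grp, sizeof grp);
            usleep(1000);
        }
        close(fd);
        return 0;
    }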
I've not exhaustively characterized it, but RedHat has this comparison [1]. It is not enough for the scope/scale of my needs. I do still run bare metal workloads though.
That said, I have had operational issues in migrating to Docker, which is another sort of performance impact! I reference some of my core isolation issues in that Redis gist and this GitHub issue [2].
Since apparently no one is willing to read this excellent article, which even comes with fun sliders and charts...
> It turns out that Chrome actively throttles requests, including those to cached resources, to reduce I/O contention. This generally improves performance, but will mean that pages with a large number of cached resources will see a slower retrieval time for each resource.
I am a systems engineer. I read the article title, then started reading the article, and realized it was a bait and switch. It is not about "network" vs "cache" in computer systems terms, which is what you might expect. It is about "network" vs "the (usually antiquated) file-backed database your browser calls a cache." The former would have been a compelling article. The latter is kind of self-evident: the browser cache is there to save bandwidth, not to be faster.
I find it odd to call it a bait and switch when the first thing in the article is an outline.
> It is about "network" vs "the (usually antiquated) file-backed database your browser calls a cache."
It's actually nothing to do with the design of the cache itself as far as I can tell. If you finish reading the article you'll see that it's about a throttling behavior that interacts poorly with the common optimization advice for HTTP 1.1+, exposed by caching.
> The latter is kind of self-evident: the browser cache is there to save bandwidth, not to be faster.
I don't think that's something you can just state definitively. I suspect most people do in fact view the cache as an optimization for latency. Especially since right at the start of the article, the first sentence, the "race the cache" optimization is introduced - an optimization that is clearly for latency and not bandwidth purposes.
Memcached exists for two reasons: Popular languages that hit inflection points when in-memory caches exceeded a certain size, and network cards becoming lower latency than SATA hard drives.
The latter is a well known and documented phenomenon. The moment popular drives are consistently faster than network again, expect to see a bunch of people writing articles about using local or disk cache, recycling old arguments from two decades ago.
Agree with GP that this is getting pretty far off the article, but also--I think remote caches in datacenters are sometimes as much about making the cache shared as about the storage latency.
If you have a lot of boxes running your application, sometimes it's nice to be able to do an expensive thing once per expiry period across the cluster, not once per expiry per app server. Or sometimes you want a cheap simple way to invalidate or write a new value for the full cluster.
The throughput and latency of SSDs does make them a workable cache medium for some uses, with much lower cost/GB than RAM, but that can be independent of the local/remote choice.
When I have worked on distributed systems, there are often several layers of caches that have nothing to do with latency: the point of the cache is to reduce load on your backend. Often, these caches are designed with the principle that they should not hurt any metric (ie a well-designed cache in a distributed system should not have worse latency than the backend). This, in turn, improves average latency and systemic throughput, and lets you serve more QPS for less money.
CPU caches are such a common thing to think about now that we have associated the word "cache" with latency improvements, since that is one of the most obvious benefits of CPU caches. It is not a required feature of caches in general, though.
The browser cache was built for a time when bandwidth was expensive, often paid per byte, the WAN had high latency, and disk was cheap (but slow). I don't know exactly when the browser cache was invented, but it was exempted from the DMCA in 1998. Today, bandwidth is cheap and internet latency is a lot lower than it used to be. From first principles, it makes sense that the browser cache, designed to save you bandwidth, does not help your website's latency.
Edit: In light of the changes in the characteristics of computers and the web, this article seems to mainly be an argument for disabling caching on high-bandwidth links on the browser side, rather than suggesting "performance optimizations" that might silently cost your customers on low-bandwidth links money.
Cache is "slow" because the number of ongoing requests-- including to cache! --are throttled by the browser. I.e the cache isn't slow, but reading it is waited upon and the network request might be ahead in the queue.
The first three paragraphs you wrote are all very easily shown to be irrelevant. Again, the very first sentence of the post shows that the cache is designed to improve latency.
And as the article points out, the cache is in fact lower latency and does improve latency for a single request. The issue is with throttling.
The article is absolutely not suggesting disabling caching, nor does any information in the article indicate that that would be a good idea. Even with the issue pointed out by the article the cache is still lower latency for the majority of requests.
The title doesn't necessarily imply it's general. For one, the first word is "When". If the title was instead "Network is faster than browser cache" you'd have more of a point.
> "network" vs "the (usually antiquated) file-backed database your browser calls a cache."
Firefox uses SQLite for its on-disk cache backend. Not the latest buzzword, but not exactly antiquated. I expect the cache backend in Chrome to be at least as fast.
> cache is there to save bandwidth, not to be faster
In most cases a cache saves bandwidth and reduces page load time at the same time. Internet connection which is faster than a local SSD/HDD is a rare case.
If you live in a rich country and are on fiber internet rather than a cable modem (or if your cable modem is new enough), you likely have better latency to your nearest CDN than you do to the average byte on an HDD in your computer. An SSD will still win, though.
The browser cache is kind of like a database and also tends to hold a lot of cold files, so it may take multiple seeks to retrieve one of them. Apparently it has throughput problems too, thanks to some bottlenecks that are created by the abstractions involved.
I can understand throttling network requests, but disk requests? The only reason to do that would be for power savings (you don't want to push the CPU into a higher state as it loads up the data).
I imagine it's just that the cache is in an awkward place where the throttling logic is unaware of the cache and the cache is not able to preempt the throttling logic.
Depends on whether you value latency for the user. Saying "I try both and choose whichever comes back first" hurts no one but the servers that are no longer being shielded by a client cache. But there's absolutely no reason to assume a client has a cache that masks requests. AFAIK there's no standard that says clients use caches for parsimony rather than exclusively for latency. As a matter of fact, I think this is a good idea whenever it takes time to consult the cache, and the trade-off is more bandwidth consumption, which we are awash in.
If you care that much, run a caching proxy and use that, and you'll get the same effect of a cache masking requests. But I would say it's superior, because it always uses the local cache first and doesn't waste user time on the edge conditions of cache coherency. (That problem comes from Netscape, which famously convinced everyone it's one of the hardest problems.) That leads to the final benefit: the browser's cache doesn't have to cohere. If it's too expensive at that moment to cohere and query, then I can use the network source.
Again, the only downside is that network bandwidth is more consistently used. I would be hard pressed to believe most Firefox users aren't already grossly over-provisioned on bandwidth, and no one would even notice the fraction of a cable line a web browser uses by also hitting the network for things it has cached.
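For concreteness, "try both and take the first answer" is just a race between two lookups. A toy sketch with stand-in delays (a browser would be racing a disk-cache read against an HTTP request; the sleeps below are made up):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static const char *winner = NULL;

    static void finish(const char *who) {       /* first caller wins, the rest are ignored */
        pthread_mutex_lock(&mu);
        if (!winner) { winner = who; pthread_cond_signal(&cv); }
        pthread_mutex_unlock(&mu);
    }

    static void *from_cache(void *arg)   { (void)arg; usleep(5000);  finish("cache");   return NULL; }
    static void *from_network(void *arg) { (void)arg; usleep(12000); finish("network"); return NULL; }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, from_cache, NULL);
        pthread_create(&b, NULL, from_network, NULL);
        pthread_mutex_lock(&mu);
        while (!winner) pthread_cond_wait(&cv, &mu);   /* wake on the first result */
        printf("served from: %s\n", winner);
        pthread_mutex_unlock(&mu);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }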
I do. Which is why it's silly to throttle the cache IO.
The read latency for disks is measured in microseconds. Is it possible for the server to be able to respond faster? Sure. However, if you aren't within 20 miles of the server then I can't see how it could be faster (speed of light and everything).
These design considerations will depend greatly on where you are. It MIGHT be the case that eschewing a client cache for server-to-server traffic is the right move, because your servers are likely physically close together. That would mean the server can do a better job of making multiple clients' requests faster through caching, saving the memory/disk space required on the clients.
There is also the power consideration. It take a lot more power for a cell phone to handle a network request than it does to handle a disk request. Shouting into the ether isn't cheap.
I think you're misunderstanding the context Firefox works in. Have you ever sat down at someone's choking machine, wondered how on earth they put up with it, and then seen it hitting blocking I/O because something misbehaving in the background is drowning out the disk's ability to service work? Browsers don't run in a well-controlled server farm, and even there you can expect to see drives behaving poorly and blocking on I/O, screwing up cache performance.
In well-behaved environments the cache works fine, but in degraded environments racing the network effectively gives you a 0s timeout on the cache: you might get no value from the local cache at all, but the dual attempt minimizes the "decision time" by, instead of waiting for a timeout and then requesting, requesting both and going with whatever comes back first. Even on a well-behaved, well-managed host, stuff can wake up and start indexing because it thinks it's a good quiet moment, and for short periods you'll see the cache slow down. Even on a normal, well-tuned multiprocess machine you'll sometimes see the cache slower than the network.
The cost is that the cache isn't masking the remote infrastructure's true demand. But there's no spec that says it must, AFAIK. I think the most interesting angle on this is that the infrastructure operators serving the data are paying for this latency optimization. Is that right? My view is that this is the concerning thing: if widely adopted, it could materially increase the cost to service providers, some of whom operate on thin or negative margins, or with no income at all.
What if it’s a virus scan or some background indexer browning out io resources on the disk path, or given it’s a consumer browser, some malware eating up resources unbeknownst to the user of the browser. Surely you’ve experienced noisy neighbors :-)
It's an interesting approach because it drops the assumption that browser caches exist to mask request volume for the infrastructure's sake rather than for user latency. I think you can overwhelm costs for low- or no-income services. If this is widely adopted, we are asking the people running the servers to pay for that latency improvement.
They were already partitioned by the requested eTLD+1, because they were already partitioned by the requested URL, which contains the requested eTLD+1. Chrome's change was additionally partitioning it by 2 more eTLD+1s: the top-level eTLD+1 and the frame eTLD+1. It looks like Firefox's change was to add 1 eTLD+1 to the original behavior: the top-level eTLD+1.
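In other words, the key shapes look roughly like this (the struct and field names are mine, purely illustrative, not Chromium's or Firefox's actual internals):

    /* Old behavior:  key = { resource URL }  (which already contains the
     *                requested eTLD+1)
     * Firefox:       key = { top-level eTLD+1, resource URL }
     * Chrome:        key = { top-level eTLD+1, frame eTLD+1, resource URL } */
    struct cache_key {
        const char *top_level_site;   /* eTLD+1 of the tab's top-level page  */
        const char *frame_site;       /* eTLD+1 of the embedding frame       */
        const char *resource_url;     /* full URL of the requested resource  */
    };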
> This seemed odd to me, surely using a cached response would always be faster than making the whole request again! Well it turns out that in some cases, the network is faster than the cache.
Did I miss a follow up on this, or did it remain unanswered as to what the benefit of racing against the network is?
The post basically says that sometimes the cache is slower because of throttling or bugs, but mostly bugs.
Why is Firefox sending an extra request instead of figuring out what is slowing down the cache? It seems like an overly expensive mitigation…
The only reason I can think of is a slow mechanical hard drive vs. a fast network.
Or if it's based on an ETag and it has to hash the local resource, but it would be smart to cache that hash too.
Or maybe the OS has too many file handles open and it's doing a poor job of reading many files at once, whereas the network offers a second tube to funnel data through...
Well, yeah. Disk cache can take hundreds of ms to retrieve, even on modern SSDs. I had a handful of oddly heated discussions with an architect about this exact thing at my previous job. Showing him the network tab did not help, because he had read articles and was well informed about these things.
At a previous job I worked on serving data from SSDs. I wasn't really involved in configuring the hardware but I believe they were good quality enterprise-grade SSDs. My experience was that a random read (which could be a small number of underlying reads) from mmap()'ed files from those SSDs took between 100 and 200 microseconds. That's far from your figure of hundreds of milliseconds.
Of course 200 microseconds still isn't fast. That translates to serving 5000 requests per second, leaving the CPU almost completely idle.
Another odd fact was that we in fact did have to implement our own semaphores and throttling to limit concurrent reads from SSDs.
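That kind of self-throttling can be as simple as a counting semaphore around the read path. A sketch of what I mean (the in-flight limit of 16 is an assumption; it has to be tuned for the actual drive):

    #include <semaphore.h>
    #include <unistd.h>

    #define MAX_INFLIGHT_READS 16
    static sem_t read_slots;

    void reads_init(void) {
        sem_init(&read_slots, 0, MAX_INFLIGHT_READS);   /* 0 = shared between threads */
    }

    ssize_t throttled_pread(int fd, void *buf, size_t len, off_t off) {
        sem_wait(&read_slots);                 /* block if too many reads are in flight */
        ssize_t n = pread(fd, buf, len, off);
        sem_post(&read_slots);
        return n;
    }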
(Anyone who is not running their OS and temp space on NVMe should not expect good performance. Such a configuration has been very cheap for several years now.)
> Such a configuration has been very cheap for several years now.
This is a very weird comment, considering that a) it's cheaper than yesteryear, but SATA SSDs (or even modern magnetic HDDs) are still sold and in active use, and b) it ignores phones completely, where a large number of sites would have mobile-dominated visitors who can't just switch to NVMe-like performance even if they have large disposable incomes (because at the end of the day, even with UFS, phones are still slower than NVMe latency-wise).
The issue has nothing to do with disk speed. If you had read the article you'd see a very nice chart that shows the vast majority of cache hits returning in under 2 or 3ms.
I wish I had a clearer memory or record of this, but I think I've also seen ~100ms for browser cache retrieval on an SSD. Has anyone else observed this and have an explanation? A sibling comment points out that SSD read latency should be ~10ms at most, so the limitation must be in the software?
OP mentioned specifically that “there have been bugs in Chromium with request prioritisation, where cached resources were delayed while the browser fetched higher priority requests over the network” and that “Chrome actively throttles requests, including those to cached resources, to reduce I/O contention”. I wonder if there are also other limitations with how browsers retrieve from cache.
> Concatenating / bundling your assets is probably still a good practice, even on H/2 connections. Obviously this comes on balance with cache eviction costs and splitting pre- and post-load bundles.
I guess this latter part refers to the trade-off between compiling all your assets into a single file, and then requiring clients to re-download the entire bundle if you change a single CSS color. The other extreme is to not bundle anything (which, I gather from the article, is the standard practice since all major browsers support HTTP/2) but this leads to the described issue.
What about aggressively bundling, but also keeping track at compile time of diffs between historical bundles and the new bundle? Re-connecting clients could grab a manifest that names the newest mega-bundle as well as a map from historical versions to the patches needed to bring them up to date. A lot more work on the server side but maybe it could be a good compromise?
Of course that's the easy version, but it has a huge flaw which is all first-time clients have to download the entire huge mega bundle before the browser can render anything, so to make it workable it would have to compile things into a few bootstrap stages instead of a single mega-bundle.
I am clearly not a frontend dev. If you're going to throw tomatoes please also tell me why ;)
* edit: found the repo that made me think of this idea, https://github.com/msolo/msolo/blob/master/vfl/vfl.py but it's from antiquity and probably predates node and babel/webpack, but the idea is you name individual asset versions with a SHA or tag and let them be cached forever, and to update the web app you just change a reference in the root to make clients download a different resource dependency tree root (and they re-use unchanged ones) *
There's probably some balance here. Since there's a limit of 9 concurrent requests before throttling occurs, you can create 9 buckets and concatenate objects into each. So if you have a bunch of static content, concat that into one bucket. If you have another object that changes a lot, keep that separate. If you have two other objects that change together, bucket those, etc.
Seems like a huge pain to think about tbh. Seems like part of the problem would be solved by compiling everything into a single file that supported streaming execution.
There's a good middle ground of bundling your SPA into chunks of related files (I prefer to name them the SHA hash of the content), and giving them good cache lifetimes.
You can have a "vendor" chunk (or a few) which just holds all 3rd party dependencies, a "core components" chunk which holds components which are likely used on most pages, and then individual chunks for the rest of the app broken down by page or something.
It speeds up compilation, gives better caching, no need for a stateful mapping file outside of the HTML file loaded (which is kind of the point of the <link> tag anyway!), and has lots of knobs to tune if you want to really squeeze out the best load times.
Another tool that bears on this idea is courgette [0], in that it operates on one of the intermediate representations of the final bundle in order to achieve better compression.
When I first saw this, I was confused why Linux (and Ubuntu and Debian) performed so poorly. But then I asked myself: is the data even representative? Or does Linux look worse because users on Linux tend to stick to lower-end hardware, since the measurements were, AFAIR, simply collected from website visitors rather than taken on the same hardware with different operating systems installed?
I don't know, can someone explain the discrepancy?
Is the raw data available? What is the number of requests that were included in the measurements for Debian (that doesn't hit cache in the first 5 ms)?
I wonder if browsers could design a heuristic to cache multiple items in one cache entry -- that is, instead of the website doing file concatenation, the browser dynamically decides to concatenate some resources in order to reduce the number of asks to the local cache. For example, the browser would know on the initial load which requests get made together (such as three js files or four images), so they could get batched into a single cache entry
So is the takeaway that data in the RAM of some server connected by fast network is sometimes "closer" in retrieval time than that same data on a local SSD?
Back in ~2003 I had bought a new motherboard + cpu (a Duron 800MHz IIRC) but as a poor college kid, only had enough money left over for 128MB of RAM.. but the system I was replacing had ~768MB. I made a ~640MB ramdisk on the old system and mounted it on the new system as a network block device, and the result was much, much faster than local swap (this was before consumer SSDs though).
Now I'm imagining a rack of Raspberry Pis acting as one giant RAM swap drive over nbd. This could work, for a given value of "work": cost of a Pi vs. a stick of RAM. A KV store as well, perhaps.
Then again, what's a TB worth on just one Xeon server? Probably cheaper... or not?
When you make the request, the server will have to look up the image from its own "cache" before sending it back to you. The client's cache would have to be not only slower than its ping, but slower than its ping + the server's "cache."
Tl;dr: cache is "slow" because the number of ongoing requests-- including to cache! --are throttled by the browser. I.e the cache isn't slow, but reading it is waited upon and the network request might be ahead in the queue.
Then that implementation doesn't even save bandwidth (for client nor server), which is one of the purposes of having a cache in the first place (together with other reasons).
For me this is very noticeable whenever I open a new Chrome tab. It takes 3+ seconds for the icons of the recently visited sites to appear, whatever cache is used for the favicons is extremely slow. Thankfully the disk cache for other resources runs at a more normal speed.