I've been writing micro memory benchmarks and have been rather surprised at how hard something as simple as quantifying latency and bandwidth under multicore loads can be. The memory hierarchy is getting ever more complex: cacheline sizes, prefetch, 3 levels of cache, TLB effects, page alignments, cache associativity, etc. You also have to be careful that the compiler doesn't optimize away parts of your code. It's quite tricky to get a nice clean array size vs latency graph, doubly so when multiple cores are involved.
Some of my assumptions about latency were wrong. One thing I didn't realize is that it takes about half the latency of a main-memory access just for a miss to work its way through L1, L2, and L3. Another is that you need around 2x the memory references pending to keep the memory system busy. It makes sense in retrospect: you want 16 pending memory references to keep 8 memory channels busy, otherwise a memory channel will return a cache line and there won't be any L3 cache misses pending for that channel.
Generally I like to keep a small histogram of cycle counter readings to make sure I'm seeing the distribution I expect; seeing an unusual distribution is key for tracking down something you didn't account for.
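Something like the following sketch, in Python just to show the bookkeeping (interpreter overhead swamps cache-level latencies, so a real benchmark would do the timed step and the cycle counter read in C or asm); the log2 bucketing is the part that matters:

    import random, time
    from collections import Counter

    def log2_bucket(ns: int) -> int:
        # Bucket a nanosecond timing by its power of two (..-2ns, 2-4ns, 4-8ns, ...).
        return max(ns, 1).bit_length()

    # Hypothetical workload: one step of a dependent pointer chase through a
    # random permutation.  It only exists here to feed the histogram.
    N = 1 << 16
    perm = list(range(N))
    random.shuffle(perm)

    hist = Counter()
    idx = 0
    for _ in range(200_000):
        t0 = time.perf_counter_ns()
        idx = perm[idx]                    # the "memory access" being timed
        dt = time.perf_counter_ns() - t0
        hist[log2_bucket(dt)] += 1

    for b in sorted(hist):
        print(f"<= {1 << b:>6} ns : {hist[b]:>7}")

A single clean peak is what I hope for; a second hump usually points at TLB misses, page faults, preemption, or something else I didn't account for.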
I want a viewport onto a parallel universe where we embraced NUMA more thoroughly, and instead of making processors with ever growing layers of transparent caching, we just put 16MB of working memory on each chip. Partitionable for concurrent workloads.
Maybe it could never have been then, but maybe it can be now, or soon.
For instance a borrow checker might be a very good way to help decide how and when to move working memory between processors and main memory. But what do you do with all of the existing C code? I think for backward compatibility you'd need to implement virtual memory at the working memory layer. It wouldn't be the fastest code, but it would work.
Well the tools are there, and the hardware is there as well.
Early AMD Epycs had multiple chiplets, each with its own memory controller. However, that means that even on a single socket different chiplets would see different memory latencies, and some apps/benchmarks people cared about handled that poorly.
The newest AMD Epycs have a unified memory controller per socket, so all chiplets see the same memory latency. But Intel has since entered the chiplet game with one memory controller per chiplet, so the same issue is back.
There's plenty of NUMA support in the kernel and various language runtimes. Apps and OSs can pin processes to cores, allocate local RAM, migrate non-local RAM to local, or migrate processes to cores that are local to their memory. Generally the automatic stuff works reasonably well, but numactl and the various NUMA-related calls let you override those decisions as needed.
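As a minimal sketch of the "pin first, then allocate" pattern that Linux's default first-touch policy rewards (Linux-only; the CPU set for node 0 is an assumption, check lscpu or /sys/devices/system/node for the real topology):

    import os

    NODE0_CPUS = {0, 1, 2, 3}            # assumed CPUs belonging to NUMA node 0

    os.sched_setaffinity(0, NODE0_CPUS)  # pin this process to node 0's cores

    # Allocate and touch the working set *after* pinning, so first-touch
    # placement puts the pages in node 0's local memory.
    working_set = bytearray(256 * 1024 * 1024)
    for i in range(0, len(working_set), 4096):   # touch one byte per page
        working_set[i] = 1

The same thing from the shell is roughly numactl --cpunodebind=0 --membind=0 ./bench, which forces the allocation side explicitly rather than relying on first-touch.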
I missed the 16 bit era by dragging my feet and then leaping onto 32 bit as soon as it was viable.
64k of memory was not really enough to do useful things, and it occupied so much of your design. But 4GB is a pretty big leap from 64k. There are lots of interesting things a person can do with 10, 30, 100MB of memory, without having to treat every piece of data like it sits in a perfectly flat data plane. And with 64 bit addressing you have ample, ample space to use some bits as metadata that lets you know things about the memory without dereferencing it first.
Heh, I was running DOS and DESQview; Windows 3 was either out or coming soon. I jumped into Linux with a 386SX-16, not even an FPU. I think it had 4MB of RAM, eventually upgraded to 16MB.
Your ideas remind me of the Sun MAJC CPU. It was claimed to be designed to run Java well: aware of Java objects, enabling prefetch of objects, speculative execution without side effects, and other pointer magic that came from assuming JVM+Java instead of C/C++. Sounded really promising; sadly it died shortly afterwards. An expert I talked with later mentioned that it was all marketing and it was just a normal CPU for its time. Sad. It does seem like today's new languages could result in a CPU with significantly more freedom for speculation, optimization, and efficiency if it assumed a particular language runtime.
Probably not a coincidence. It was memory of some Sun product data sheets that got me ruminating on this line of thought, though I don't think it was MAJC. 4MB per core, 4 cores per daughter card, if memory serves. Those product lines were probably a big reason why Java had to solve the NUMA memory problem, and then other languages just copied what they did.
And praying that the slow memory is magically in the fast memory.
I think ultimately what I'm asking for means a much, much tighter bound on worst-case performance, but at the cost of best-case performance that most of us never see anyway. For certain workloads, that could end up being a net positive. And there's probably some way to expose CPU metadata that gives some of that difference back.
I think it’s worthwhile to distinguish between throughput and latency for these sorts of discussions, rather than just talking about performance, since scratchpads are usually better for latency (even best-case latency) and caches are usually better for throughput. Though of course in this as in any sort of discussion of computer performance, caveats abound.
We kinda had that world with the PS3 and SPE/SPUs[1]. Stuff written for them tended to run faster on more traditional hardware, since you needed to vectorize all of your loads (which made the prefetcher on other hardware really happy), but it was an absolute beast to port any code to. I remember stuff from PC/X360 just running like absolute garbage (not that multicore was well handled anyway during that era) until significant refactoring was done, up to and even including full rewrites in some cases.
> we just put 16MB of working memory on each chip.
Just? The chips we've got have between 32KB and 64KB -- maybe 128KB in some really big chips -- and it's been that way for forty years! Everything else has to swap or cache through messaging layers (L2, L3) with real latency, because we can't "just" put a bunch of memory on a chip.
I mean, I like thinking about star-trek computers too, but "just" isn't the word I would use for something like this... What would you possibly do with so much memory in a circuit?
> But what do you do with all of the existing C code? I think for backward compatibility you'd need to implement virtual memory at the working memory layer.
People rewrote almost everything in Java. And are doing it again with Rust. Programmers want to program, and if they can't come up with anything else to do, they'll just rewrite what someone else did. I don't think you should value existing C code the way its owners do, but by how their competition will value it; if someone can use FutureLang to outperform their business competition who uses C, this problem will solve itself.
This is basically how tile renderers work in mobile GPUs. Supercomputers also have a similar design with MPI and torus networks, although on a multi-machine scale.
Partitioning memory and cores is easy enough, and a bunch of Xeons have cache partitioning. It's still generally transparent to the program, but I don't think that makes it less NUMA.
Statistical fallacies are rampant in performance evaluation, even in academic settings. When designing statistical tests for performance, the keyword you want here is non-parametric. E.g., a U-test is a non-parametric analog to the t-test: it just looks at the rank statistics of the results instead of their values, thus eliminating dependence on the underlying distribution.
Another issue that pops up is sample independence. Statistical tests are often predicated on the samples being independent and identically distributed (i.i.d.), but in reality this is often not the case. For instance, running all the tests of one group and then all the tests of the other could heat the CPU and reduce performance in the second batch.
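A sketch of both points, with hypothetical run_variant_a/run_variant_b stand-ins for whatever is actually being measured: shuffle the trial order so thermal drift doesn't land entirely on one group, then compare with a U-test instead of a t-test:

    import random, time
    from scipy.stats import mannwhitneyu

    def run_variant_a():
        sum(range(10_000))     # placeholder workload

    def run_variant_b():
        sum(range(11_000))     # placeholder workload

    trials = [("a", run_variant_a), ("b", run_variant_b)] * 50
    random.shuffle(trials)     # break the A,A,A,...,B,B,B ordering

    timings = {"a": [], "b": []}
    for label, fn in trials:
        t0 = time.perf_counter()
        fn()
        timings[label].append(time.perf_counter() - t0)

    # Rank-based comparison: no assumption about the shape of the timing distributions.
    stat, p = mannwhitneyu(timings["a"], timings["b"], alternative="two-sided")
    print(f"U={stat:.0f}, p={p:.3g}")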
We use non-parametric statistics for performance testing in ClickHouse[1], picked from the article "A Randomized Design Used in the Comparison of Standard and Modified Fertilizer Mixtures for Tomato Plants".
You might want to check out the gamma distribution. It is also zero-bounded, just like the log-normal distribution, but it was originally created to model waiting times in queueing theory, which is actually an excellent parallel to measuring compute latency.
Gamma and delta distributions have been very helpful to me in performance work, as have non-parametric statistical tests. However, when you try to tell a lot of other engineers about them, they don't really understand why a t-test and a standard deviation don't work.
In the general case? Because the exponential distribution is directly the result of a Poisson process, and Poisson processes are also naturally ubiquitous. A Poisson process results in Poisson-distributed counts of events over fixed periods of time, and exponentially distributed times between events. They are used to model everything from rainfall to particle physics, and those concepts do fit very neatly with things that are obviously more complicated, like human dynamics.
In the case of benchmarking memory-bound code? Because memory accesses in NUMA architectures with tiered caches are approximately exponential: L1 vs L2 vs L3 vs main memory vs swap are all exponential increases in latency. It's obviously more of a step function than a continuous one, but the relationship is definitely exponential.
The pages for the exponential distribution and poisson distribution are also informative and interesting.
When I go through the process of finding an appropriate distribution, I usually first look through Wikipedia to see if I can find conceptual parallels, just like this. That doesn't mean I settle on it... I'll usually test the distribution fit with functions like those found in fitdistrplus, as well as check that the residuals are roughly normally distributed.
One hard and fast rule that I use, though, is that I never use an unbounded distribution to model a bounded process. For compute times, you can't have negative latencies, and you can't have zero latencies, but latencies can be arbitrarily large. Therefore, I would only consider using distributions on (0, infinity). The log-normal distribution fits this and is probably adequate here, but the gamma distribution might be a smidge better.
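For example, with scipy (the latencies here are synthetic stand-ins; substitute real measurements, and fixing loc=0 keeps both candidate fits on (0, infinity)):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    latencies = rng.gamma(shape=2.0, scale=50.0, size=5_000)   # fake data for the demo

    for name, dist in [("lognorm", stats.lognorm), ("gamma", stats.gamma)]:
        params = dist.fit(latencies, floc=0)
        ks = stats.kstest(latencies, name, args=params)
        print(f"{name:>8}: KS={ks.statistic:.4f}, p={ks.pvalue:.3g}")

Whichever candidate has the smaller KS statistic (and isn't rejected outright) is the one I'd carry forward, subject to the residual checks above.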
What's extra fun is you can run into a distribution that doesn't have a mean.
A concrete example of this: Exponential backoff where each attempt has a 1/2 probability of succeeding independently of other attempts, and you double the time between each attempt.
If you try to benchmark this process, you'll find that no matter how big your sample size is you'll never get consistent results.
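A quick simulation makes the point: the expected total wait is the sum over k of (1/2)^k * (2^(k-1) - 1), which diverges, so the running sample mean never settles no matter how many trials you take.

    import random

    def backoff_total_wait() -> int:
        # Attempt, fail with probability 1/2, double the wait, repeat.
        wait, total = 1, 0
        while random.random() >= 0.5:
            total += wait
            wait *= 2
        return total

    for n in (10_000, 100_000, 1_000_000):
        mean = sum(backoff_total_wait() for _ in range(n)) / n
        print(f"n={n:>9,}: sample mean = {mean:,.1f}")   # keeps jumping around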
One thing I discovered benching Oracle instances is that cloud performance is extremely non-deterministic. Sometimes you get variance from run to run. Sometimes you get consistent results, but then you bench it later and the results are still consistent, just totally different.
Then I would run the same bench on my laptop (with turbo disabled) and it's pretty much always consistent.
I can only guess why... turbo on the host cpu? Other transient tenants competing for the same cache? Something to do with VMs? Also I am not sure if AWS has this issue.
Yeah. Hence I'm saying an AWS c6i is maybe the wrong platform for benchmarks...
Bare metal is OK if you must test NUMA and such, but you are still at the mercy of turbo behavior, whereas you can completely control that on a local machine.
I mean it's physically impossible for the generating process of positive timings to be normally distributed. The normal distribution has support x ∈ ℝ.
However, normal distributions actually behave benignly and are a prerequisite for some statistical tests.
This is where my first comment comes in: for my benchmarking, statistical tests are not necessary at all ...
---
The author writes that he only measures minimum and average, but the maximum values are of course also important and worth measuring.
The author's key point is that without a normal distribution, the maximum values can be highly scattered. I agree, and basically it is always nice to know the distribution of the random variable.
By definition timing benchmarks cannot be Gaussian, since the time cannot be negative. For anything bounded on one side like that, log-normal is a safer default choice, and use of the normal distribution must be justified.
> The typical assumption is that we get a normal distribution, and so we should therefore report the average time.
That "assumption" is asking a lot and is not justified. Actually, the main source of normal distributions is from the central limit theorem where get the distribution from adding lots of random variables.
> ... therefore report the average time
There are good reasons to consider averages for any distribution, "normal" or not.
Of course, as in a classic paper by Halmos and Savage, there is the topic of sufficient statistics, which for the normal distribution is a bit amazing. Gee, maybe the OP (original post) was thinking about sufficient statistics: for the normal distribution, the sufficient statistics are the pair of sample average and sample standard deviation.
> If you have normal distributions, you can mathematically increase the accuracy of the measure by taking more samples.
This is justified by the law of large numbers, both the strong and weak versions, where we need not assume a normal distribution. Texts where the law of large numbers is proven in great detail and generality are by Loève, Chung, Neveu, and Breiman, among others.
> It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean.
Sorry, with the normal distribution there is positive probability of samples, positive or negative, with finite absolute value as large as you please.
I feel obligated to point out that, by the central limit theorem, it doesn't matter if the timings are normally distributed. The measurements only need to be independent and identically distributed with finite variance for the estimate of the mean to be approximately normally distributed.
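A quick numeric check of that claim (log-normal "timings" are just a stand-in here): the individual samples are heavily skewed, but the distribution of their sample means is close to symmetric, which is what the CLT promises.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    raw = rng.lognormal(mean=0.0, sigma=1.0, size=200 * 2_000)   # skewed "timings"
    sample_means = raw.reshape(2_000, 200).mean(axis=1)          # 2000 means of 200 samples each

    print(f"skew of individual timings: {stats.skew(raw):.2f}")          # large
    print(f"skew of 200-sample means:   {stats.skew(sample_means):.2f}") # much closer to 0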
I think he's saying that since the distribution follows a log-normal, the mean is harder to pin down than if it were a normal, due to the higher variance. From the CLT, the variance of the mean statistic is var(X)/N (i.e., the standard error is sd(X)/sqrt(N)) -- and var(X) is higher than he'd expect from an assumed normal X.
The other point he makes is that the mean is not very representative in this wide log-normal scenario, and optimizing for the difference between min and mean is more relevant.
My 2 cents.
Also to nitpick on the article:
"It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean."
I think we're nitpicking too much here, but the main takeaway seems to be to not assume normals and to actually measure things. Seems pretty straightforward advice, but if you've worked anywhere, you'll know how often people content themselves with reporting averages.