Are your memory-bound benchmarking timings normally distributed? (lemire.me)
94 points by mfiguiere on April 6, 2023 | 46 comments


I've been writing micro memory benchmarks and have been rather surprised by how hard something as simple as quantifying latency and bandwidth under multicore loads can be. The memory hierarchy is getting ever more complex: cacheline sizes, prefetch, 3 levels of cache, TLB effects, page alignments, cache associativity, etc. You also have to be careful that the compiler doesn't optimize away parts of your code. It's quite tricky to get a nice clean array size vs latency graph, doubly so when multiple cores are involved.

Some of my assumptions about latency were wrong. One thing I didn't realize is that missing through L1, L2, and L3 accounts for about half the total latency to main memory. Also that you need around 2x the memory references pending to keep the memory system busy. It makes sense in retrospect: you want 16 pending memory references to keep 8 memory channels busy, otherwise a memory channel will return a cache line and there won't be any L3 cache misses pending for that channel.

Generally I like to keep a small histogram of cycle counters to make sure I'm seeing the distribution I expect; seeing an unusual distribution is key for tracking down something you didn't account for.
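Something like this minimal Python sketch gets at the same idea (perf_counter_ns standing in for raw cycle counters; the workload and the power-of-two buckets are just assumptions for illustration):

    import time
    from collections import Counter

    def sample_latencies(workload, n_samples=10_000):
        """Time a workload repeatedly and bucket results into a coarse
        power-of-two histogram so outliers and odd shapes stand out."""
        histogram = Counter()
        samples = []
        for _ in range(n_samples):
            start = time.perf_counter_ns()
            workload()
            elapsed = max(time.perf_counter_ns() - start, 1)
            samples.append(elapsed)
            histogram[elapsed.bit_length()] += 1   # bucket ~ floor(log2(ns))
        return samples, histogram

    if __name__ == "__main__":
        data = list(range(1 << 16))
        samples, hist = sample_latencies(lambda: sum(data))
        for bucket in sorted(hist):
            print(f"~{1 << (bucket - 1):>9} ns: {hist[bucket]}")
        print(f"min={min(samples)}  avg={sum(samples)/len(samples):.0f}  max={max(samples)} ns")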


I want a viewport onto a parallel universe where we embraced NUMA more thoroughly, and instead of making processors with ever growing layers of transparent caching, we just put 16MB of working memory on each chip. Partitionable for concurrent workloads.

Maybe it could never have been then, but maybe it can be now, or soon.

For instance a borrow checker might be a very good way to help decide how and when to move working memory between processors and main memory. But what do you do with all of the existing C code? I think for backward compatibility you'd need to implement virtual memory at the working memory layer. It wouldn't be the fastest code, but it would work.


Well the tools are there, and the hardware is there as well.

Early AMD Epycs had multiple chiplets, each with its own memory controller. However, that meant that even on a single socket, different chiplets would see different memory latencies, and some apps/benchmarks people cared about handled that poorly.

The newest AMD Epycs have a unified memory controller per socket, so all chiplets see the same memory latency. But Intel has now entered the chiplet game, with one memory controller per chiplet again.

There's plenty of support for NUMA in the kernel and various language runtimes. Apps and OSs can pin processes to cores, allocate local RAM, and migrate non-local RAM (or the processes themselves) so that memory is local again. Generally the automatic stuff works reasonably well, but numactl and the various NUMA-related calls let you override those decisions as needed.
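As a rough Linux-only sketch of the pinning half of that in Python (memory binding proper still needs numactl or libnuma; the sysfs path and the node number here are assumptions about the machine):

    import os

    def cpus_of_node(node: int) -> set[int]:
        """Parse /sys/devices/system/node/nodeN/cpulist (e.g. "0-7,64-71")
        into a set of CPU ids. Linux sysfs only."""
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            cpus = set()
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
            return cpus

    if __name__ == "__main__":
        # Pin this process to node 0's cores; first-touch allocations made
        # afterwards will then tend to land in node-local memory.
        os.sched_setaffinity(0, cpus_of_node(0))
        print("running on CPUs:", sorted(os.sched_getaffinity(0)))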


I missed the 16 bit era by dragging my feet and then leaping onto 32 bit as soon as it was viable.

64k of memory was not really enough to do useful things, and it occupied so much of your design. But 4GB is a pretty big leap from 64k. There are lots of interesting things a person can do with 10, 30, 100MB of memory, without having to treat every piece of data like it sits in a perfectly flat data plane. And with 64-bit addressing you have ample, ample space to use some bits as metadata that lets you know things about the memory without dereferencing it first.
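As a toy illustration of that tagging idea (pure Python bit-twiddling on a hypothetical 48-bit address space; the "NUMA node" meaning of the tag is made up):

    ADDR_BITS = 48                      # canonical x86-64 address width
    ADDR_MASK = (1 << ADDR_BITS) - 1
    TAG_BITS = 64 - ADDR_BITS           # 16 spare high bits for metadata

    def tag(addr: int, meta: int) -> int:
        """Pack a small metadata tag into the unused high bits of an address."""
        assert 0 <= meta < (1 << TAG_BITS)
        return (meta << ADDR_BITS) | (addr & ADDR_MASK)

    def untag(tagged: int) -> tuple[int, int]:
        """Recover (address, metadata) without dereferencing anything."""
        return tagged & ADDR_MASK, tagged >> ADDR_BITS

    p = tag(0x7f00_dead_beef, meta=5)   # e.g. "this lives on NUMA node 5"
    print(hex(p), untag(p))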


Heh, I was running DOS and DESQview; Windows 3 was out or coming soon. I jumped into Linux with a 386SX-16, not even an FPU. I think it had 4MB of RAM, eventually upgraded to 16MB.

Your ideas remind me of the Sun MAJC CPU. It was claimed to be designed to run Java well, to be aware of Java objects, enabling prefetch of objects, speculative execution without side effects, and other pointer magic related to assuming JVM+Java instead of C/C++. Sounded really promising; sadly it died shortly afterwards. An expert I talked with later mentioned that it was all marketing and it was just a normal CPU for its time. Sad. It does seem like today's new languages could result in a CPU with significantly more freedom for speculation, optimization, and efficiency if it assumed a particular language runtime.


> Your ideas remind me of the Sun MAJC CPU.

Probably not a coincidence. It was memory of some Sun product data sheets that got me ruminating on this line of thought, though I don't think it was MAJC. 4MB per core, 4 cores per daughter card, if memory serves. Those product lines were probably a big reason why Java had to solve the NUMA memory problem, and then other languages just copied what they did.


Hum... The hardware isn't actually there.

It has the capabilities the GP wants, but it doesn't export them in a way you could tune into that NUMA machine.


You can run any process on any core and allocate local memory to that core. What is missing?

Sure you can't hide metadata in the unused bits of a pointer, but that didn't seem particularly critical for making good use of a NUMA machine.


> What is missing?

Fast memory, addressable by the core.

The one that is there, but can only be used by accessing the slow memory.


And praying that the slow memory is magically in the fast memory.

I think ultimately what I'm asking for means a much, much tighter bound on worst case performance, but at the cost of best case performance. That most of us never see anyway. For certain workloads, that could end up being a net positive. And there's probably some way to expose CPU metadata that gives some of that difference back.


I think it’s worthwhile to distinguish between throughput and latency for these sorts of discussions, rather than just talking about performance since scratchpads are usually better for latency (even best-case latency) and caches are usually better for throughput. Though of course in this as in any sort of discussion of computer performance, caveats abound.


We kinda had that world with the PS3 and SPE/SPUs [1]. Stuff written for them tended to run faster on more traditional hardware, since you needed to vectorize all of your loads (which made the prefetcher on other hardware really happy), but it was an absolute beast to port any code to. I remember stuff from PC/X360 just running like absolute garbage (not that multicore was well handled anyway during that era) until significant refactoring was done, up to and even including full rewrites in some cases.

[1] https://en.m.wikipedia.org/wiki/Cell_(processor)#Synergistic...


> we just put 16MB of working memory on each chip.

Just? The chips we've got have between 32KB and 64KB -- maybe 128KB in some really big chips, and it's been that way for forty years! Everything else has to swap or cache through messaging layers (L2, L3) with real latency, because we can't "just" put a bunch of memory on a chip.

I mean, I like thinking about star-trek computers too, but "just" isn't the word I would use for something like this... What would you possibly do with so much memory in a circuit?

> But what do you do with all of the existing C code? I think for backward compatibility you'd need to implement virtual memory at the working memory layer.

People rewrote almost everything in Java. And are doing it again with Rust. Programmers want to program, and if they can't come up with anything else to do, they'll just rewrite what someone else did. I don't think you should value existing C code the way its owners do, but by how their competition will value it; if someone can use FutureLang to outperform their business competition who uses C, this problem will solve itself.


Reminds me of Movidius Myriad [0] or NEC IMAPCAR [1]. Other comments also mention Cell.

Programming those is so different that you need to design specifically for the chip, and unless there is a great consolidation, they stay niche.

[0] https://www.tomshardware.com/news/movidiud-myriad2-vpu-visio...

[1] https://www.nec.com/en/global/techrep/journal/g06/n05/pdf/t0...


This is basically how tile renderers work in mobile GPUs. Supercomputers also have a similar design with MPI and toroid networks, although on a multi-machine scale.


Partitioning memory and cores is easy enough, and a bunch of Xeons have cache partitioning. It's still generally transparent to the program, but I don't think that makes it less NUMA.


Great article!

Statistical fallacies are rampant in performance eval, even in academic settings. When designing statistical tests for performance, the keyword you want here is non-parametric. E.g., a U-test is a non-parametric analog to the t-test: it just looks at the rank statistics of results instead of their values, thus eliminating dependence on the underlying distribution.

Another issue that pops up is sample independence. Statistical tests are often predicated on each sample being independent and identically distributed (i.i.d.), but in reality this is often not the case. For instance, running all the tests of one group and then all the tests of the other could heat the CPU and cause reduced performance in the second trial.
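A minimal sketch of both points, assuming scipy is available and using a made-up pair of workloads (interleaving the runs so warm-up and thermal drift hit both variants roughly equally):

    import random
    import time

    from scipy.stats import mannwhitneyu

    def time_ns(fn, arg):
        start = time.perf_counter_ns()
        fn(arg)
        return time.perf_counter_ns() - start

    # Two hypothetical implementations under comparison.
    data = [random.random() for _ in range(100_000)]
    variants = {"a": lambda xs: sorted(xs),
                "b": lambda xs: sorted(xs, key=lambda x: x)}

    times = {"a": [], "b": []}
    for _ in range(200):
        # Randomize the order within each round instead of running all of
        # one variant and then all of the other.
        for name in random.sample(list(variants), 2):
            times[name].append(time_ns(variants[name], data))

    # Rank-based test: no normality assumption on the timing distributions.
    stat, p = mannwhitneyu(times["a"], times["b"], alternative="two-sided")
    print(f"U={stat:.0f}, p={p:.3g}")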


We use non-parametric statistics for performance testing in ClickHouse [1], an approach picked up from the article "A Randomized Design Used in the Comparison of Standard and Modified Fertilizer Mixtures for Tomato Plants".

[1] https://clickhouse.com/blog/testing-the-performance-of-click...


Many would do well to look at simple DOE (design of experiments) when building an experimental design. At least look at the data distributions before taking means and stdev.

NIST has some nice simple descriptions and example experimental designs: https://www.itl.nist.gov/div898/handbook/pmd/pmd.htm


You might want to check out the gamma distribution. It is also zero-bounded, just like the log-normal distribution, but it was originally created to model waiting times in queueing theory, which is actually an excellent parallel to the idea of measuring compute latency.
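A quick numpy sketch of that queueing intuition (the number of stages and the rate are arbitrary): the total wait across k exponential stages is gamma-distributed with shape k.

    import numpy as np

    rng = np.random.default_rng(0)
    k, rate = 4, 1.0       # four service stages, each exponential with rate 1

    # Total waiting time through k exponential stages...
    waits = rng.exponential(scale=1 / rate, size=(100_000, k)).sum(axis=1)
    # ...has the same distribution as a Gamma(shape=k, scale=1/rate) draw.
    gamma_draws = rng.gamma(shape=k, scale=1 / rate, size=100_000)

    print(f"sum of exps: mean={waits.mean():.3f}, var={waits.var():.3f}")
    print(f"gamma      : mean={gamma_draws.mean():.3f}, var={gamma_draws.var():.3f}")
    # Both should be close to mean = k/rate = 4 and var = k/rate^2 = 4.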


Gamma and delta distributions have been very helpful to me in performance work, as well as non-parametric statistical tests. However, when you try to tell a lot of other engineers about them, they don't really understand why a t-test and a standard deviation don't work.


I too am an engineer that doesn't understand why/when you would use a gamma distribution. Could you point me in the right direction?


There's a pretty good section on the natural occurrences and applications of the distribution in the wikipedia article.

https://en.wikipedia.org/wiki/Gamma_distribution#Occurrence_...


> Consider a sequence of events, with the waiting time for each event being an exponential distribution with rate β

Why is the exponential distribution of waiting times a reasonable assumption in this case?


In the general case? Because the exponential distribution is directly the result of a Poisson process, and Poisson processes are naturally ubiquitous. A Poisson process results in Poisson-distributed counts of events over fixed periods of time, and exponentially distributed times between events. They are used to model everything from rainfall to particle physics, and those concepts do fit very neatly with things that are obviously more complicated, like human dynamics.

In the case of memory-bound code benchmarking? Because memory access latencies in NUMA architectures with tiered caches are approximately exponential: L1 vs L2 vs L3 vs main memory vs swap are all exponential increases in latency. It's obviously more of a step function than a continuous function, but the relationship is definitely exponential.

The pages for the exponential distribution and the Poisson distribution are also informative and interesting.

https://en.wikipedia.org/wiki/Exponential_distribution#Occur...

https://en.wikipedia.org/wiki/Poisson_distribution#Occurrenc...

When I go through the process of finding an appropriate distribution, I usually first look through Wikipedia to see if I can find conceptual parallels, just like this. That doesn't mean I settle on it... I'll usually test the distribution fit with functions like those found in fitdistrplus, as well as test to see if residuals are roughly normally distributed.

One hard and fast rule that I use though is that I never use an unbounded distribution to model a bounded process. For compute times, you can't have negative latencies, and you can't have zero latencies, but you can have infinite latencies. Therefore, I would only consider using distributions in the space of (0,infinity). The log normal distribution fits this and is probably adequate here, but the gamma distribution might be a smidge better.

https://cran.r-project.org/web/packages/fitdistrplus/vignett...
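In the same spirit as that fitdistrplus workflow, a rough scipy equivalent (the synthetic timings are an assumption, just to have something to fit; a K-S test or Q-Q plot would be the next step before settling on either candidate):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Stand-in for real benchmark timings: strictly positive, right-skewed.
    timings = rng.lognormal(mean=2.0, sigma=0.4, size=5_000)

    # Fit candidates supported on (0, inf), with the location pinned at 0.
    fits = {"gamma": (stats.gamma, stats.gamma.fit(timings, floc=0)),
            "lognorm": (stats.lognorm, stats.lognorm.fit(timings, floc=0))}

    # Compare via log-likelihood (higher is better).
    for name, (dist, params) in fits.items():
        ll = np.sum(dist.logpdf(timings, *params))
        print(f"{name:8s} log-likelihood = {ll:.1f}")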


Yeah that is what I can't figure out


What's extra fun is you can run into a distribution that doesn't have a mean.

A concrete example of this: Exponential backoff where each attempt has a 1/2 probability of succeeding independently of other attempts, and you double the time between each attempt.

If you try to benchmark this process, you'll find that no matter how big your sample size is you'll never get consistent results.
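A quick simulation of the parent's setup (my own sketch; the time units are arbitrary). The expected total time is the sum over n of (1/2)^n * (2^n - 1), which diverges, so the running mean never settles no matter how many samples you take:

    import random

    def backoff_time():
        """Total time spent when each attempt succeeds with probability 1/2
        and the delay doubles after every failure (first delay = 1 unit)."""
        total, delay = 0, 1
        while random.random() >= 0.5:   # this attempt fails
            total += delay
            delay *= 2
        return total + delay            # wait out the successful attempt too

    random.seed(0)
    running_total = 0
    for i in range(1, 1_000_001):
        running_total += backoff_time()
        if i in (10**3, 10**4, 10**5, 10**6):
            print(f"n={i:>9,}: running mean = {running_total / i:,.1f}")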


That’s only true though if you don’t have a time-out at some point.


One thing I discovered benching Oracle instances is that cloud performance is extremely non-deterministic. Sometimes you get variance on every run. Sometimes you get consistent results, but then you bench again later and the results are still consistent, yet totally different.

Then I would run the same bench on my laptop and (with turbo disabled) it's pretty much always consistent.

I can only guess why... turbo on the host CPU? Other transient tenants competing for the same cache? Something to do with VMs? Also, I am not sure if AWS has this issue.


You're sharing the same hardware, with who knows how many others.


Yeah. Hence I'm saying an AWS c6i is maybe the wrong platform for benchmarks...

Bare metal is OK if you must test NUMA and such, but you are still at the mercy of turbo behavior, whereas you can completely control that on a local machine.


I mean it's physically impossible for the generating process of positive timings to be normally distributed. The normal distribution has support x ∈ ℝ.


Usually I don't do any statistical tests when benchmarking and optimizing.

("If your result depends on statistics then you need a better experiment.” Oft quoted remark attributed to Rutherford.)

s/experiment/optimization/

I'm happy with min, max and average. max is actually often the most important.

On the other hand the point of view in this paper is interesting:

"Violating the normality assumption may be the lesser of two evils" (1)

If you really feel the urge to do statistical tests then do it.

(I do not work in medicine or anything dangerous at all)

https://link.springer.com/article/10.3758/s13428-021-01587-5 (1)


The only extra statistic in the article not on your list is standard deviation. I don’t know whom this comment is aimed at.


The comment is a bit abrupt.

The article questions the normal distribution assumption for benchmark timings.

It says that this assumption would be important for the "standard error of the mean":

"Your error should go down as the square root of the number of measures."

A normal distribution is not even necessary for this:

https://en.wikipedia.org/wiki/Standard_error

However, normal distributions actually behave benignly and are a prerequisite for some statistical tests.

This is where my first comment comes in: for my benchmarking, statistical tests are not necessary at all...

---

The author writes that he only measures minimum and average, but the maximum values are of course also important and worth measuring.

The author's key point is that without a normal distribution, the maximum values can be highly scattered. I agree and basically it is always nice to know the distribution of the random variable.


By definition, timing benchmarks cannot be Gaussian, since the time cannot be negative. For anything that has such a bound on one side, log-normal is a safer default choice, and use of the normal distribution must be justified.


Caches and other hidden state complicate matters significantly.


From the OP:

> The typical assumption is that we get a normal distribution, and so we should therefore report the average time.

That "assumption" is asking a lot and is not justified. Actually, the main source of normal distributions is from the central limit theorem where get the distribution from adding lots of random variables.

> ... therefore report the average time

There are good reasons to consider averages for any distribution, "normal" or not.

Of course, as in a classic paper by Halmos and Savage, there is the topic of sufficient statistics that for the normal distribution is a bit amazing. Gee, maybe the OP (original post) was thinking about sufficient statistics. But for the normal distribution, the sufficient statistics are the pair of sample average and sample standard deviation.

> If you have normal distributions, you can mathematically increase the accuracy of the measure by taking more samples.

This is justified by the law of large numbers, both the strong and weak versions, where we need not assume a normal distribution. Texts where the law of large numbers is proven in great detail and generality are by authors Loeve, Chung, Neveu, and Breiman, among others.

> It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean.

Sorry: with the normal distribution, there is positive probability of samples, positive or negative, with finite absolute value as large as you please.


I found a long time ago that timings in general are lognormal. I had to routinely apply transformations to get meaningful ANOVA and DOE results.


The obvious transformation is the log. Are there others?


Yes, Box-Cox can be used with the correct choice of lambda, even if the lognormal itself isn't a perfect match for the actual distribution.
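For instance, a rough scipy sketch (the lognormal-ish synthetic timings are just an assumption for illustration): scipy.stats.boxcox picks lambda by maximum likelihood, and a lambda near 0 reduces to a plain log transform.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    timings = rng.lognormal(mean=2.0, sigma=0.5, size=5_000)  # strictly positive

    transformed, lam = stats.boxcox(timings)   # lambda chosen by max likelihood
    print(f"estimated lambda = {lam:.3f}  (near 0 means 'just take the log')")

    # Normality of the transformed data improves markedly, so ANOVA/DOE
    # machinery built on normal errors becomes more defensible.
    for name, x in [("raw", timings), ("box-cox", transformed)]:
        stat, p = stats.shapiro(x[:500])       # Shapiro-Wilk on a subsample
        print(f"{name:8s} Shapiro-Wilk p = {p:.2g}")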


> It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean.

Is "possible" supposed to be "probable" there?


I feel obligated to point out that, by the central limit theorem, it doesn’t matter much whether the timings are normally distributed. The measurements only need to be independent and identically distributed with finite variance for the estimate of the mean to be approximately normally distributed.

https://en.m.wikipedia.org/wiki/Central_limit_theorem
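A small numpy sketch of that point, with lognormal "timings" standing in for real measurements (the parameters are arbitrary): the means of larger and larger samples look increasingly normal even though the individual timings never do.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    def sample_means(n_per_mean, n_means=2_000):
        """Draw n_means sample means, each over n_per_mean lognormal timings."""
        return rng.lognormal(mean=2.0, sigma=0.8,
                             size=(n_means, n_per_mean)).mean(axis=1)

    for n in (1, 10, 100, 1000):
        means = sample_means(n)
        print(f"n={n:>4}: std of mean = {means.std():.3f}, "
              f"skewness = {stats.skew(means):+.2f}")
    # The std of the mean shrinks roughly as 1/sqrt(n) and the skewness heads
    # toward 0, i.e. the distribution of the mean approaches a normal.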


I think he's saying that since the distribution follows a log-normal, the mean is harder to pin down than if it were a normal, due to the higher variance. From the CLT, the standard deviation of the mean statistic is sd(X)/sqrt(N) -- the sd(X) is higher than he'd expect from an assumed normal X.

The other point he makes is that the mean is not very representative in this wide log-normal scenario, and optimizing for the difference between min and mean is more relevant.

My 2 cents.

Also to nitpick on the article:

"It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean."

I guess he meant probable :)


One can also work in log space to do the stats, and report the geometric mean in linear space…


I think we're nitpicking too much here, but the main takeaway seems to be to not assume normals and to actually measure things. Seems pretty straightforward advice, but if you've worked anywhere, you'll know how often people content themselves with reporting averages.



