
Yeah, the channels are populated correctly. As you can see from mlc-results.txt, the latency looks fine:

   mlc --idle_latency
  Intel(R) Memory Latency Checker - v3.11b
  Command line parameters: --idle_latency

  Using buffer size of 1800.000MiB
  Each iteration took 424.8 base frequency clocks (       104.9   ns)
As do the node-to-node --bandwidth_matrix results:

                  Numa node
  Numa node            0       1       2       3       4       5       6       7
         0        45999.8 46036.3 50490.7 50529.7 50421.0 50427.6 50433.5 52118.2
         1        46099.1 46129.9 52768.3 52122.3 52086.5 52767.6 52122.6 52093.4
         2        46006.3 46095.3 52117.0 52097.2 50385.2 52088.5 50396.1 52077.4
         3        46092.6 46091.5 52153.6 52123.4 52140.3 52134.8 52078.8 52076.1
         4        45718.9 46053.1 52087.3 52124.0 52144.8 50544.5 50492.7 52125.1
         5        46093.7 46107.4 52082.0 52091.2 52147.5 52759.1 52163.7 52179.9
         6        45915.9 45988.2 50412.8 50411.3 50490.8 50473.9 52136.1 52084.9
         7        46134.4 46017.2 52088.9 52114.1 52125.0 52152.9 52056.6 52115.1
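(For anyone reproducing this, the matrix above comes from the stock mlc option:)

    mlc --bandwidth_matrix
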
I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
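
For anyone sanity-checking their own box, the quickest way to confirm what the current NPS / L3-as-NUMA BIOS setting actually exposes (standard tools, nothing EPYC-specific):

    numactl --hardware    # nodes, their CPUs, and per-node memory sizes
    lscpu | grep -i numa  # NUMA node count and CPU-to-node mapping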

Updating from the board-delivered F14 to the latest 9004-series F31 BIOS (the F33 releases bricked the board and required a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.

While it's 1DPC, the memory is 2R (but still registers at 4800) and it trains on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...
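
(pmbw itself is easy to rerun - if I remember the tool right, invoked with no arguments it sweeps the benchmark kernels over array sizes and thread counts and dumps results to a stats file for plotting; defer to the pmbw README for the exact plotting steps.)

    pmbw    # runs the full sweep, can take a while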

Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.

I might write up a more step-by-step guide at some point to help others, but for now the testing scripts are there. I think most people who are looking at theoretical MBW should probably do their own real-world testing, as it seems to vary a lot more than GPU bandwidth does.
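
If you just want a quick real-world number without any of my scripts, the stock mlc bandwidth modes are probably the lowest-effort check (descriptions paraphrased from the mlc help text):

    mlc --peak_injection_bandwidth   # peak bandwidth at several read:write ratios, all-local accesses
    mlc --max_bandwidth              # same ratios, but with an automated search for the true maximum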



To saturate the bandwidth you would need ~16 zen4 cores, but you could first try running

    likwid-bench -t load -i 100 -w S0:5GB:8:1:2
and see what you get. I think you should be able to get somewhere around ~200 GB/s.


w/ likwid-bench S0:5GB:8:1:2: 129136.28 MB/s. At S0:5GB:16:1:2: 184734.43 MB/s - that's roughly the plateau (S0:5GB:12:1:2 gives 186228.62 MB/s and S0:5GB:48:1:2 gives 183598.29 MB/s). According to lstopo, my 9274F has 8 dies with 3 cores each (currently each die is set to its own NUMA domain, i.e. the L3-as-NUMA strategy). In any case, I also gave `numactl --interleave=all likwid-bench -t load -w S0:5GB:48:1:2 -i 100` a spin and topped out about the same place: 184986.45 MB/s.


Yes, you're correct that your CPU has 8 CCDs, but the bandwidth with 8 threads is already too low. Those 8 cores should be able to get you to roughly half of the theoretical bw. For comparison, 8 zen5 cores can reach the ~230 GB/s mark.

Can you repeat the same likwid experiment but with 1, 2 and 4 threads? I'm wondering at what point it begins to deteriorate quickly.

Maybe also worth doing: repeat the 8-thread run but force likwid to pick every third physical core, so that you get a 1-thread-per-CCD setup.
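
Something like the following should show which core IDs sit on which CCD so you can pick them explicitly (standard likwid-topology flags):

    likwid-topology -c -g    # -c adds cache sizes/sharing, -g prints an ASCII diagram of the layout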


1 thread: 33586.74 MB/s, 2 threads: 47371.93 MB/s, 4 threads: 65870.07 MB/s

With `likwid-bench -i 100 -t load -w M0:5GB:1 -w M1:5GB:1 -w M2:5GB:1 -w M3:5GB:1 -w M4:5GB:1 -w M5:5GB:1 -w M6:5GB:1 -w M7:5GB:1` we get 187976.60 MB/s.

Obviously there's a bottleneck somewhere - at 33.5 GB/s for a single thread, linear scaling across 12 channels would get close to 400 GB/s, which is what you'd expect, but in reality it doesn't even reach half of that. Bad MC? Bottleneck w/ the MB? Hard to tell; I'm not sure there's much more that can be done to diagnose things without swapping hardware.
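
Back-of-the-envelope, assuming all 12 DDR5-4800 channels are populated (1DPC on SP5):

    4800 MT/s x 8 bytes/channel      = 38.4 GB/s per channel
    38.4 GB/s x 12 channels          = 460.8 GB/s theoretical
    33.5 GB/s x 12 (linear scaling)  ~ 402 GB/s expected real-world
    best measured (likwid)           ~ 188 GB/s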


Mixed results. I suspect you might have an ES (engineering sample) CPU.


Besides not having ES markings, it shows a retail serial and stepping in dmidecode, so that's unlikely.


I see. I'm out of other ideas besides playing with the BIOS tweaks for memory and the CPU. I can see that there are plenty of them, for better or for worse.

At a quick glance, some of them look interesting, such as "Workload Tuning", where you can pick different profiles - there is a "memory throughput intensive" profile. You can also try explicitly disabling the DIMM slots that are not in use, given that only half of them are populated. I wouldn't hold my breath that any of these will make a big difference, but you can give it a try.


Another idea: AFAICS there have been a few zen-related memory-bandwidth bugs reported against likwid, and in particular https://github.com/RRZE-HPC/likwid/issues/535 suggests you could be hitting a similar bug with a different CPU series.

The bug report used AMDuProf to confirm that the actual bandwidth is ~2x what likwid reported. You could try the same.
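
If you have AMD uProf installed, the counter-based cross-check should be something along these lines - I'm going from memory on the exact flags, so verify against `AMDuProfPcm -h` first:

    AMDuProfPcm -m memory -a -d 10    # sample the memory bandwidth counters on all cores for ~10s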





