> As I understand it, hardware counters would remain consistent in the face of the normal noisy CI runner.
With cloud CI runners you'd still have issues with hardware differences, e.g. different CPUs counting slightly differently. Even memcpy behavior is hardware-dependent! If you're measuring multi-threaded programs, concurrent algorithms may also be sensitive to timing. Then there are microcode updates for the latest CPU vulnerabilities. And that's just instruction counts; other metrics such as cycle counts, cache misses or wall-time are far more sensitive.
To make sure we're not slowly accumulating <1% regressions hidden in the noise, and to be able to attribute regressions to a specific commit, we need really low noise levels.
So for reliable, comparable benchmarks, dedicated hardware is needed.
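To put rough numbers on the "hidden in the noise" concern, here is a minimal sketch. The 0.5% per-commit regression, the 20 commits and the 2% noise floor are made-up illustrative values, not measurements from any real runner, and both function names are hypothetical.

```python
def accumulated_slowdown(per_commit_regression: float, commits: int) -> float:
    """Total relative slowdown after `commits` commits that each regress by the same fraction."""
    return (1.0 + per_commit_regression) ** commits - 1.0


def is_detectable(delta: float, noise: float) -> bool:
    """A change can only be pinned on a commit if it exceeds the run-to-run noise."""
    return abs(delta) > noise


# Assumed values: 20 commits, each 0.5% slower, on a runner with 2% noise.
total = accumulated_slowdown(0.005, 20)
print(f"each step detectable: {is_detectable(0.005, 0.02)}")  # False
print(f"accumulated slowdown: {total:.1%}")                   # 10.5%
```

Under those assumptions no single commit stands out, yet the total ends up around 10%.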
> With cloud CI runners you'd still have issues with hardware differences
For my project, what I measure really is the diff of each commit: I start from the parent commit that isn’t part of the PR and re-measure it, then measure each new commit in turn. This should avoid picking up changes in hardware, as well as things like Rust versions (if those aren’t locked in via rustup).
The rest of your points are valid of course, but this was a good compromise for my OSS project where I don’t wish to spend extra money.
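Roughly, the per-commit diffing looks like the sketch below. This is only an illustration, not my actual CI script: `per_commit_deltas` and the `measure` callback are hypothetical names, the counts are fake, and I'm assuming each commit is compared against the one measured just before it in the same job.

```python
from typing import Callable, Dict, List


def per_commit_deltas(
    parent: str,
    commits: List[str],
    measure: Callable[[str], int],
) -> Dict[str, float]:
    """Measure the parent and every PR commit in one job and return the
    relative change of each commit versus its predecessor."""
    deltas: Dict[str, float] = {}
    # Baseline is re-measured on this runner, not read from an earlier run.
    previous = measure(parent)
    for commit in commits:
        current = measure(commit)
        deltas[commit] = (current - previous) / previous
        previous = current
    return deltas


# Stand-in for a real measurement (e.g. an instruction count per commit).
fake_counts = {"base": 1_000_000, "c1": 1_010_000, "c2": 1_010_000}
deltas = per_commit_deltas("base", ["c1", "c2"], fake_counts.__getitem__)
for commit, delta in deltas.items():
    print(f"{commit}: {delta:+.2%}")
```

Because the baseline comes from the same job, a different CPU or toolchain on the next run shifts both sides of the comparison rather than showing up as a regression.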