Hacker News

If you exclude branches, integer division, and memory instructions, it's easy to perfectly model your program's performance. Even for memory, if you assume it's all L1 cache hits, performance is again perfectly predictable--and if it's not all L1 cache hits, the optimization of the surrounding code basically doesn't matter (not on the level of superoptimization).
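As a sketch of what "perfectly predictable" means here: with no branches, division, or memory, the cycle count is just a function of the instruction mix. A toy cost model in Python (the latency numbers are illustrative, not taken from any real vendor manual, and this ignores superscalar overlap entirely):

```python
# Hypothetical per-instruction latency table. With no branches, division,
# or memory ops, total latency is just a sum over the sequence.
LATENCY = {"add": 1, "sub": 1, "shl": 1, "mul": 3, "lea": 1}

def predicted_cycles(instrs):
    """Predict the cycle count of a straight-line sequence (no ILP modeled)."""
    return sum(LATENCY[op] for op in instrs)

# Computing x*9 as one multiply vs. a shift-and-add pair:
print(predicted_cycles(["mul"]))         # 3
print(predicted_cycles(["shl", "add"]))  # 2
```

A real model would also track port contention and dependency chains, but the point stands: the state space is closed and deterministic.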

The state of the art for superoptimization would be Aiken's work (linked elsewhere in the comments). The hardest part here is building formal specifications of every single instruction, but their group has also shown that it's possible to automatically generate these from a small seed set by similar techniques.



> If you exclude branches, integer division, and memory instructions, it's easy to perfectly model your program's performance.

Assuming you're right about that, those seem to me like very big caveats.

Also, wouldn't thermals-based clock throttling be a factor as well, assuming that we're talking about wall-clock performance instead of clock-cycles performance? I'd be surprised if their impact on program performance is easy to perfectly model.


Superoptimization looks mainly at peephole optimizations--small fragments of code where you're trying to shave off one or two cycles. STOKE (the superoptimizer I alluded to in my comment) scales to dozens of instructions, still requires loop-free code (because SMT solvers don't handle loops at all), and I'm not sure it handles branch cases more complicated than nested if-else [1].

The correctness oracles we need for superoptimization require formal methods; basically SMT solvers these days. And modern SMT solvers tend to barely work when you deal with division (and remainder) operations, or if you use floating-point code [2].
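To make "correctness oracle" concrete: the solver's job is to decide whether two code fragments agree on every input. A toy stand-in in Python (real engines hand this query to an SMT solver like Z3 over bit-vectors; here exhaustive enumeration over 8-bit values plays that role, which only works because the input space is tiny):

```python
def equivalent_8bit(f, g):
    """Toy correctness oracle: exhaustively compare two candidate fragments
    over all 8-bit inputs. A superoptimizer would instead encode both
    fragments as bit-vector formulas and ask an SMT solver for a
    counterexample."""
    return all(f(x) & 0xFF == g(x) & 0xFF for x in range(256))

# x*2 and x<<1 are equivalent; x*2 and x+1 are not.
print(equivalent_8bit(lambda x: x * 2, lambda x: x << 1))  # True
print(equivalent_8bit(lambda x: x * 2, lambda x: x + 1))   # False
```

At real register widths enumeration is hopeless (2^64 inputs), which is exactly why the SMT encoding, and hence its weakness on division and floats, matters.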

So, essentially, the small size of the programs you can apply this to generally means you can get away with ignoring these issues. The biggest exception is branch prediction, where a cmov is better than a branch if and only if the condition is unpredictable.

> Also, wouldn't thermals-based clock throttling be a factor as well, assuming that we're talking about wall-clock performance instead of clock-cycles performance?

Clock-cycle performance will largely be monotonic in wall-clock performance, especially at the level of peephole optimization. You're not significantly adjusting the amount of memory traffic or other microarchitectural state. The difference in energy between using a shift instruction versus a multiply versus an add isn't going to cause enough thermal difference to trigger throttling. To really increase energy usage enough to change frequency, you'd have to go from unvectorized code to using the largest register width, and that just isn't going to happen in a superoptimizer--your loop and SLP vectorizers will be the ones to change that. And once you've triggered the AVX-512-heavy frequency level, you're still going to want the variant that uses fewer clock cycles anyway.

[1] It might work based on enumerating every path via path conditions, which could handle threaded jumps, but means complexity is roughly exponential in the number of conditions. Given that we're not really pushing hundreds of instructions yet, that means that branching has to be very minimal in anything our current engines can work with.

[2] The typical approach here I believe is uninterpreted functions, so that SMT is largely only capable of applying the commutativity, associativity, distributivity, and other axioms you put into it.


I've done some very, very preliminary work to pave the way towards iteration and recursion in superoptimizers, but you are correct that "straight-line over tiny problems" is mostly where it's at.

Even multiplication ops are a real bear for SMT. I can get them to work, but they are way more expensive than other arithmetic.

From what I understand, the whole 'approximate with uninterpreted functions' approach can be done under the hood by SMT solvers during bit-blasting: more expensive operations can be expanded lazily (and until then treated as UFs) in the hope that useful properties emerge first. Logics with floating-point theories do exist in the SMT benchmarks, but like a big coward I haven't played with them.
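A tiny illustration of why the UF approximation is both cheap and weak: if multiply is uninterpreted, two terms are equal only when the axioms you supplied can rewrite one into the other, so semantic facts the axioms don't cover are invisible. A sketch (hand-rolled term rewriting, not a real solver; the only axiom supplied is commutativity of `mul`):

```python
# Terms are tuples: ("mul", "x", "y"), ("shl", "x", 1), etc.
def normalize(term):
    """Rewrite a term into a canonical form under the supplied axioms."""
    op, *args = term
    args = [normalize(a) if isinstance(a, tuple) else a for a in args]
    if op == "mul":          # the one axiom we know: mul is commutative
        args = sorted(args, key=str)
    return (op, *args)

def uf_equal(t1, t2):
    """Equal iff the terms coincide after normalization."""
    return normalize(t1) == normalize(t2)

print(uf_equal(("mul", "x", "y"), ("mul", "y", "x")))  # True: commutativity
print(uf_equal(("mul", "x", 2), ("shl", "x", 1)))      # False: a UF can't see this
```

Lazily bit-blasting `mul` would recover the second fact, at the usual cost of multiplication constraints.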

You can't entirely ignore cost issues in a superoptimizer - the difference between, say, a latency-3 instruction like a multiply and a latency-1 operation might be decisive. But (a) there are ways to model this in your optimization process and (b) fine differences of cost aren't the primary issue when superoptimizing anyhow. The latter point is decisive - I have a benchmark (on bytewise set membership) in my superoptimizer where I examine 25 million 4-instruction specializations and there are only 7 solutions. I can afford to spend a few extra cycles evaluating which of those 7 will be faster under a range of transforms, etc. - the expensive part was finding the 7/25M solutions to begin with (or, as is often the case, finding 0/25M solutions).
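For a sense of the shape of that search: a minimal sketch with a made-up four-op accumulator ISA, brute-force enumeration of fixed-length programs, and testing against the spec on sample inputs (a real engine would then verify the survivors with an SMT oracle rather than trust the test set):

```python
from itertools import product

# A made-up ISA acting on a single accumulator.
OPS = {
    "add1": lambda x: x + 1,
    "dbl":  lambda x: x * 2,
    "neg":  lambda x: -x,
    "shl2": lambda x: x << 2,
}

def run(prog, x):
    """Execute a program (a sequence of op names) on input x."""
    for op in prog:
        x = OPS[op](x)
    return x

def search(spec, length, tests=range(-4, 5)):
    """Enumerate every program of the given length; keep those that
    match the spec on all test inputs."""
    return [p for p in product(OPS, repeat=length)
            if all(run(p, x) == spec(x) for x in tests)]

# Programs computing 2x+2 in two instructions:
print(search(lambda x: 2 * x + 2, 2))  # [('add1', 'dbl')]
```

With 4 ops and length 4 that's already 256 candidates; a real 4-instruction space over a usable ISA subset is where the 25M figure comes from, and most of the work is rejecting non-solutions.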





