Superoptimizer – A Look at the Smallest Program [pdf] (1987) (stanford.edu)
76 points by bryanrasmussen on Nov 24, 2020 | 30 comments


This is a cool paper. I'm building an SMT-based superoptimizer for SIMD sequences that is still pretty stupid (it can only handle "lanewise" SIMD, i.e. some op applied to each lane; it can only do AVX2; and it can only handle arbitrary graphs of up to 4 instructions before it gets too slow).

And yet... it concocted a good chunk of a PSHUFB-based pattern matcher (effectively "shufti" from https://github.com/intel/hyperscan) that I invented in 2009. If I could push further on this and scale from 4 instructions handled in about 45 minutes (single core, though it scales well) up to 5 instructions (currently a month of single-core search time) and on to 6-7 instructions (I don't even know!), this system would be able to come up with the full version of shufti, a "quite commercially relevant" SIMD sequence.
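
For the curious, the core trick is small enough to sketch. This is not the Hyperscan code, just a rough AVX2 reconstruction of the nibble-table idea with made-up names and placeholder tables:

    // Rough sketch of the nibble-table trick behind shufti: classify each
    // byte by looking up its low and high nibbles in two 16-entry tables via
    // VPSHUFB and ANDing the results. "shufti_block" and the table contents
    // are placeholders for illustration, not Hyperscan's actual code.
    #include <immintrin.h>
    #include <cstdint>

    // lo_tbl / hi_tbl hold the same 16-byte table in both 128-bit lanes,
    // since VPSHUFB shuffles within each 128-bit lane independently.
    static uint32_t shufti_block(__m256i lo_tbl, __m256i hi_tbl, __m256i in) {
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        __m256i lo_nib = _mm256_and_si256(in, low_mask);
        __m256i hi_nib = _mm256_and_si256(_mm256_srli_epi16(in, 4), low_mask);
        __m256i match  = _mm256_and_si256(_mm256_shuffle_epi8(lo_tbl, lo_nib),
                                          _mm256_shuffle_epi8(hi_tbl, hi_nib));
        // A byte belongs to the class when its two lookups share a set bit;
        // return one result bit per input byte.
        __m256i none = _mm256_cmpeq_epi8(match, _mm256_setzero_si256());
        return ~static_cast<uint32_t>(_mm256_movemask_epi8(none));
    }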

Not saying that sequence was the cleverest thing ever, but it seemed novel enough, and to have a machine just spew it out when given the black box description of what it's meant to achieve... it would be pretty cool.

All I need is a few orders of magnitude worth of search optimizations. That's OK - when I started, my superoptimizer took a couple days to handle 4 instructions, so it might still happen.


I worked on an SMT-based superoptimizer research project[1] as an undergrad. We were targeting a weird stack-based chip so the details were pretty different, but it worked surprisingly well. Really fun project for sure!

SMT solvers are an incredible general-purpose tool for these things, and we got quite far through various encoding tricks like synthesizing with much smaller bit-widths and then verifying with full widths. However, my impression is that while SMT solvers are great when starting out, at some point the overhead and complexity of relying on encoding tricks start to outweigh the advantages, to the point where either an entirely custom algorithm or some mix of existing solvers with a custom algorithm is the way to go. I don't know if that's fundamental, or if it's just a function of people being able to understand their own algorithms better than SMT solvers - most people I talked to saw SMT solver performance as a black box.
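
As a toy illustration of the width trick (assuming Z3's C++ API, z3++.h, and a problem of my own invention): synthesize a shift amount against an 8-bit spec, then re-verify it at 32 bits:

    // "Synthesize narrow, verify wide" toy: find k with x << k == x * 8 over
    // 8-bit vectors, then re-check the candidate at 32 bits. Build with -lz3.
    #include <z3++.h>
    #include <iostream>

    int main() {
        z3::context c;

        // Phase 1: synthesis over a small bit-width, where search is cheap.
        z3::expr x8 = c.bv_const("x", 8);
        z3::expr k8 = c.bv_const("k", 8);
        z3::solver synth(c);
        synth.add(z3::forall(x8, z3::shl(x8, k8) == x8 * c.bv_val(8, 8)));
        if (synth.check() != z3::sat) return 1;
        int k = synth.get_model().eval(k8, true).get_numeral_int();

        // Phase 2: verification at the full width by asking for a
        // counterexample; "unsat" means the rewrite holds for all 32-bit x.
        z3::expr x32 = c.bv_const("x", 32);
        z3::solver verify(c);
        verify.add(z3::shl(x32, c.bv_val(k, 32)) != x32 * c.bv_val(8, 32));
        std::cout << "k = " << k << ", verify: " << verify.check() << "\n";
        return 0;  // expect "k = 3, verify: unsat"
    }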

I remember reading about the Sketch synthesizer, which used a custom solver as its backend. It comes to mind because the first example of using Sketch[2] that our professor showed us involved SIMD shuffling instructions. It's probably worth checking out on its own—I remember its input language being pretty easy to use—but I can't find the project itself anywhere any more, probably because it used to be hosted on BitBucket :/.

[1]: https://jelv.is/chlorophyll.pdf

[2]: https://homes.cs.washington.edu/~bodik/ucb/cs294/fa12/Lectur...


I started with a solver-heavy approach whereby the SMT solver was trusted to come up with all sorts of things. Through a process of steady refinement, my experience has been that if there's a change that makes you trust the solver less and your own optimizations more, you make it. So now I call the SMT solver a gazillion times with much simpler problems. I also use the SMT solver in a bunch of different ways to invent general and specific (i.e. workload-aware) optimizations for the problem, which I can then apply without the intervention of a solver.

I may ultimately wind up throwing SMT under the bus and doing my own bit-blasting; QF_BV ("the quantifier-free bit-vector logic") is not one of SMT's more difficult corners, and I may get some benefit in specializing my own bit-blast and talking straight to a SAT solver.
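
To give a flavour of what that would involve, here's a heavily simplified, hypothetical encoder that bit-blasts a couple of bitvector ops to Tseitin clauses in DIMACS form - the kind of output you'd hand straight to a SAT solver:

    // Hand bit-blasting sketch: lower z = x AND y and z = x XOR y over n-bit
    // vectors to Tseitin CNF clauses in DIMACS form (one SAT variable per
    // bit, literals as signed ints). A real encoder would cover all of QF_BV.
    #include <cstdio>
    #include <vector>

    using Clause = std::vector<int>;   // DIMACS literals: +v / -v
    using BitVec = std::vector<int>;   // one SAT variable per bit

    static int next_var = 0;
    static BitVec fresh_bv(int n) {
        BitVec v(n);
        for (int i = 0; i < n; ++i) v[i] = ++next_var;
        return v;
    }

    // z_i <-> (x_i AND y_i)
    static void encode_and(const BitVec& x, const BitVec& y, const BitVec& z,
                           std::vector<Clause>& cnf) {
        for (size_t i = 0; i < z.size(); ++i) {
            cnf.push_back({-z[i], x[i]});
            cnf.push_back({-z[i], y[i]});
            cnf.push_back({z[i], -x[i], -y[i]});
        }
    }

    // z_i <-> (x_i XOR y_i)
    static void encode_xor(const BitVec& x, const BitVec& y, const BitVec& z,
                           std::vector<Clause>& cnf) {
        for (size_t i = 0; i < z.size(); ++i) {
            cnf.push_back({-z[i],  x[i],  y[i]});
            cnf.push_back({-z[i], -x[i], -y[i]});
            cnf.push_back({ z[i],  x[i], -y[i]});
            cnf.push_back({ z[i], -x[i],  y[i]});
        }
    }

    int main() {
        std::vector<Clause> cnf;
        BitVec x = fresh_bv(8), y = fresh_bv(8), a = fresh_bv(8), q = fresh_bv(8);
        encode_and(x, y, a, cnf);   // a = x & y
        encode_xor(x, y, q, cnf);   // q = x ^ y
        std::printf("p cnf %d %zu\n", next_var, cnf.size());
        for (const Clause& cl : cnf) {
            for (int lit : cl) std::printf("%d ", lit);
            std::printf("0\n");
        }
        return 0;
    }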

I have been doing a fair bit of work to try to reduce the 'black box' aspect of SMT solver performance, but SMT is hard to get a handle on. One minute you're solving Sudoku and logic puzzles ("The professor lives in the same street as the baker, while the man who lives in the blue house does not ride a bike, ..."). The next, you're staring at pages and pages of math aimed at people who want to be researchers in SMT - there's not much intermediate-level material aimed at practitioners.


As superoptimizer research papers go, this is pretty ancient and very far behind the state-of-the-art. The paper describing STOKE (https://theory.stanford.edu/~aiken/publications/papers/asplo...) is what I would recommend as the starting point instead.


Sorry, my original post was confusing. I did not mean to imply that I based my work on the Massalin paper, just that I liked it. As I said, my work is SMT-based and more closely related to Regehr's 'souper' or the approach described in Gulwani's "Synthesis of Loop-Free Programs" from PLDI '11 (I think).

While STOKE is interesting, I have trouble understanding why they even call it a "superoptimizer". The point of appending "super-" to optimizer was to get away from the (strictly speaking incorrect) practice of referring to compiler improvements as "optimizations" (when they don't achieve optimal outcomes).

STOKE is "stochastic", so it cannot realistically claim to always find optimal sequences. So it's just another "improver", or if you will, an "optimizer".


Seems like it might be faster to autogenerate valid instruction streams, trace them and then feed the traces into an ML model and have it generate the fastest sequences. Exercise left to the ...


The big gap here is finding instruction streams that match the thing you are trying to do. I am not aware of any obvious way to autogenerate these short of the typical superoptimizer work (usually brute force or SMT).


Do you have a site or paper about your SMT-based superoptimizer? Or maybe even a GitHub repo?


Not yet. I have a (currently neglected) blog at branchfree.org, but haven't yet posted about my superoptimizer. Certainly no paper yet.

I may well be churlish enough not to make my superoptimizer open source, as I'm still daydreaming about doing mildly commercial things with it.


See also:

Conditionally Correct Superoptimization https://theory.stanford.edu/~aiken/publications/papers/oopsl...

Stochastic Superoptimization https://theory.stanford.edu/~aiken/publications/papers/asplo...

Automatic Generation of Peephole Superoptimizers https://theory.stanford.edu/~aiken/publications/papers/asplo...



This paper changed compiler code generators; the sequences it discovered were rapidly adopted. It's nice to see it again.


Am I reading this right that this process optimizes for program length but not necessarily program performance?


This is a very old paper that, as a sibling comment indicated, served mainly to introduce the idea and show a proof of concept.

If you see my other top-level comment, there are papers that have advanced this idea rather considerably since then. In particular, one of the more promising recent approaches is "stochastic" superoptimization, where you treat it as a guided search problem instead of a brute-force enumeration. One of the great things about this is that the search doesn't need to be perfectly accurate: e.g., you can evaluate only the 10% most promising programs via benchmarks to determine how fast they are, using a much faster (but less accurate) instruction-counting method for the other 90%.
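
Very roughly, and with toy stand-ins for every piece (this is not STOKE's actual machinery), the evaluation strategy looks like this:

    // Toy sketch of the mixed evaluation: reject incorrect candidates, score
    // everything with a cheap static estimate, and only pay for a real
    // benchmark when a candidate looks promising. Program, mutate and the
    // benchmark/test functions are stand-in stubs.
    #include <cstdlib>
    #include <vector>

    using Program = std::vector<int>;   // toy: each int is an opcode id

    static bool   passes_tests(const Program&)    { return true; }            // stub
    static double static_cost(const Program& p)   { return (double)p.size(); }
    static double run_benchmark(const Program& p) { return static_cost(p); }  // stub

    static Program mutate(Program p) {
        if (!p.empty()) p[std::rand() % p.size()] = std::rand() % 64;
        return p;
    }

    static Program search(Program best, int iters) {
        double best_measured = run_benchmark(best);
        for (int i = 0; i < iters; ++i) {
            Program cand = mutate(best);
            if (!passes_tests(cand)) continue;                    // correctness first
            if (static_cost(cand) > static_cost(best)) continue;  // cheap filter
            double measured = run_benchmark(cand);                // expensive, rare
            if (measured < best_measured) { best = cand; best_measured = measured; }
        }
        return best;
    }

    int main() { return (int)search(Program(8, 1), 10000).size(); }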


(Apologies because I don't have time to read that paper right now, but...)

Program length can be evaluated (more or less) without ambiguity, non-determinism, or the need for actual benchmarking.

In contrast, program performance (in practice) is hard to accurately predict with a model, and on a typical computer system it is affected by a lot of hidden initial state and other random variables.

So unless you're content to use a simplistic model to grade a program's performance, I would guess that searching over that space would require actual benchmarking, which could significantly slow the search process. (Although various techniques like ceasing evaluation after a program takes longer than the best-so-far might help.)


If you exclude branches, integer division, and memory instructions, it's easy to perfectly model your program's performance. Even for memory, if you assume it's all L1 cache hits, performance is again perfectly predictable--and if it's not all L1 cache hits, the optimization of the surrounding code basically doesn't matter (not on the level of superoptimization).
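
A toy version of such a static model, with placeholder latency numbers rather than figures for any real microarchitecture:

    // Toy static cost model in the spirit described above: each opcode gets
    // a fixed latency, and a straight-line, serially dependent sequence
    // costs the sum. The numbers are illustrative placeholders.
    #include <vector>

    enum class Op { Add, Shl, Xor, Mul, Load };

    static int latency(Op op) {
        switch (op) {
            case Op::Add:
            case Op::Shl:
            case Op::Xor:  return 1;
            case Op::Mul:  return 3;   // multiplies cost a few cycles
            case Op::Load: return 4;   // assumes an L1 hit, per the comment
        }
        return 1;
    }

    // A fuller model would walk the dependence graph and respect port and
    // throughput limits instead of assuming one long dependency chain.
    static int sequence_cost(const std::vector<Op>& seq) {
        int cycles = 0;
        for (Op op : seq) cycles += latency(op);
        return cycles;
    }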

The state of the art for superoptimization would be Aiken's work (linked elsewhere in the comments). The hardest part here is building formal specifications for every single instruction, but their group also showed that it's possible to automatically generate these from a small seed set using similar techniques.


> If you exclude branches, integer division, and memory instructions, it's easy to perfectly model your program's performance.

Assuming you're right about that, those seem to me like very big caveats.

Also, wouldn't thermals-based clock throttling be a factor as well, assuming that we're talking about wall-clock performance instead of clock-cycles performance? I'd be surprised if its impact on program performance is easy to perfectly model.


Superoptimization looks mainly at peephole optimizations. That means looking at small fragments of code where you're trying to shave off one or two cycles. STOKE (the superoptimizer I alluded to in my comment) scales to dozens of instructions, still requires loop-free code (because SMT solvers don't handle loops at all), and I'm not sure it handles branch cases more complicated than nested if-else [1].

The correctness oracles we need for superoptimization require the use of formal methods - basically SMT libraries these days. And modern SMT libraries tend to barely work when you deal with division (and remainder) operations, or if you use floating-point code [2].

So, essentially, the small size of the code you can apply this to generally means you can get away with ignoring these issues. The biggest exception is branch prediction, where a cmov is better than a branch if and only if the condition is unpredictable.
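
To make the correctness-oracle part concrete, here's roughly the shape of query it answers, sketched with Z3's C++ API (z3++.h) on a branchless min of the kind cmov gets used for:

    // A correctness-oracle style query: prove that a branchless unsigned min
    // (the mask-and-select trick) agrees with the obvious branching version
    // for every 32-bit input.
    #include <z3++.h>
    #include <iostream>

    int main() {
        z3::context c;
        z3::expr a = c.bv_const("a", 32);
        z3::expr b = c.bv_const("b", 32);

        // Branching version: if (a < b) a else b, unsigned compare.
        z3::expr branchy = z3::ite(z3::ult(a, b), a, b);

        // Branchless version: b ^ ((a ^ b) & mask), mask = all-ones iff a < b.
        z3::expr mask = z3::ite(z3::ult(a, b), c.bv_val(-1, 32), c.bv_val(0, 32));
        z3::expr branchless = b ^ ((a ^ b) & mask);

        // Ask for any input where they disagree; unsat means equivalent.
        z3::solver s(c);
        s.add(branchy != branchless);
        std::cout << s.check() << "\n";  // expect: unsat
        return 0;
    }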

> Also, wouldn't thermals-based clock throttling be a factor as well, assuming that we're talking about wall-clock performance instead of clock-cycles performance?

Clock-cycle performance will largely be monotonic in wall-clock performance, especially at the level of peephole optimization. You're not adjusting the amount of memory traffic, or other microarchitectural state, significantly. The difference in energy between using a shift instruction versus a multiply versus an add isn't going to cause enough thermal difference to trigger throttling. To really increase energy usage enough to change frequency, you'd have to go from unvectorized code to using the largest register width, and that just isn't going to happen in a superoptimizer--your loop and SLP vectorizers will be the ones to change that. And once you've triggered the AVX-512-heavy frequency level, you're still going to want the variant that uses fewer clock cycles anyway.

[1] It might work based on enumerating every path via path conditions, which could handle threaded jumps, but means complexity is roughly exponential in the number of conditions. Given that we're not really pushing hundreds of instructions yet, that means that branching has to be very minimal in anything our current engines can work with.

[2] The typical approach here, I believe, is uninterpreted functions, so that the SMT solver is largely only capable of applying the commutativity, associativity, distributivity, and other axioms you explicitly put into it.


I've done some very, very preliminary work to pave the way towards iteration and recursion in superoptimizers, but you are correct that "straight-line over tiny problems" is mostly where it's at.

Even multiplication ops are a real bear for SMT. I can get them to work, but they are way more expensive than other arithmetic.

From what I understand, the whole 'approximate with uninterpreted functions' approach can be done under the hood by SMT solvers during bit-blasting - more expensive operations can be expanded lazily (and until then treated as UFs) in the hope that useful properties emerge. Logics with floating-point theories do exist in the SMT benchmarks, but like a big coward I haven't played with them.
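
A toy version of that UF treatment, again assuming Z3's C++ API: declare 'mul' as an uninterpreted function, hand the solver only a commutativity axiom, and see what it can and cannot conclude:

    // Toy uninterpreted-function treatment with Z3 (z3++.h): "mul" is just a
    // symbol, and the only thing the solver knows about it is the single
    // commutativity axiom we assert.
    #include <z3++.h>
    #include <iostream>

    int main() {
        z3::context c;
        z3::sort bv32 = c.bv_sort(32);
        z3::func_decl mul = c.function("mul", bv32, bv32, bv32);

        z3::expr a = c.bv_const("a", 32), b = c.bv_const("b", 32);
        z3::expr x = c.bv_const("x", 32), y = c.bv_const("y", 32);

        z3::solver s(c);
        s.add(z3::forall(x, y, mul(x, y) == mul(y, x)));  // the only axiom

        // Follows from the axiom: no counterexample exists.
        s.push();
        s.add(mul(a, b) != mul(b, a));
        std::cout << "mul(a,b) == mul(b,a): " << s.check() << "\n";  // unsat
        s.pop();

        // True of real multiplication, but not derivable from the axiom
        // alone, so the solver finds a model where it fails.
        s.push();
        s.add(mul(a, c.bv_val(2, 32)) != a + a);
        std::cout << "mul(a,2) == a + a:   " << s.check() << "\n";  // sat
        s.pop();
        return 0;
    }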

You can't entirely ignore cost issues in a superoptimizer - the difference between, say, a latency-3 instruction like a multiply and a latency-1 operation might be decisive. But (a) there are ways to model this in your optimization process and (b) fine differences in cost aren't the primary issue when superoptimizing anyhow. The latter point is decisive - I have a benchmark (on bytewise set membership) in my superoptimizer where I examine 25 million 4-instruction specializations and there are only 7 solutions. I think I can spend a few extra cycles evaluating which of those 7 will be faster under a range of transforms, etc. - the expensive part was finding 7/25M solutions to begin with (or, as is often the case, finding 0/25M solutions).


Yes. I think this paper's contribution is not that it pretends to be the final answer, but that it shows the technique is feasible at all.


It's from an era when program length was strongly correlated with program performance. And when it wasn't, there was still a fixed, known cost per instruction type, which could be computed statically.


I was kind of nodding and agreeing with you, but then I remembered a bunch of optimal code from back in those days that was heavily unrolled and very large, so it depended on the type of work being done. I guess one could say that purely algorithmic code tended to have size and speed correlated, whereas code that was bottlenecked on memory reads/writes could benefit from being large. Cache didn't have the big impact it does today.


“This method has been used by the author (along with another program that optimizes code emulating state machines) to write the C library function printf in only 500 bytes.”

I wonder how big libc would have been if they had applied the Superoptimizer to all of its functions.


1987 and I would love to see the source code.


In 1987 we had beautifully typeset papers. A craft which has apparently been lost.


In what ways is this paper better typeset than the typical LaTeX paper?


I think that might be TeX (or LaTeX). But maybe OP was referring to all the Word papers out there?


Correct


It is now Alexia Massalin


Brute force search of the computational universe of simple programs for useful ones – that rings a bell, but I can’t think where I have seen that before!



