The performance problems originate in the design of your trace compiler, not in static vs. dynamic dispatch. Some suggestions:
* The interpreter should have a fast profiling mode (hashed counting of loop backedges) and a slower recording mode (for every instruction call the recorder first, then execute the instruction). Either implement it twice (it's small enough), use a modifiable dispatch table and intercept the dispatch in recording mode (indirect threading), or compute branch offsets relative to a base (change the base to switch modes).
* Don't record traces for a long time and then compile everything together. Do it incrementally:
- Detect a hot loop, switch into recording mode, record a trace, compile it, attach it to the bytecode, switch to profiling mode (which may call your compiled trace right away).
- Make the side exits branch to external stubs which do more profiling (one counter per exit). Start recording hot traces and continue until it hits an existing trace or abort if it hits an uncompiled loop.
- If you completely control the machine code generation (i.e. not with LLVM), you can attach the side traces to their branch points by patching the machine code. Otherwise you may need to use indirections or recompile clusters of the graph after a certain threshold is reached.
- Region selection has a major impact on performance, so be prepared to carefully tune the heuristics.
* Sink all stores, especially updates of the virtual PC, data pointers etc. Don't count on the optimizer to do this for you.
* Due to the nature of the source language you may need to preprocess the IR or you need to teach the optimizer some additional tricks.
- E.g. the loop [-] should really be turned into 'data[0] = 0'.
- Or the loop [->>>+<<<] should be turned into 'data[3] += data[0]; data[0] = 0'.
- It's unlikely any optimizer handles all of these cases, since no sane programmer would write such code ... oh, wait. :-)
> * The interpreter should have a fast profiling mode (hashed counting of loop backedges) and a slower recording mode (for every instruction call the recorder first, then execute the instruction).
It already does this. The recording method is specialized for '[' (since loop headers can only be '['). All other opcodes go through a fast path that simply checks if we're in recording mode and stores to the trace buffer.
> * Don't record traces for a long time and then compile everything together.
The tricky part with this is knowing how to start up the profiler when we hit a side-exit. PC 123 may occur at multiple places in the trace tree. If we want to extend the tree on side-exit, we need to be able to recreate the path through the trace tree that led to that point. In essence, we need the compiled trace to continue updating the trace buffer. Certainly possible, but doesn't seem like a great idea offhand.
> * Sink all stores, especially updates of the virtual PC, data pointers etc. Don't count on the optimizer to do this for you.
Because I'm using tail-call based direct threading, there are no stores to the virtual PC or the data pointer. They're passed in registers to the tail-callee.
> * Due to the nature of the source language you may need to preprocess the IR or you need to teach the optimizer some additional tricks.
Yes, there's a whole range of pre-processing tricks that could be used to accelerate both the interpreter and the traces. I haven't even scratched the surface of that.
Hefty! As an exercise, any sense in inlining the tracing into compiled code? That way you sacrifice the startup time benefits of the JIT (by AOT compiling at start) but still retain trace specialization.
You could certainly do that, though the interface between statically JIT'd code and trace JIT'd code would get hairy. Plus, as I said above, there's simply not a whole lot of trace specialization going on in the first place, since Brainfuck lacks dynamic dispatch.