I can think of a hack that could be faster, but wouldn't be very portable... Probably won't have the time to try it out though: JIT the `needle` text into a bunch of dynamically generated functions matching the characters at specific position, one "miss" (skip len(needle) chars) function and a jump table. Sure - it will need a `255 x len(needle)` jumptable, but 99% of the addresses will point at the "miss" case and you don't need that many cache pages if you're streaming through gigabytes of data. So basically it would be a JIT-unrolled BM.
It would be a "jumps -vs- mispredicted branches" cost challenge, but I'm not sure which would win...
It would be a "jumps -vs- mispredicted branches" cost challenge, but I'm not sure which would win...