The packet would be external and always fit in a cache line. You'd specify the exact instruction using 16-bit positioning. The fetcher would fetch the enclosing 64-bit group, decode, then jump to the proper location in that group.
In the absolute worst-case scenario where you are blindly jumping to the 16-bit instruction in the 4th position, you only fetch 2-3 unnecessary instructions. Decoders do get a lot more interesting on the performance end as each one will decode between 1 and 4 instructions, but this gets offset by the realization that 64-bit instructions will be used by things like SIMD/vector where you already execute fewer instructions overall.
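To make the addressing concrete, here's a minimal sketch (the names and the byte-addressed granularity are my assumptions, not a spec) of how a 16-bit-positioned target maps to its enclosing 64-bit group and slot:

```c
#include <stdint.h>

/* Hypothetical address math for 16-bit positioning inside 64-bit groups. */
static inline uint64_t group_base(uint64_t target) {
    return target & ~7ULL;                  /* enclosing 64-bit (8-byte) group */
}

static inline unsigned group_slot(uint64_t target) {
    return (unsigned)((target >> 1) & 3);   /* which 16-bit slot, 0..3 */
}

/* Worst case from above: a jump to slot 3 still fetches the whole group,
   so the 48 bits ahead of the target (2-3 instructions) come along anyway. */
```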
The move to 64-bit groups also means you can increase cache size without blowing out your latency.
VLIW doesn't mean strictly static scheduling. Even Itanic was just decoding into a traditional backend by the time it retired. You'd view it more as optional parallelism hints, applied only when marked.
I'd also note that it matches up with VLIW rather well. 64-bit instructions will tend to be SIMD instructions or very long jumps. Both of these are fine without VLIW.
Two 32-bit instructions make it a lot easier to find parallelism and they have lots of room to mark exactly when they are VLIW and when they are not. One 32-bit with two 16-bit still gives the 32-bit room to mark if it's VLIW, so you can turn it off on the worst cases.
The only point where it potentially becomes hard is four 16-bit instructions, but you can either lose a bit of density switching to the 32+16+16 format to not be parallel or you can use all 4 together and make sure they're parallel (or add another marker bit, but that seems like its own problem).
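As a rough illustration of where that marker could live (the format tags below are invented for this sketch, not part of any spec):

```c
/* Which packet formats have spare encoding room for a "parallel" hint bit,
   following the argument above; purely illustrative. */
typedef enum { FMT_64, FMT_32_32, FMT_32_16_16, FMT_16x4 } pkt_fmt;

static int has_parallel_hint(pkt_fmt f) {
    switch (f) {
    case FMT_64:        return 0;  /* SIMD / long jump: fine without VLIW      */
    case FMT_32_32:     return 1;  /* either 32-bit op has room for the marker */
    case FMT_32_16_16:  return 1;  /* the 32-bit op carries the marker         */
    case FMT_16x4:      return 0;  /* always parallel, or fall back to 32+16+16 */
    }
    return 0;
}
```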
If you're fetching 128-bit cache lines, you're already "wasting" cache. Further, decoding 1-3 NOP instructions isn't much different from decoding 1-3 extra instructions except that it adversely affects total code density.
If you don't want to decode the extra instructions, you don't have to. If the last 2 bits of the jump are zero, you need the whole instruction block. If the last bit is zero, jump to the 35th bit and begin decoding while looking at the first nibble to see if it's a single 32-bit instruction or two 16-bit instructions. And finally, if it ends with a 1, it's the last instruction and must be the last 15 bits.
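A minimal sketch of that dispatch under my reading of the rule (the decode helpers are stubs, and the slot index is the 2-bit 16-bit position of the jump target within the block):

```c
#include <stdint.h>

/* Stubs standing in for the real decode paths. */
static void decode_full_block(uint64_t bits)  { (void)bits; }
static void decode_upper_half(uint32_t bits)  { (void)bits; }
static void decode_last_slot(uint16_t bits)   { (void)bits; }

static void decode_from_target(uint64_t block, unsigned slot) {
    if (slot == 0) {
        /* last 2 bits zero: target is the block start, decode the whole block */
        decode_full_block(block);
    } else if ((slot & 1) == 0) {
        /* last bit zero: start in the upper half, using its first nibble to
           tell one 32-bit instruction from two 16-bit instructions */
        decode_upper_half((uint32_t)(block >> 32));
    } else {
        /* ends in 1: only the trailing 16-bit slot (15 payload bits) is live */
        decode_last_slot((uint16_t)(block >> 48));
    }
}
```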
All that said, if you're using a uop cache and aligning it with I-cache, you're already going to just decode all the things and move on knowing that there's a decent chance you jump back to them later anyway.
But if you don't have a uop cache (which is quite feasible with a RISC-V or AArch64 style ISA), then decode bandwidth is much more important than a few NOPs in icache.
Presumably your high-performance core has at least three of these 64-bit-wide decoders, for a frontend that takes a 64-bit-aligned 192-bit block every cycle and decodes three 64-bit instructions, six 32-bit instructions, twelve 16-bit instructions, or some combination of all sizes every cycle.
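Something like this, with the widths from above and made-up function names (a sketch, not a real frontend):

```c
#include <stdint.h>

/* Placeholder for one 64-bit packet decoder. */
static void decode_packet(uint64_t packet) { (void)packet; }

/* One cycle of the assumed frontend: a 64-bit-aligned 192-bit fetch block
   feeds three packet decoders, so the peak per cycle is three 64-bit,
   six 32-bit, twelve 16-bit instructions, or any mix of sizes. */
static void decode_cycle(const uint64_t fetch_block[3]) {
    for (int i = 0; i < 3; i++)
        decode_packet(fetch_block[i]);
}
```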
If you implement unaligned jump targets, then the decoders still need to fetch 64-bit aligned blocks to get the length bits. For every unaligned jump, that's up to a third of your instruction decode slots sitting idle for the first cycle. This might mean the difference between executing a tight loop in one cycle or two.
A similar thing applies to a low gate count version of the core: a design where the instruction decoder targets one 32-bit or 16-bit instruction per cycle (and a 64-bit instruction every second cycle). On unaligned jumps, such a decoder still needs to load the first 32 bits of the packet to check the length encoding, wasting an entire cycle on every single branch.
Allowing unaligned jump targets might keep a few NOPs out of icache (depending on how good the instruction scheduler is), but it costs you cycles in tight branchy code.
Knowing compiler authors, if you have this style of ISA, even if it does support unaligned jump targets, they are still going to default to inserting NOPs to align every single jump target, just because the performance is notably better on aligned targets and they have no idea whether a given branch target is hot or cold.
So my argument is that you might as well enforce 64-bit jump target alignment anyway. Let all implementations gain the small wins from assuming that all targets are 64-bit aligned, and use the 2 extra bits to give your relative jump instructions four times as much range.
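The range claim in numbers: with a fixed-width offset field, scaling by 8 instead of 2 quadruples the reach (the 20-bit field below is just an example width, not from the thread):

```c
#include <stdint.h>

/* Same immediate field, different scaling. */
static int64_t target_16bit_positioning(int64_t pc, int32_t imm20) {
    return pc + ((int64_t)imm20 << 1);   /* offsets in 16-bit units */
}

static int64_t target_64bit_aligned(int64_t pc, int32_t imm20) {
    return pc + ((int64_t)imm20 << 3);   /* offsets in 64-bit units: 4x range */
}
```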
It's only unconditional jumps that might have NOPs following them.
For conditional jumps (which are pretty common), the extra instructions in the packet will be executed whenever the branch isn't taken.
And instruction scheduling can actually do some optimisation here. If you have a loop with an unconditional jump at the end and an unaligned target, you can do partial loop unrolling, for example:
With [xxx, inst_1, inst_2], [inst_3] ... (loop body) ... [jmp to inst_1, nop, nop], you can repack the final jump packet as [inst_1, inst_2, jmp to inst_3].
This partial loop unrolling is actually much better for performance than not wasting I-cache, as it reduces the number of instruction decode packets per iteration by one. Compilers will implement this anyway, even if you do support mid-packet jump targets.
Finally, compilers already tend to put NOPs after jumps and returns on current ISAs, because they want certain jump targets (function entry points, jump table entries) to be aligned to cache lines.