
As the other reply states, that is effectively the Qualcomm proposal, though note the 16-bit instructions likely gobble up a large amount of your 32-bit instruction space. You have to have something to identify an instruction as 16-bit, which takes up 32-bit encoding space. The larger you make that identifier (in terms of bits), the less encoding space it takes up, but the fewer spare bits you have left to actually encode your 16-bit instruction. RISC-V uses the bottom two bits for this purpose: one value (11) indicates a 32-bit instruction, and the other three are used for 16-bit instructions. So you're dedicating 75% of your 32-bit encoding space to 16-bit instructions.
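
For concreteness, the length rule fits in one line. A sketch in C (ignoring RISC-V's reserved longer-than-32-bit encodings):

    #include <stdint.h>

    /* RISC-V length rule: a 16-bit parcel whose low two bits are
       11 starts a 32-bit instruction; any other value means a
       16-bit compressed instruction. */
    static int insn_length(uint16_t first_parcel) {
        return ((first_parcel & 0x3) == 0x3) ? 32 : 16;
    }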


By requiring alignment, you can halve or better the cost of the identifier.

If you have a 16 bit instruction, you know it must be followed by another 16 bit instruction, so that 2nd instruction doesn't need the identifying bits. Or, more precisely, within a 32 bit slot, the 2^32 possible patterns need to be divided up somehow - and one way to do that is 2^31+2^30 possible 32 bit instructions, plus 2^15 * 2^15 pairs of 16 bit instructions. Now the 16 bit instructions are only taking 25%, not 75%, of the instruction space.
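
A quick sanity check of that arithmetic in C: the two partitions exactly cover the 2^32 bit patterns of a 32-bit slot.

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint64_t slot  = (uint64_t)1 << 32;                         /* 2^32 patterns */
        uint64_t wide  = ((uint64_t)1 << 31) + ((uint64_t)1 << 30); /* 32-bit insns  */
        uint64_t pairs = (uint64_t)1 << 30;  /* 2^15 * 2^15 pairs of 16-bit insns    */
        assert(wide + pairs == slot);        /* 75% + 25% == everything              */
        return 0;
    }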


But now you have two kinds of 16-bit instructions, the ones for the leading position and the ones for the trailing position, and the latter have slightly more available functionality, right? Personally, at this point I'd think the decoder must already be complicated enough (it either has to maintain "leading/trailing/full" state between cycles, or decode 8/16-byte batches at once) that you could simply give up and go for an encoding with completely irregular lengths à la x86 without much additional cost.


Not necessarily.

    000x -- 64-bit instruction that uses 60 bits
    001x -- reserved
    010x -- reserved
    011x -- reserved
    100x -- two 32-bit instructions (each 30-bits)
    101x -- two 16-bit instructions then one 32-bit instruction
    110x -- one 32-bit instruction then two 16-bit instructions
    111x -- four 16-bit instructions (each 15 bits)
    xxx1 -- explicitly parallel
    xxx0 -- not explicitly parallel
Alternatively, you can view these as VLIW instruction bundles. This has the additional potential advantage of allowing some explicitly parallel instructions when convenient.
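
A minimal sketch of classifying a packet under this header, in C. Putting the header in the top nibble is an assumption; the listing above only fixes the three format bits plus the parallel bit.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { F64, F32_32, F16_16_32, F32_16_16, F16_X4, RESERVED } fmt_t;

    static fmt_t classify(uint64_t packet, bool *parallel) {
        unsigned hdr = (unsigned)(packet >> 60); /* assumed top-nibble header  */
        *parallel = hdr & 1;                     /* xxx1 = explicitly parallel */
        switch (hdr >> 1) {
        case 0: return F64;        /* 000x: one 64-bit insn, 60 payload bits */
        case 4: return F32_32;     /* 100x: two 32-bit insns (30 bits each)  */
        case 5: return F16_16_32;  /* 101x: two 16-bit, then one 32-bit      */
        case 6: return F32_16_16;  /* 110x: one 32-bit, then two 16-bit      */
        case 7: return F16_X4;     /* 111x: four 16-bit insns (15 bits each) */
        default: return RESERVED;  /* 001x, 010x, 011x                       */
        }
    }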


One advantage of just sticking with only 32bit instructions is that nobody needs to write packet-aware instruction scheduling.

Even with decent instruction scheduling, you are still going to end up with a bunch of instruction slots filled with nops.

And it will be even worse if you take the next step to make it VLIW and require static scheduling within a packet.


In this case, it's probably not as bad as with actual VLIW: if you see that e.g. your second 16-bit instruction would have to be a NOP, you just use a single 32-bit instruction instead; similarly for the 32- and 64-bit mixes.


The packet would be external and always fit in a cache line. You'd specify the exact instruction using 16-bit positioning. The fetcher would fetch the enclosing 64-bit group, decode, then jump to the proper location in that group.

In the absolute worst-case scenario where you are blindly jumping to the 16-bit instruction in the 4th position, you only fetch 2-3 unnecessary instructions. Decoders do get a lot more interesting on the performance end as each one will decode between 1 and 4 instructions, but this gets offset by the realization that 64-bit instructions will be used by things like SIMD/vector where you already execute fewer instructions overall.
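
For concreteness, locating the enclosing group from a 16-bit-positioned target is a couple of mask operations. A sketch, with targets counted in 16-bit parcels as described above:

    #include <stdint.h>

    /* Given a jump target counted in 16-bit parcels, find the
       enclosing 64-bit packet and the entry slot within it. */
    static void locate(uint64_t target, uint64_t *packet, unsigned *slot) {
        *packet = target & ~(uint64_t)3;  /* round down to the 4-parcel group */
        *slot   = (unsigned)(target & 3); /* 0..3: 16-bit position inside it  */
    }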

The move to 64-bit groups also means you can increase cache size without blowing out your latency.

VLIW doesn't mean strictly static scheduling. Even Itanic was just decoding into a traditional backend by the time it retired. You would view it more as optional parallelism hints when marked.

I'd also note that it matches up with VLIW rather well. 64-bit instructions will tend to be SIMD instructions or very long jumps. Both of these are fine without VLIW.

Two 32-bit instructions make it a lot easier to find parallelism and they have lots of room to mark exactly when they are VLIW and when they are not. One 32-bit with two 16-bit still gives the 32-bit room to mark if it's VLIW, so you can turn it off on the worst cases.

The only point where it potentially becomes hard is four 16-bit instructions, but you can either lose a bit of density switching to the 32+16+16 format to not be parallel or you can use all 4 together and make sure they're parallel (or add another marker bit, but that seems like its own problem).


I think if you have 64bit packets, you might as well align jump targets to the 64bit boundary.

I'd rather have an extra NOP or two before jump targets than blindly throw away 1-3 instructions' worth of decoding bandwidth on jumps (which are often hot).


If you're fetching 128-bit cache lines, you're already "wasting" cache. Further, decoding 1-3 NOPs isn't much different from decoding 1-3 extra real instructions, except that the NOPs also hurt total code density.

If you don't want to decode the extra instructions, you don't have to. If the last 2 bits of the jump target are zero, you need the whole instruction block. If only the last bit is zero, jump to the 35th bit and begin decoding, looking at the first nibble to see whether it's a single 32-bit instruction or two 16-bit instructions. And finally, if the target ends in 1, it's the last instruction and must be the last 15 bits.
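
The same steering logic as a C sketch (entry names are illustrative; targets count 16-bit parcels, and bit 35 follows from the assumed 4-bit header):

    #include <stdint.h>

    typedef enum { WHOLE_PACKET, SECOND_HALF, TRAILING_16 } entry_t;

    static entry_t steer(uint64_t target) {
        switch (target & 3) {
        case 0:  return WHOLE_PACKET; /* aligned: decode the whole 64-bit block */
        case 2:  return SECOND_HALF;  /* enter at bit 35; the nibble says one   */
                                      /* 32-bit insn or two 16-bit insns        */
        default: return TRAILING_16;  /* ends in 1: the last 15-bit insn        */
        }
    }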

All that said, if you're using a uop cache and aligning it with I-cache, you're already going to just decode all the things and move on knowing that there's a decent chance you jump back to them later anyway.


But if you don't have a uop cache (which is quite feasible with a RISC-V or AArch64 style ISA), then decode bandwidth is much more important than a few NOPs in icache.

Presumably your high performance core has at least three of these 64bit wide decoders, for a frontend that takes a 64bit aligned 192bit block every cycle and decodes three 64bit instructions, six 32bit instructions, twelve 16bit instructions, or some combination of the sizes.

If you implement unaligned jump targets, then the decoders still need to fetch 64bit aligned blocks to get the length bits. For every unaligned jump, that's up to a third of your instruction decode slots sitting idle for the first cycle. This might mean the difference between executing a tight loop in one cycle or two.

A similar thing applies to a low gate count version of the core: a design where your instruction decoder targets one 32bit or 16bit instruction per cycle (and a 64bit instruction every second cycle). On unaligned jumps, such a decoder still needs to load the first 32bits of the packet to check the length bits, wasting an entire cycle on every single branch.

Allowing unaligned jump targets might keep a few NOPs out of icache (depending on how good the instruction scheduler is), but it costs you cycles in tight branchy code.

Knowing compiler authors, if you have this style of ISA, even if it does support unaligned jump targets, they are still going to default to inserting NOPs to align every single jump target, just because performance is notably better on aligned jump targets and they have no idea whether a given branch target is hot or cold.

So my argument is that you might as well enforce 64bit jump target alignment anyway. Let all implementations gain the small wins from assuming all targets are 64bit aligned, and use the 2 freed bits to give your relative jump instructions four times as much range.


Which is easier to decode?

[jmp nop nop], [addi xxx]

OR

[xxx jmp], [nop nop addi]

OR

[xxx jmp], [unused, addi]

All of these tie up your entire decoder, but some tie it up with potentially useful information. That seems superior to me.


It's only unconditional jumps that might have NOPs following them.

For conditional jumps (which are pretty common), the extra instructions in the packet will be executed whenever the branch isn't taken.

And instruction scheduling can actually do some optimisation here. If you have a loop with an unconditional jump at the end and an unaligned target, you can do partial loop unrolling, for example:

With

    [xxx, inst_1, inst_2] [inst_3] ... (loop body) ... [jmp to inst_1, nop, nop]

you can repack the final jump packet as

    [inst_1, inst_2, jmp to inst_3]

This partial loop unrolling is actually much better for performance than not wasting I-cache, as it reduces the number of instruction decoder packets per iteration by one. Compilers will implement this anyway, even if you do support mid-packet jump targets.

Finally, compilers already tend to put nops after jumps and returns on current ISAs, because they want certain jump targets (function entry points, jump table entries) to be aligned to cache lines.


Don't forget the possibility of 5x 12-bit instructions. In particular, if you have only one or two possibilities for destination registers for each of the 5 positions (so an accumulator-like model), you could still have a quite useful set of 12-bit instructions.


No, the idea was to say "prefix 0" => 31-bit, "prefix 10" => 30-bit, "prefix 11" => 2*15-bit. If you need to, you can split the two prefix bits so that the two 15-bit chunks are aligned identically.
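
As a sketch, reading that prefix from the low bits (which end it lives at is an assumption):

    #include <stdint.h>

    /* Returns the payload width selected by the prefix. */
    static int payload_bits(uint32_t w) {
        if ((w & 1) == 0) return 31; /* prefix "0":  one 31-bit instruction  */
        if ((w & 2) == 0) return 30; /* prefix "10": one 30-bit instruction  */
        return 15;                   /* prefix "11": two 15-bit instructions */
    }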



