The packet would be external and always fit in a cache line. You'd specify the exact instruction using 16-bit positioning. The fetcher would fetch the enclosing 64-bit group, decode, then jump to the proper location in that group.
In the absolute worst-case scenario where you are blindly jumping to the 16-bit instruction in the 4th position, you only fetch 2-3 unnecessary instructions. Decoders do get a lot more interesting on the performance end as each one will decode between 1 and 4 instructions, but this gets offset by the realization that 64-bit instructions will be used by things like SIMD/vector where you already execute fewer instructions overall.
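To make the addressing concrete, here's a minimal sketch (the names and the byte-addressed granularity are my assumptions, not a spec) of how a 16-bit-positioned target maps to its enclosing 64-bit group and slot:

```c
#include <stdint.h>

/* Hypothetical address math for 16-bit positioning inside 64-bit groups. */
static inline uint64_t group_base(uint64_t target) {
    return target & ~7ULL;                  /* enclosing 64-bit (8-byte) group */
}

static inline unsigned group_slot(uint64_t target) {
    return (unsigned)((target >> 1) & 3);   /* which 16-bit slot, 0..3 */
}

/* Worst case from above: a jump to slot 3 still fetches the whole group,
   so the 48 bits ahead of the target (2-3 instructions) come along anyway. */
```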
The move to 64-bit groups also means you can increase cache size without blowing out your latency.
VLIW doesn't mean strictly static scheduling. Even Itanic was just decoding into a traditional backend by the time it retired. You'd view it more as optional parallelism hints, applied only when marked.
I'd also note that it matches up with VLIW rather well. 64-bit instructions will tend to be SIMD instructions or very long jumps. Both of these are fine without VLIW.
Two 32-bit instructions make it a lot easier to find parallelism and they have lots of room to mark exactly when they are VLIW and when they are not. One 32-bit with two 16-bit still gives the 32-bit room to mark if it's VLIW, so you can turn it off on the worst cases.
The only point where it potentially becomes hard is four 16-bit instructions, but you can either lose a bit of density switching to the 32+16+16 format to not be parallel or you can use all 4 together and make sure they're parallel (or add another marker bit, but that seems like its own problem).
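As a rough illustration of where that marker could live (the format tags below are invented for this sketch, not part of any spec):

```c
/* Which packet formats have spare encoding room for a "parallel" hint bit,
   following the argument above; purely illustrative. */
typedef enum { FMT_64, FMT_32_32, FMT_32_16_16, FMT_16x4 } pkt_fmt;

static int has_parallel_hint(pkt_fmt f) {
    switch (f) {
    case FMT_64:        return 0;  /* SIMD / long jump: fine without VLIW      */
    case FMT_32_32:     return 1;  /* either 32-bit op has room for the marker */
    case FMT_32_16_16:  return 1;  /* the 32-bit op carries the marker         */
    case FMT_16x4:      return 0;  /* always parallel, or fall back to 32+16+16 */
    }
    return 0;
}
```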
If you're fetching 128-bit cache lines, you're already "wasting" cache. Further, decoding 1-3 NOP instructions isn't much different from decoding 1-3 extra instructions except that it adversely affects total code density.
If you don't want to decode the extra instructions, you don't have to. If the last 2 bits of the jump are zero, you need the whole instruction block. If the last bit is zero, jump to the 35th bit and begin decoding while looking at the first nibble to see if it's a single 32-bit instruction or two 16-bit instructions. And finally, if it ends with a 1, it's the last instruction and must be the last 15 bits.
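A minimal sketch of that dispatch under my reading of the rule (the decode helpers are stubs, and the slot index is the 2-bit 16-bit position of the jump target within the block):

```c
#include <stdint.h>

/* Stubs standing in for the real decode paths. */
static void decode_full_block(uint64_t bits)  { (void)bits; }
static void decode_upper_half(uint32_t bits)  { (void)bits; }
static void decode_last_slot(uint16_t bits)   { (void)bits; }

static void decode_from_target(uint64_t block, unsigned slot) {
    if (slot == 0) {
        /* last 2 bits zero: target is the block start, decode the whole block */
        decode_full_block(block);
    } else if ((slot & 1) == 0) {
        /* last bit zero: start in the upper half, using its first nibble to
           tell one 32-bit instruction from two 16-bit instructions */
        decode_upper_half((uint32_t)(block >> 32));
    } else {
        /* ends in 1: only the trailing 16-bit slot (15 payload bits) is live */
        decode_last_slot((uint16_t)(block >> 48));
    }
}
```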
All that said, if you're using a uop cache and aligning it with I-cache, you're already going to just decode all the things and move on knowing that there's a decent chance you jump back to them later anyway.
But if you don't have a uop cache (which is quite feasible with a RISC-V or AArch64 style ISA), then decode bandwidth is much more important than a few NOPs in icache.
Presumably your high-performance core has at least three of these 64-bit-wide decoders, for a frontend that takes a 64-bit-aligned 192-bit block every cycle and decodes three 64-bit instructions, six 32-bit instructions, twelve 16-bit instructions, or some combination of all sizes every cycle.
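Something like this, with the widths from above and made-up function names (a sketch, not a real frontend):

```c
#include <stdint.h>

/* Placeholder for one 64-bit packet decoder. */
static void decode_packet(uint64_t packet) { (void)packet; }

/* One cycle of the assumed frontend: a 64-bit-aligned 192-bit fetch block
   feeds three packet decoders, so the peak per cycle is three 64-bit,
   six 32-bit, twelve 16-bit instructions, or any mix of sizes. */
static void decode_cycle(const uint64_t fetch_block[3]) {
    for (int i = 0; i < 3; i++)
        decode_packet(fetch_block[i]);
}
```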
If you implement unaligned jump targets, then the decoders still need to fetch 64-bit aligned blocks to get the length bits. For every unaligned jump, that's up to a third of your instruction decode slots sitting idle for the first cycle. This might mean the difference between executing a tight loop in one cycle or two.
A similar thing applies to a low gate count version of the core: a design where the instruction decoder targets one 32-bit or 16-bit instruction per cycle (and a 64-bit instruction every second cycle). On unaligned jumps, such a decoder still needs to load the first 32 bits of the packet to check the length encoding, wasting an entire cycle on every single branch.
Allowing unaligned jump targets might keep a few NOPs out of icache (depending on how good the instruction scheduler is), but it costs you cycles in tight branchy code.
Knowing compiler authors, if you have this style of ISA, even if it does support unaligned jump targets, they are still going to default to inserting NOPs to align every single jump target, just because the performance is notably better on aligned targets and they have no idea whether a given branch target is hot or cold.
So my argument is that you might as well enforce 64-bit jump target alignment anyway. Let all implementations gain the small wins from assuming that all targets are 64-bit aligned, and use the 2 extra bits to give your relative jump instructions four times as much range.
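The range claim in numbers: with a fixed-width offset field, scaling by 8 instead of 2 quadruples the reach (the 20-bit field below is just an example width, not from the thread):

```c
#include <stdint.h>

/* Same immediate field, different scaling. */
static int64_t target_16bit_positioning(int64_t pc, int32_t imm20) {
    return pc + ((int64_t)imm20 << 1);   /* offsets in 16-bit units */
}

static int64_t target_64bit_aligned(int64_t pc, int32_t imm20) {
    return pc + ((int64_t)imm20 << 3);   /* offsets in 64-bit units: 4x range */
}
```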
It's only unconditional jumps that might have NOPs following them.
For conditional jumps (which are pretty common), the extra instructions in the packet will be executed whenever the branch isn't taken.
And instruction scheduling can actually do some optimisation here. If you have a loop with an unconditional jump at the end and an unaligned target, you can do partial loop unrolling, for example:
With [xxx, inst_1, inst_2], [inst_3] ... (loop body) ... [jmp to inst_1, nop, nop], you can repack the final jump packet as [inst_1, inst_2, jmp to inst_3].
This partial loop unrolling is actually much better for performance than not wasting I-cache, as it reduces the number of instruction decode packets per iteration by one. Compilers will implement this anyway, even if you do support mid-packet jump targets.
Finally, compilers already tend to put NOPs after jumps and returns on current ISAs, because they want certain jump targets (function entry points, jump table entries) to be aligned to cache lines.