> you can write IR code that multiplies _any_ number of floats and the backend "should" generate reasonable machine code for any architecture.
Except that it doesn't. If you write your code with <2 x float> and codegen for SSE, you'll get <4 x float> code, with two elements going unused. It's functionally correct, but you're potentially missing out on half the throughput. If you write your code with <8 x float>, you'll get two registers for each value, but this can create extra register pressure without actually giving you any increased throughput in return.
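To put rough numbers on that, here's a back-of-the-envelope sketch. It is a toy model, not what the backend actually runs: it assumes 128-bit SSE registers and 32-bit floats, and `legalize` is a made-up helper name, not an LLVM API. It just counts how many registers one vector value occupies and what fraction of the lanes carry useful data.

```python
import math

def legalize(lanes, elem_bits=32, reg_bits=128):
    """Toy model: return (registers_per_value, lane_utilization).

    Assumes 128-bit registers (SSE) and 32-bit elements (float) by default.
    Real type legalization in a compiler backend is more involved.
    """
    lanes_per_reg = reg_bits // elem_bits            # 4 floats fit in one SSE register
    regs = math.ceil(lanes / lanes_per_reg)          # narrow vectors still take a whole register
    utilization = lanes / (regs * lanes_per_reg)     # fraction of lanes doing useful work
    return regs, utilization

for lanes in (2, 4, 8):
    regs, util = legalize(lanes)
    print(f"<{lanes} x float>: {regs} register(s) per value, {util:.0%} of lanes used")
```

Under those assumptions, `<2 x float>` uses one register at 50% lane utilization, `<4 x float>` uses one register fully, and `<8 x float>` needs two registers per value. The model says nothing about throughput itself; it only shows where the wasted lanes and the extra register pressure come from.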