> you can write IR code that multiplies _any_ number of floats and the backend "should" generate reasonable machine code for any architecture.

Except that it doesn't. If you write your code with <2 x float> and codegen for SSE, you'll get <4 x float> code, with two of the four lanes going unused. It's functionally correct, but you're potentially missing out on half the throughput. If you write your code with <8 x float>, each value gets split across two 128-bit registers, which can create extra register pressure without actually giving you any increased throughput in return.
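To see this for yourself without writing IR by hand, here is a minimal sketch in C using the GCC/Clang `vector_size` extension, which lowers to the corresponding LLVM vector types under Clang. The type names (`v2f`, `v4f`, `v8f`) and function names (`mul2`, `mul4`, `mul8`) are made up for illustration, and the comments describe what I'd expect from an SSE-only target, not guaranteed output; the exact instructions depend on the compiler, version, and flags.

```c
#include <string.h>

/* GCC/Clang vector extensions. Type names are hypothetical, chosen for this sketch. */
typedef float v2f __attribute__((vector_size(8)));   /* roughly <2 x float> */
typedef float v4f __attribute__((vector_size(16)));  /* roughly <4 x float>, one SSE register */
typedef float v8f __attribute__((vector_size(32)));  /* roughly <8 x float> */

/* Two floats per call. On SSE this is expected to occupy a 128-bit
   register with two lanes idle. */
void mul2(const float *a, const float *b, float *out) {
    v2f x, y, r;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    r = x * y;                      /* element-wise multiply */
    memcpy(out, &r, sizeof r);
}

/* Four floats per call: the natural width for SSE. */
void mul4(const float *a, const float *b, float *out) {
    v4f x, y, r;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    r = x * y;
    memcpy(out, &r, sizeof r);
}

/* Eight floats per call. Without AVX each value is expected to be split
   across two 128-bit registers, so this is two multiplies, not one. */
void mul8(const float *a, const float *b, float *out) {
    v8f x, y, r;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    r = x * y;
    memcpy(out, &r, sizeof r);
}
```

All three compile and give the right answers on any target, which is exactly the "functionally correct" part. Compiling with something like `clang -O2 -msse2 -S` and then again with `-mavx` and comparing the assembly for the three functions is where the width mismatch shows up.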