
> you can write IR code that multiplies _any_ number of floats and the backend "should" generate reasonable machine code for any architecture.

Except that it doesn't. If you write your code with <2 x float> and codegen for SSE, you'll get <4 x float> code, with two elements going unused. It's functionally correct, but you're potentially missing out on half the throughput. If you write your code with <8 x float>, you'll get two registers for each value, but this can create extra register pressure without actually giving you any increased throughput in return.
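To put rough numbers on that, here is a toy model of the arithmetic, not LLVM's actual type legalization. It assumes a 128-bit SSE register holds 4 floats and that a vector value is simply padded or split to fit whole registers; the `legalize` function and its `hw_lanes` parameter are made-up names for illustration.

```python
import math

def legalize(ir_lanes, hw_lanes=4):
    """Toy model (assumption): map an IR vector of `ir_lanes` floats onto
    hardware registers with `hw_lanes` float lanes each (4 for 128-bit SSE).

    Returns (registers needed per value, fraction of lanes doing useful work).
    """
    # A narrow vector is padded up to one full register;
    # a wide vector is split across several registers.
    regs = math.ceil(ir_lanes / hw_lanes)
    utilization = ir_lanes / (regs * hw_lanes)
    return regs, utilization

if __name__ == "__main__":
    for n in (2, 4, 8):
        regs, util = legalize(n)
        print(f"<{n} x float>: {regs} register(s) per value, {util:.0%} of lanes used")
```

Under those assumptions, `<2 x float>` fills one register at half utilization, while `<8 x float>` takes two fully used registers per value. The doubled register count is where the extra pressure comes from.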




Other optimization passes will widen the vectorization width to 4, if possible, by unrolling the loop!



