
For rvv specifically there are a few things that aren't possible using the intrinsics abstraction.

E.g. in asm you can run the same instruction sequence with different vtype (element width and LMUL).




We are able to do the same with Highway's RVV :)


I believe what camel-cdr is describing is running the same code (say, a loop) with no vsetvl-s inside and without duplication, by conditionally choosing an initial "vsetvli x0,x0,e32,m1" or "vsetvli x0,x0,e16,m1" or "vsetvli x0,x0,e32,m2" etc. That is just unachievable with the intrinsics, as they hard-code vtype in each intrinsic.

It's an extremely fun idea (primarily just for code size though), but thankfully (?) its usability is restricted by load/store instrs hard-coding the element width, so the main use of this would end up being switching LMUL, which has very limited usefulness.

What Highway can support is generating multiple loops of different vtype from the same code, which effectively achieves the same thing, at the cost of machine code duplication.
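To illustrate the "same source, duplicated machine code" trade-off, here is a minimal sketch using plain C++ templates. It is not Highway's actual API; `AddOne` is a hypothetical helper, and the loop is scalar, standing in for a vector loop whose vtype would be fixed per instantiation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, NOT Highway's API: one source-level loop that is
// instantiated once per element type.
template <typename T>
void AddOne(T* data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    data[i] = static_cast<T>(data[i] + 1);
  }
}

// Explicit instantiations: each one gets its own copy of the machine code,
// which is the duplication cost mentioned above.
template void AddOne<std::uint16_t>(std::uint16_t*, std::size_t);
template void AddOne<std::uint32_t>(std::uint32_t*, std::size_t);
```

The asm trick would instead keep a single loop body and only vary the vsetvli executed before it.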


> for switching LMUL, which has very limited usefulness

I currently have a quite useful use case for it: I'm converting utf8 to utf32, and if I've got an average utf8 character size above 2, I could reduce the LMUL for that loop iteration.
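A rough scalar sketch of that decision, under my own assumptions: `CountUtf8Chars` and `PickOutputLmul` are hypothetical names, `vlenb` is the vector register size in bytes, and a real implementation would presumably get the character count from vector mask ops rather than a byte loop.

```cpp
#include <cstddef>
#include <cstdint>

// Counts code points in a UTF-8 chunk: every byte that is not a
// continuation byte (0b10xxxxxx) starts a new character.
std::size_t CountUtf8Chars(const std::uint8_t* p, std::size_t n) {
  std::size_t chars = 0;
  for (std::size_t i = 0; i < n; ++i) {
    chars += (p[i] & 0xC0) != 0x80;
  }
  return chars;
}

// Hypothetical heuristic: smallest LMUL in {1, 2, 4, 8} whose register
// group can hold `chars` 32-bit outputs, given `vlenb` bytes per register.
unsigned PickOutputLmul(std::size_t chars, std::size_t vlenb) {
  const std::size_t per_reg = vlenb / 4;  // 32-bit elements per register
  unsigned lmul = 1;
  while (lmul < 8 && chars > lmul * per_reg) {
    lmul *= 2;
  }
  return lmul;
}
```

With a chunk of one vector register of input bytes, pure ASCII needs the full 4x widening, while a chunk averaging more than 2 bytes per character fits its utf32 output in half of that.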

This shouldn't actually improve performance that much in good rvv implementations, since you can use vl and not LMUL to schedule your execution units. Sadly this is currently not the standard, and ara is the only implementation I know of that does this.

I think this wouldn't even be about code size reduction. Consider an input where there is basically a 50/50 probability that LMUL can be reduced: that would be horrible for the branch predictor, but with only a branch over a vsetvl, this could behave as a conditional vsetvl via instruction fusion. We'll have to see if such optimizations become relevant once there is more hardware out there.


That's an interesting use-case, though I wouldn't be surprised if some impls really wouldn't like LMUL dynamically switching at runtime a lot (i.e. something like LMUL being forwarded at decode-time, so it couldn't decode after an unknown-LMUL vsetvl, ruining perf)


Oh, I see, thanks for clarifying. Yes, I was referring only to "same source code" and agree our approach would generate multiple copies of the instructions.



