SIMD Everywhere Optimization from ARM Neon to RISC-V Vector Extensions (arxiv.org)
107 points by camel-cdr on Sept 29, 2023 | 53 comments



Highway (https://github.com/google/highway), Google's SIMD library, lets you write length-agnostic SIMD code. It has excellent support for a wide range of targets, including both RISC-V and Arm vector extensions.


second this



Very neat, I hope this will get easier to do in the future once languages start including these SIMD semantics in the language itself, like Rust tries to do:

https://doc.rust-lang.org/std/simd/struct.Simd.html

Libraries implemented in languages without these semantics will greatly benefit from this.


A problem is that most such things (the Rust thing, C++'s experimental/simd, Zig's SIMD types) have the vector size as a compile-time property, while ARM's SVE and RISC-V's RVV are designed such that it's possible to write portable code that can work for a range of implementation widths. Thus such a fixed-width SIMD library would be forced to target the minimum (128-bit) even if the hardware supports 256-bit, 512-bit, or more. (SVE supports up to 1024-bit, RVV up to 65536-bit.)

There is Highway (https://github.com/google/highway), however, which does support dynamically-sized SIMD.
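For a flavor of what that looks like, here's a minimal sketch in Highway's static-dispatch style (following its quick-start; AddArrays is a made-up name):

    #include <cstddef>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    // Length-agnostic add: Lanes(d) is 4 on 128-bit NEON, but only known at
    // runtime on SVE/RVV; the same source covers all of these targets.
    void AddArrays(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;
      const size_t N = hn::Lanes(d);
      size_t i = 0;
      for (; i + N <= n; i += N) {
        hn::Store(hn::Add(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
      for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
    }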


What ahead-of-time compiled languages tend to do is generate all the variants and then patch in the one to use at startup (or, if they don't patch, do a cheap dynamic dispatch: the CPU will predict through the branch, and the selection happens outside the hot path, or else you wouldn't have bothered to vectorize in the first place).

For example:

https://docs.rs/simdeez/latest/simdeez/
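Roughly, the resolve-once pattern looks like this on x86 (a hedged sketch; add_sse2 and add_avx2 are hypothetical per-target kernels):

    #include <cstddef>

    void add_sse2(const float*, const float*, float*, std::size_t);
    void add_avx2(const float*, const float*, float*, std::size_t);

    using AddFn = void (*)(const float*, const float*, float*, std::size_t);

    static AddFn Resolve() {
      __builtin_cpu_init();  // GCC/Clang builtin; queries CPUID
      return __builtin_cpu_supports("avx2") ? add_avx2 : add_sse2;
    }

    // Resolved once at startup; later calls are one well-predicted
    // indirect branch, taken outside the hot loop.
    static const AddFn add_floats = Resolve();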


If you compile the code by specifying the target as native, you could get around that limitation, no?


Yes, but then, if distributing binaries, you need a different binary for each SIMD width.


You can actually autogenerate all reasonable variants of the code if necessary; there aren't that many architectures these days. SIMD instructions are usually very local, so this shouldn't blow up the binary.

The point is to not have to write repetitive source code many times.


What is the real cost of just having those few methods compiled in and then a branch? You don't need to ship a separate binary for each target; you can have dead code in it. I mean, fat binaries take this idea to the extreme to support multiple architectures.

https://en.wikipedia.org/wiki/Fat_binary


That's 5 copies for SVE (had an error in my first message - SVE allows up to 2048-bit vectors, not 1024), and 10 copies for RVV if you wanted to target all widths (though you'd probably be fine for a decade or two by targeting just 128- and 256-bit, and maybe 512-bit). Plus one more for a scalar fallback.

And yes, it's not particularly large of a cost, other than it being an extremely pointless waste of space given that it is possible to have just one variant that covers them all.

Though it would become significantly more problematic if you wanted to target different extension groups too (which you would quite likely want to some extent), as those would multiply with all the length targets - SVE vs SVE2 vs more future extensions, and on RVV there's just a lot (Zvfh & Zvfhmin for FP16, Zvbb for extra bitmanip stuff, many more here [1]; and potentially at some point there could be an extension that uses a wider encoding scheme to inline vsetvl fields & allow masking by registers other than v0, which could benefit everything)

[1]: https://github.com/riscv/riscv-crypto/blob/c8ddeb7e64a3444dd...


Prior to SVE2, string processing algorithms already had to ship a bunch of copies for SVE because it lacked a "shuffle the 16-byte lanes of A according to the shuffle in the 16-byte lanes of B" instruction :(


Not being able to inline, and having to branch on every call to a SIMD function, can sometimes make it slower than the basic scalar version.


You should branch at the level where inlining doesn't make sense, which would usually be some function wrapping the big loop; that should be basically free. This is the same situation as on x86-64 if you want to target pre-AVX2 / post-AVX2 / AVX-512.


Just a thought, but would it be possible to hot patch at the time of loading the binary? I realise it might require updates to the binary format, but it might be very well justified.


This sounds similar to shipping all these routines in dynamic libraries and loading the right one at runtime. I'm not sure if the memory cost of having all these duplicate simd kernels is a big deal though.


Ah, that makes sense - if you have complete control over your hardware it could work, but with open source projects and businesses with a wide customer base it might not.

Compiled languages like Rust, C++ and Zig cannot detect the hardware because they have no runtime, right? Could a language like Go add these SIMD semantics and detect the supported vector size?


The problem isn't detecting the width (that's trivially possible at runtime with a single instruction, though both SVE and RVV have a way to write loops such that you don't even need to).

The problem is that a "Simd<i32, 4>" will always have 4 elements, but you'd need a "Simd<i32, whatever the hardware has>" type, which has significant impact on what is possible to do with such a type.
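To illustrate the loop style mentioned above, here's a sketch using the RVV C intrinsics (v1.0 naming; add_i32 is a made-up name):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    // vsetvl returns how many elements this iteration processes, so the
    // same binary handles 128-bit and 1024-bit vector implementations.
    void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
      while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        __riscv_vse32_v_i32m1(out, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
        a += vl; b += vl; out += vl; n -= vl;
      }
    }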


Ah, thank you for clarifying. So you would have to create an abstraction layer on top of the current SIMD implementation, like for example simd_vector(type, size). That abstraction would have to dynamically detect the hardware and dispatch to it, like the project you shared (https://github.com/google/highway).

So technically it sounds feasible, but all of the languages like Zig, C++ and Rust picked a simpler approach. Is it simply a first step towards a more abstract approach?


Not really - you don't need to dispatch anything, the idea is that the same code (and thus the same assembly/machine code) can operate on different sizes by itself. e.g. with RVV "vsetvli x0,x0,e32,m1,ta,ma; vadd.vv v0, v1, v2" on a CPU with 128-bit vectors will do 4 element additions, but on a CPU with 1024-bit vectors it'll do 32 additions.

And some things you just can't really "generalize" to scalable vectors. e.g. you can store Simd<i32,4> in a struct or global variables, or initialize with, say, [3,2,1,0], but none of those things are possible with scalable vectors (globals/struct fields need a known size, and initializing with a hard-coded list of elements doesn't make much sense if you don't even know how many elements you'll need).


C++ comes with a runtime which, among many other things, allows you to detect the microarchitecture and feature set of the environment you're running on using `__builtin_cpu_init()`, which calls a dynamically linked function `__cpu_indicator_init()`. Then, using the `cpu_dispatch`, `target`, or `target_clones` attributes, you can compile multiple variations of an algorithm in your program and dynamically select the one to execute. This is referred to as a "fat binary", and the feature is "multifunctions" or "multiversioned functions".
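For instance, with GCC/Clang the target_clones attribute alone gets you a multiversioned function (a small sketch; Scale is an arbitrary name):

    // The compiler emits one clone per listed target plus an ifunc
    // resolver that picks the best one at load time.
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void Scale(float *x, float factor, int len) {
      for (int i = 0; i < len; i++) x[i] *= factor;  // vectorized per clone
    }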

Zig intends to support a similar feature but doesn't yet, at least not built into the language (you could certainly express this if you tried hard enough). I don't know about Rust, but I would be very surprised if it can't do this.

edit: I think I replied to the wrong comment >.<


And C++:

https://en.cppreference.com/w/cpp/experimental/simd

This proposal has been around for a while, but it recently got some new momentum and seems to be on track for C++26. GCC ships a version today for those wanting to try it.
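To give a flavor of the API as shipped in GCC's libstdc++ (a hedged sketch; note that native_simd's lane count is fixed per compile-time target, Scale is a made-up name):

    #include <cstddef>
    #include <experimental/simd>

    namespace stdx = std::experimental;

    void Scale(float* x, std::size_t n, float factor) {
      using V = stdx::native_simd<float>;
      std::size_t i = 0;
      for (; i + V::size() <= n; i += V::size()) {
        V v(&x[i], stdx::element_aligned);  // load
        v *= factor;
        v.copy_to(&x[i], stdx::element_aligned);  // store
      }
      for (; i < n; ++i) x[i] *= factor;  // scalar tail
    }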


I'm curious: what is the new momentum?

It's difficult to understand the value proposition of std::experimental::simd. Standardization has taken many years, lost the "load/store" function naming which just about everyone everywhere(?) is using, and only resulted in ~30 ops [1] that are also mostly achievable with compiler builtins or perhaps even autovectorization.

Four years ago, we stepped in with Highway to fill the gap; it now has about 250 ops [2] and an active community. I have not wanted to step on any toes, but wonder whether the ship has by now sailed on this?

[1]: https://en.cppreference.com/w/cpp/header/experimental/simd

[2]: https://github.com/google/highway/blob/master/g3doc/quick_re...


Late comment, just saw this.

Not sure where the momentum is coming from. The proposal has a new maintainer now, new drafts have been released, and there's a stated goal to get this included in C++26.

For me, the appeal for experimental/simd would be it being part of the stdlib. Have you considered proposing Highway for stdlib inclusion? From your description it appears to be a more complete implementation. I realize it's a long and arduous process, but I think it's the only way to make something truly the default implementation.

(And thanks for the link. I was not aware that Highway existed.)


Interesting, thanks for sharing :)

At the time we open-sourced Highway, the standardization process had already started and there were some discussions.

I'm curious why stdlib is the only path you see to default? Compare the activity level of https://github.com/VcDevel/std-simd vs https://github.com/google/highway. As to open-source usage, after years of std::experimental, I see <200 search hits [1], vs >400 for Highway [2], even after excluding several library users.

But that aside, I'm not convinced standardization is the best path for a SIMD library. We and external users extend Highway on a weekly basis as new use cases arise. What if we deferred those changes to 3-monthly meetings, or had to wait for one meeting per WD, CD, (FCD), DIS, (FDIS) stage before it's standardized? Standardization seems more useful for rarely-changing things.

1: https://sourcegraph.com/search?q=context:global+std::experim...

2: https://sourcegraph.com/search?q=context:global+HWY_NAMESPAC...


Highway does seem like the obvious choice for a C++ SIMD abstraction layer!


Neat project!

However, I'm pretty sure OpenCV has its "universal intrinsics", and RISC-V with scalable vector registers is supported in the latest OpenCV version.

Universal intrinsics (docs not updated): https://docs.opencv.org/4.x/d6/dd1/tutorial_univ_intrin.html

Scalable RVV support: https://github.com/opencv/opencv/pull/22179


This might be asking a lot, but what’s the Little Schemer for vector processing? A fun on-ramp.


https://en.algorithmica.org/hpc/ and http://0x80.pl/ have some stuff about this, but the latter can be dense. I've had fun getting my hands dirty with some problems at https://highload.fun/ but there's not much direction unless you go to the telegram chat and ask people questions.


This doesn't seem to be upstreamed yet.

I hope they have real hardware performance numbers for the RISC-V Summit talk.


SIMDe maintainer here, I would welcome a PR; yes!


The paper suggests FFmpeg uses intrinsics, which is not correct.

There have been many SIMD abstraction layers created in the past, but none of them will beat the raw speed of handwritten assembly. Try to implement something like vpternlogd in one of these abstraction layers.


I still think SIMD helper libraries (like xsimd or Highway) have some good use in numerical computation and graphics, since you have so many complex equations to optimize that it's basically unrealistic to write all of it in assembly. And it's much better in terms of readability: xsimd has lots of operator overloading built in, so you can still get readable math equations in SIMD code. Even if you get up to 80% of the achievable performance of assembly, it's still a much bigger improvement than plain scalar code or relying on auto-vectorization. (And if that isn't enough, you can start optimizing in assembly for only the most frequently used functions.)
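As a taste of that readability (a sketch; EvalPoly is a made-up name, xsimd::batch is the library's vector type):

    #include <xsimd/xsimd.hpp>

    // Operator overloads let the vector code read like the scalar math.
    xsimd::batch<float> EvalPoly(xsimd::batch<float> x) {
      return (0.5f * x + 2.0f) * x + 1.0f;  // Horner form, all SIMD
    }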


We actually do use ternlog in several places in Highway :) Whenever we want to use a new immediate arg, we add new ops such as Not, Xor3, Or3, OrAnd, IfVecThenElse that also do something reasonable on other platforms.
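For example (a sketch; Xor3 is a real Highway op, ParityOfThree is a made-up wrapper):

    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    // Xor3(a, b, c) computes a ^ b ^ c; on AVX-512 this can lower to a
    // single vpternlogd, on other targets to two XORs.
    template <class V>
    V ParityOfThree(V a, V b, V c) {
      return hn::Xor3(a, b, c);
    }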

BTW this reminds me of a colleague grumbling that what should have been a 20-minute patch to ffmpeg took a day, because it was written in assembly.

It is also quite possible to have large slowdowns due to assembly - all it takes is to forget a v prefix (VEX encoding), whereas intrinsics take care of that.


Do you actually implement all permutations of vpternlogd?

The lightweight macro layer in ffmpeg takes care of v prefixes.

In FFmpeg, x264 and dav1d there are many different examples of code that couldn't be written in intrinsics or any other abstraction layer.

https://twitter.com/FFmpeg/status/1705543447245988245?t=Ul9e...


As mentioned, we implement what applications are using/requesting. Do we know how many permutations are used in ffmpeg?

hm, I vaguely remember there was a vzeroupper problem, perhaps one fell through the cracks.

Interesting, can you share more details on the magic? Looks mainly like function call overhead. If functions aren't called often, we can inline (by moving into headers or enabling LTCG/LTO).

If they are called often, are visible to the compiler, have internal linkage, but shouldn't be inlined, I'd be curious to learn why, and also why the compiler is then generating the full prolog/epilog.


Yeah, I think non-asm users have to use inlining to avoid clobbering all the vector registers in the SysV ABI. I haven't really encountered cases where avoiding inlining is super important, though.


The main abstraction of intrinsics is register allocation, right? Is there anything else that can be gained by handwritten asm?


If you are trying to maximize port utilization, sometimes the exact instruction ordering can make a big difference. Compilers often want to "hoist" loads to the top, which can sometimes reduce performance. And sometimes you need a particular addressing mode to avoid a bottleneck.

In general, kierank is right: if you want to fully optimize something, and you know what you want the actual code to look like, just write it in assembly. Nothing else gives you full control over loads and stores, and anything else leaves you at the mercy of some future compiler "optimization" stepping in to defeat you.


> If you are trying to maximize port utilization, sometimes the exact instruction ordering can make a big difference.

Interesting. I thought you'd be at the mercy of the instruction scheduler for aggressive OoO cores.


I think so. For the most part I'm happy to have the compiler rearrange my intrinsics and decide where the spills should go (inevitably you have spills if you unroll the loop enough times to avoid stalls, because these instructions tend to be like "you can have 6 in flight at a time but that would take 18 registers lol"). The main problem I've met here is that the compiler will emit useless instructions to narrow or widen integers in general-purpose registers (e.g. the result of movemask), but this can be solved by looking at the generated assembly and fixing the high-level code.


Custom function ABIs. Though the maintenance overhead really isn't worth the reduced cache footprint, especially since asm writers are allergic to leaving behind any comments.

Also instruction scheduling. Low-end Cortex will probably be in-order till the end of time...


For rvv specifically there are a few things that aren't possible using the intrinsics abstraction.

E.g. in asm you can run the same instruction sequence with different vtype (element width and LMUL).


We are able to do the same with Highway's RVV :)


I believe what camel-cdr is saying is being able to run the same code without duplication (say, a loop which has no vsetvl-s inside) by conditionally choosing either an initial "vsetvli x0,x0,e32,m1" or "vsetvli x0,x0,e16,m1" or "vsetvli x0,x0,e32,m2" etc., which is just unachievable with the intrinsics, as they hard-code vtype in each intrinsic.

It's an extremely fun idea (primarily just for code size though), but thankfully (?) its usability is restricted by load/store instructions hard-coding the element type, so the main use of this would end up being for switching LMUL, which has very limited usefulness.

What Highway can support is generating multiple loops of different vtype from the same code, which effectively achieves the same thing, at the cost of machine code duplication.


> for switching LMUL, which has very limited usefulness

I currently have a quite useful use case for it: I'm converting UTF-8 to UTF-32, and if I've got an average UTF-8 character size of above 2 I could reduce the LMUL for that loop iteration.

This shouldn't actually improve performance that much in good RVV implementations, since you can use vl and not LMUL to schedule your execution units. Sadly this is currently not the standard, and Ara is the only implementation I know of that does this.

I think this wouldn't even be about code size reduction. Consider an input where there is basically a 50/50 probability that LMUL can be reduced: that would be horrible for the branch predictor, but with only a branch over the vsetvl, this could behave as a conditional vsetvl via instruction fusion. We'll have to see if such optimizations become relevant once there is more hardware out there.


That's an interesting use-case, though I wouldn't be surprised if some implementations really wouldn't like LMUL switching dynamically at runtime (i.e. something like LMUL being forwarded at decode time, so the CPU couldn't decode past an unknown-LMUL vsetvl, ruining perf).


Oh, I see, thanks for clarifying. Yes, I was referring only to "same source code" and agree our approach would generate multiple copies of the instructions.


It's technically correct; FFmpeg has a tiny amount of NEON intrinsics for no particularly good reason. (well, if it had a lot the good reason would have been to avoid writing everything twice between A32 and A64...)

Despite all the other comments, this doesn't appear to be intended to be used to write SIMD across multiple platforms? Rather, it's to quickly port codebases with lots of existing platform-specific intrinsics to a new platform? For this paper in particular, so that RISC-V can run somewhat optimized code without having to spend thousands of man-years writing new RVV code.


There are so many SIMD libraries nowadays.

I myself implemented one in the SSE4/Altivec days (later extended to AVX, AVX512 and NEON). There were only a few options then, but now everyone seems to be doing it.


and hug of death


We need to do better than ISA-specific intrinsics.

There should be a simd.h header in the C standard library that contains typedefs for vector types, various functions to operate on them, as well as operators for them.

Like my _Operator <symbol> <function name>; proposal, which requires no mangling.
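For what it's worth, GCC/Clang's vector extensions already hint at what such a header could feel like (an illustration of the existing compiler extension, not the proposal itself):

    /* Portable vector typedefs and built-in operators, no ISA intrinsics. */
    typedef int v4si __attribute__((vector_size(16)));

    v4si add(v4si a, v4si b) {
      return a + b;  /* compiles to a single SIMD add where available */
    }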


RISC-V is rapidly building the strongest ecosystem.



