Expressive Vector Engine – SIMD in C++ (github.com/jfalcou)
76 points by klaussilveira 21 days ago | 36 comments



Interesting library, but I see it falls into the trap of almost all SIMD libraries: it hardcodes the vector target completely, so you can't mix and match feature levels within a single build. The documentation recommends writing your kernels into DLLs and dynamically loading them, which is a huge mess: https://jfalcou.github.io/eve/multiarch.html

Meanwhile xsimd (https://github.com/xtensor-stack/xsimd) has the feature level as a template parameter on its vector objects, which lets you branch at runtime between SIMD levels as you wish. I find it's a far better way of doing things if you actually want to ship SIMD code to users.


100% agreed. This is the main reason ISPC is my go-to tool for explicit vectorization.


+1, dynamic dispatch is important. Our Highway library has extensive support for this.

Detailed intro by kfjahnke here: https://github.com/kfjahnke/zimt/blob/multi_isa/examples/mul...


Thanks, that's an important caveat!

> Meanwhile xsimd (https://github.com/xtensor-stack/xsimd) has the feature level as a template parameter on its vector objects

That's pretty cool because you can write function templates and instantiate different versions that you can select at runtime.


Yeah, that's the fun of it: you write your kernel/function so that the SIMD level is a template parameter, and then you can use simple branching like:

if(supports<avx512>){ myAlgo<avx512>(); } else{ myAlgo<avx>(); }

I've also used it for benchmarking, to see whether my code scales well to different SIMD widths, and it's a huge help.


FYI: you don't want to do this as written. `supports<avx512>` is an expensive check. You really want to cache the result in a static.


I guess this was just pseudo-code. Of course you don't want to do a runtime feature check over and over again.


Our answer to this is dynamic dispatch. If you want multiple versions of the same kernel compiled, compile multiple DLLs.

The big problem here is: ODR violations. We really didn't want to do the xsimd thing of forcing the user to pass an arch everywhere.

Also, that kind of defeats the purpose of "SIMD portability": any code written against AVX2 can't work on an ARM platform.

eve just works everywhere.

Example: https://godbolt.org/z/bEGd7Tnb3


It is possible to avoid ODR violations :) We put the per-target code into unique namespaces, and export a function pointer to them.


You can do many things with macros and inline namespaces, but I believe they run into problems when modules come into play. Can you compile the same code twice, with different flags, with modules?


We use pragma target instead of compiler flags :)


I don't think we understand each other.

We want to take one function and compile it twice:

```
namespace MEGA_MACRO {

void foo(std::span<int> s) { super_awesome_platform_specific_thing(s); }

} // namespace MEGA_MACRO
```

Whatever you do - the code above has to be written once but compiled twice. In one file/in many files - doesn't matter.

My point is - I don't think you can compile that code twice if you support modules.


I think I do understand, this is exactly what we do. (MEGA_MACRO == HWY_NAMESPACE)

Then we have a table of function pointers to &AVX2::foo, &AVX3::foo etc. As long as the module exports one single thing, which either calls into or exports this table, I do not see how it is incompatible with building your project using modules enabled?

(The way we compile the code twice is to re-include our source file, taking care that only the SIMD parts are actually seen by the compiler, and stuff like the module exports would only be compiled once.)


> is to re-include our source file

Yeah - that means your source file is never a module. We would really like eve to be modularized; the CI times are unbearable.

I'd love to be proven wrong here, that'd be amazing. But I don't think google highway can be modularized.


What leads you to that conclusion? It is still possible to use #include in module implementations. We can use that to make the module implementation look like your example.

Thus it ought to be possible, though I have not yet tried it.


Well.

You have a file, something like: load.h

You need to include it multiple times, compiled with different flags.

So - it's never going to be in load.cxx or whatever that's called.


As mentioned ("re-include our source file"), we are indeed able to put the SIMD code, as well as the self-#include of itself, in a load.cxx TU.

Here is an example: https://github.com/google/gemma.cpp/blob/9dfe2a76be63bcfe679...


I don't think this works if your files are modules.

Let's stop here, it doesn't seem like we understand each other.


Since you seem knowledgeable about this, what does this do differently from other SIMD libraries like xsimd / highway? Is it the addition of algorithms similar to the STD library that are explicitly SIMD optimized?


The algorithms I tried to make as good as I knew how - maybe 95% there. Nice tail handling, a lot of things supported. I like our interface over the alternatives, but I'm biased here. Really massive math library.


EVE is personally my favorite SIMD library in any programming language. It's the only one I've tried that provides masked lane operations in a declarative style, aside from SPMD languages like CUDA or OpenMP. The [] syntax for that is admittedly pretty exotic C++, but I think the usefulness of the feature is worth it. I wish the documentation was better, though. When I first started, I struggled to figure out how to simply make a 4-lane float vector that I can pass into shaders, because almost all of the examples are written for the "wide" native-SIMD size.


This library's eve::soa_vector is the first attempt I've seen at dealing with the "SOA problem," which is that if you write good, parallel-friendly code, all your types go to hell and never come back because the language can't express concepts like "my object is made from element 7 of each of these 6 pointers." Instead you write really FORTRAN-looking array processing code with no types or methods in sight.

Does anyone know of other libraries that help a C++ programmer deal with struct-of-arrays?
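For concreteness, here is a hand-rolled sketch of the struct-of-arrays idea (illustrative only; eve::soa_vector and generators like soagen produce this kind of thing for you): each field stays contiguous and SIMD-friendly, while a proxy type gives back something object-like per index.

```cpp
#include <cstddef>
#include <vector>

struct Particles {
  // SoA layout: one contiguous array per field.
  std::vector<float> x, y, vx, vy;

  // Proxy "object": element i viewed across all four arrays -
  // the "element 7 of each of these pointers" concept made into a type.
  struct Ref { float &x, &y, &vx, &vy; };
  Ref operator[](std::size_t i) { return {x[i], y[i], vx[i], vy[i]}; }

  void push(float px, float py, float pvx, float pvy) {
    x.push_back(px);  y.push_back(py);
    vx.push_back(pvx); vy.push_back(pvy);
  }
};
```

Writes through the proxy land in the underlying arrays, so per-field loops stay vectorizable while call sites keep an object-style interface.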


cppcast talked about soagen https://cppcast.com/soagen/ I didn't look into it too much.


Thank you!


I also found this looking for portable SIMD:

https://github.com/google/highway


Hi!

Thanks for your interest in the library.

Here is a godbolt example: https://godbolt.org/z/bEGd7Tnb3 Here is a bunch of simple examples: https://github.com/jfalcou/eve/blob/fb093a0553d25bb8114f1396...

I personally think we have the following strengths:

* Algorithms. Writing SIMD loops is very hard. We give you a lot of ready-to-go loops (find, search, remove, set_intersection, to name a few).

* zip and SOA support out of the box.

* High quality codegen. I haven't seen other libraries care about unrolling/aligning data accesses - meanwhile these give you substantial improvements.

* Supporting more than transform/reduce. We have a really decent compress implemented for sse/avx/neon, for example.

The following weaknesses:

* We don't support runtime sized sve/rvv (only fixed size). We tried really hard, but unfortunately just the C++ language refuses to play ball there. Here is a discussion about that https://stackoverflow.com/questions/73210512/arm-sve-wrappin...

If this is something you need, we recommend compiling a few dynamic libraries with the correct fixed lengths. Google Highway manages to pull it off, but the trade-off is a variadics interface that I personally find very difficult.

* Runtime dispatch based on arch.

We again recommend DLLs for this. The problem here is ODR. I believe there is a solution based on the preprocessor and namespaces I could use, but it breaks as soon as modules become a thing. So in the module world we don't have an option. I'm happy for suggestions.

* No MSVC support

C++20 on MSVC is still not enough of a thing. And each new version breaks something that was already working. Sad times.

* Just tricky to get started.

I don't know what to do about that. I'm happy to just write examples for people. If you want to try the library, please create an issue/discussion or something - I'm happy to take some time and try to solve your case.

We talked about the library at CppCon: https://youtu.be/WZGNCPBMInI?si=buFteQB1e1vXRT5M

If you want to learn how SIMD algorithms work, here are a couple of talks I gave: https://youtu.be/PHZRTv3erlA?si=b87DBYMDskvzYcq1 https://youtu.be/vGcH40rkLdA?si=WL2e5gYQ7pSie9bd

Feel free to ask any questions.


> Google Highway manage to pull it off but the trade off is a variadics interface that I personally find very difficult.

I'm curious what you mean by 'variadics', and what exactly you find difficult?

People new to Highway are often surprised by the d/tag argument to loads that says whether to load a half/full vector, or no more than 4 elements, etc. The key is to understand that these are just zero-sized structs used for type information, and are not the actual vector/data. After that, I observe introductory workshop participants are able to get started and become productive quickly.


I struggle to read the highway documentation, it focuses on things that are unrelated to me. So sorry if I'm wrong.

Let me write the std::ranges code and ask you to write them with highway.

https://godbolt.org/z/3s1b8P3sj

PS: this is how it looks in eve: https://godbolt.org/z/Kzxqqdrez


Thanks for sharing :) Any thoughts on what kind of things you are looking for and didn't find?

I cannot recall anyone saying this kind of thing is a bottleneck for them. We don't use std::ranges, but searching for a negative value can look like: https://gcc.godbolt.org/z/8bbb16Eea

It looks like smaller codegen than EVE's https://godbolt.org/z/fEn9r175v?


Thanks for this example.

Can you write the second one too? With two ranges? That's where I believe the variadics will be.

FYI: the codegen is smaller because the loop is not unrolled. That's 2x slower in my measurements. Plus, at least I don't see any aligning of memory accesses; that'd give you another third of an improvement when the data is in L1. You really should fix that.


We have a different philosophy: not supporting/encouraging needlessly SIMD-hostile software. We assume users properly allocate their data, for example using the allocator we provide. It is easy to deal with 2K aliasing in the allocator, but much harder later. At least in my opinion, this seems like a better path than penalizing all users with unnecessary (re)alignment code.

We have not added a FindIf for two ranges because no one has yet requested that or mentioned it is time-critical for their use cases.


That definitely doesn't apply to unrolling.

In eve the ability to zip ranges is fundamental and very important.


I agree unrolling is usually helpful :)


Wait what about AMD? They only claim support for intel and arm


AMD we support pretty well. I tested Zen 1 and, a bit, Zen 4.


« AMD » is x86



