
Just looking at a snippet of code:

    // Assumes zmm is bitonic and performs a recursive half cleaner
    template <typename vtype, typename zmm_t = typename vtype::zmm_t>
    X86_SIMD_SORT_INLINE zmm_t bitonic_merge_zmm_64bit(zmm_t zmm) {
      // 1) half_cleaner[8]: compare 0-4, 1-5, 2-6, 3-7
      zmm = cmp_merge<vtype>(
        zmm, vtype::permutexvar(_mm512_set_epi64(NETWORK_64BIT_4), zmm), 0xF0);
To me, this is why AVX didn't get widespread use. The code is practically hieroglyphics.

If you want humans to use a performance feature, it needs to be easy to use. Preferably fully integrated into the compiler so that you don't need to be aware of it at all.

The level of abstraction is wrong: when they invent AVX1024, recompiling this code won't make use of it. Yet the code itself is probably harder to understand and reason about than just a few lines of assembly code.




> To me, this is why AVX didn't get widespread use

AVX has seen very, very widespread use across mobile, desktop, gaming, and server platforms. It's over 10 years old. Probably every hand-optimized vectorized x86 routine in the past 10 years has seen AVX thrown at it. Not every programmer can write it, but it sees absolutely tons of widespread usage and can be written by many competent programmers.

> If you want humans to use a performance feature, it needs to be easy to use.

In practice for general purpose "messy" compute kernels (e.g. parse this JSON/CBOR bullshit with SIMD) there is still a very wide gap between hand-written intrinsic code and what the compiler can generate. Most compilers in popular languages don't have the leeway to automatically perform the necessary optimizations/"setup" that a human must perform to fully exploit these tools, so this is not an easy fight to win.

For limited domains, and with the correct semantic design, however, you can match or exceed human performance with high-level semantics, e.g. Halide.
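To make that concrete, here's a rough sketch of the Halide flavor of this (my own toy example; the names `doubled`/`input` are illustrative, and the exact API is worth checking against the Halide tutorials). You state what to compute, separately state how to map it onto vectors, and the compiler emits AVX/NEON for whatever target you ask for:

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam input(Int(32), 1);   // a 1-D buffer of int32, bound at call time
        Func doubled;
        Var x;

        // Algorithm: what to compute.
        doubled(x) = input(x) * 2;

        // Schedule: how to compute it, here 16-wide vectors on the inner dimension.
        doubled.vectorize(x, 16);

        // Ahead-of-time compile to a C-callable object plus header.
        doubled.compile_to_static_library("doubled", {input}, "doubled");
        return 0;
    }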

> Yet the code itself is probably harder to understand and reason about than just a few lines of assembly code.

It's really not. Hacker News sucks shit to read code on, but the above code is in no way harder or worse than a raw assembly routine using AVX instructions. If anything it's less error prone, because the compiler takes care of a ton of incidental but necessary drudgery too, such as handling PIC (something 99% of people forget immediately) and eliminating the need for an outlined function. Not to mention you can leave the compiler to do all the annoying shit like sinking or coalescing stores/loads along the codepath, keeping DWARF/debug information accurate without things breaking, etc.


But their point is that

>every hand-optimized vectorized x86 routine

is approximately 0 percent of code.


Approximately 0 percent of the code needs optimization.

But very often that approximately 0 percent of the code is 90+% of the runtime.


Sure, but OP's point is that that tiny fraction of code is (probably, I'm just guessing) responsible for a disproportionately large fraction of the time spent actually running work. AVX is AFAIK mostly used in hand-rolled, low-level library code that is then used by a whole lot of consumers.

I know I've seen pretty huge speedups in my own code for "free" just from switching to an AVX version of BLAS. You can just think about how many different programs use BLAS (which is itself highly arcane internally), and AVX is definitely in a ton of other low level libraries out there.


Hieroglyphs or not, with some human cleverness you can achieve pretty amazing performance.

It's not that bad if you just study it a bit and put your mind to it. It's just a sequence of relatively simple operations (plus masking when you need a partial operation), data shuffling, loads and stores.
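As a small illustration of the "masking for a partial operation" part (my own toy example, not from the sorting code): the ragged tail of a loop becomes one masked load/store instead of a scalar cleanup loop.

    #include <immintrin.h>
    #include <stddef.h>

    // out[i] = a[i] + b[i] with AVX-512; the final partial iteration is
    // handled with a lane mask instead of a scalar tail loop.
    void add_arrays(const double *a, const double *b, double *out, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);
            __m512d vb = _mm512_loadu_pd(b + i);
            _mm512_storeu_pd(out + i, _mm512_add_pd(va, vb));
        }
        if (i < n) {
            __mmask8 m = (__mmask8)((1u << (n - i)) - 1); // only the remaining lanes
            __m512d va = _mm512_maskz_loadu_pd(m, a + i);
            __m512d vb = _mm512_maskz_loadu_pd(m, b + i);
            _mm512_mask_storeu_pd(out + i, m, _mm512_add_pd(va, vb));
        }
    }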

Unfortunately to extract all of that performance, you do need to know your target architecture, be it SSE, AVX2/512, NEON, SVE or whatever.

You might also need to know some architectural details, like register renaming, out of order execution, branch prediction, load/store buffers, automatic prefetching, cache architecture, sometimes even details like CPU internal ring bus. That's just how it is when you want to extract the last drop of performance.

Sometimes compilers can be pretty great, just not always. Unfortunately, code that was beautifully autovectorized can stop being so because of some small change somewhere.

Luckily most of us don't need to care. Most people can just use optimized libraries instead.


> Hieroglyphs or not, with some human cleverness you can achieve pretty amazing performance.

I'll never forget hand-optimizing some DNA analysis code into assembly and hitting the theoretical performance ceiling, namely the memory throughput between the CPU and main RAM, on a server with 800GB of RAM.

Taking something that took minutes and turning it into something that takes seconds is a flipping amazing feeling.

Could it get even faster? Probably, but that would require very significant data structure changes with an algorithmic change to match. Would it be worth it? Well, probably when the database reaches multiple TB in size. But I've since moved on from that company so it's someone else's problem.


It's often about memory bandwidth nowadays. That and locality of reference, random access is expensive. You can compute a ton for the cost of one random memory access.


Tangential question (purely curious): how do you validate that your code produces the same (presumably correct) results as the non-optimized (presumably HLL) code for something as complex and variable as DNA analysis? I run into an analogous concern when people talk about "we need to rewrite this exhaustively tested, decades-proven Fortran 77 code in (whatever the new hotness in PLT is)".


AVX512 got far, far, FAR easier to read and write than previous iterations. Lots of "missing" operations were added, and the set seems designed much more with autovectorizing compiler passes in mind.

Still, yes, it's difficult to grok because it's a low-level vector assembler, and like any low-level assembler, it can be hard to follow at first.

AVX512 also has amazing features like mask registers, gfni, bit extract/compress, vpternlog, vpopcount and so on; once you've grokked them, you start seeing vector code as even more magic.
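Two small examples of those features (my own illustrations, not from any particular codebase): vpternlog collapses any three-input bitwise expression into a single instruction, and mask registers let a compare directly predicate the next operation.

    #include <immintrin.h>

    // Any 3-input boolean function per bit, chosen by an 8-bit truth table.
    // 0xCA encodes "mask ? b : c", which would otherwise be and/andnot/or.
    static inline __m512i bit_select(__m512i mask, __m512i b, __m512i c) {
        return _mm512_ternarylogic_epi64(mask, b, c, 0xCA);
    }

    // Mask registers: the compare yields a k-register that predicates the blend.
    // Here: replace negative 64-bit lanes with zero.
    static inline __m512i clamp_negatives(__m512i v) {
        __mmask8 neg = _mm512_cmplt_epi64_mask(v, _mm512_setzero_si512());
        return _mm512_mask_mov_epi64(v, neg, _mm512_setzero_si512());
    }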

My main gripe with AVX512 is that it's still hard to get amazing performance because of memory bandwidth, unoptimized streaming issues, cache locality issues. You can write amazing compact avx512 that will still have poor performance because it's all about feeding the FMA units.

But if you want to see a simpler high-level language that generates fast code for AVX512 targets, check out ISPC. I've had very nice successes with it in the past, writing idiomatic code and getting better performance than my shitty experiments with intrinsics. It's not for every kind of code, but when it fits, it's just nice. And ISPC generates C-callable objects/libraries, so it's relatively easy to integrate into a build pipeline.


The way I see it, you could argue for, say, wing nuts to be used everywhere because they can be used tool-free, but then you can't tighten them as much, can't put them in confined spaces, etc. AVX is meant to be used through a specialist tool interface that has knowledge of the lower-level details and is made by someone else who adds a simple handle for you; Arrays.sort is the example here.


Luckily for us, we don't need to actually understand this C++ code to use AVX 512 for faster sorting. We can just do

    Arrays.sort(arr)
in Java!


But then people say it's too verbose. Maybe it needs to be:

A.s(arr)


arr.sort() or sort(arr) seem reasonable to me


You can static import Arrays.sort, so the second option could work.


You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.

I do agree that it is the wrong level of abstraction. Explicitly stating the SIMD width leads to a compatibility nightmare. RISC-V's vector extension instead uses a runtime vector-length register (set via vsetvl), which pretty much entirely solves this problem.
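For the compiler to hold up its end of that bargain, the loop has to be written so the vectorizer can prove it's safe. Something like this generic example (nothing project-specific) gets vectorized by GCC/Clang at -O3 with the right -march:

    // No aliasing (__restrict), unit stride, no data-dependent branches:
    // gcc/clang -O3 -march=skylake-avx512 will emit AVX-512 for this loop.
    void scale(float *__restrict out, const float *__restrict in, float k, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }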


> You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.

Take it from experience: sure you can write high-level code that is SIMD-compatible. But the compiler is garbage at understanding the semantics and will write terrible SIMD code.


The best thing a current compiler can provide is probably replacing intrinsics with more conventional-looking things like GCC's Vector extensions[1] and C++'s simd<T>. Even then you'd need to do a little bit of union work for the cooler operations.

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
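A quick sketch of what those extensions look like (generic example, not tied to any project): the type carries the width, ordinary operators do the element-wise work, and the compiler picks the SSE/AVX/NEON encoding for the target.

    // GCC/Clang vector extensions: no intrinsics, still explicit SIMD types.
    typedef float v8sf __attribute__((vector_size(32)));  // 8 floats = 256 bits

    v8sf fma8(v8sf a, v8sf b, v8sf c) {
        return a * b + c;  // element-wise; lowered to vmulps/vaddps or an FMA
    }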


The union work turns out to be a lot because compiler extensions and simd<> barely support anything beyond operators.

Examples of what's missing: interleaved load/store, compress/expand, software AES/CLMUL, popcount, lzcnt, saturated add/sub, 128-bit compare/minmax, fixed-point mul, mask find/set, masked load/store, scatter/gather, reductions.

Highway supports those (and >200 operations in total) on all platforms.
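For flavor, here's a rough Highway sketch of one of those ops, saturated add, using static dispatch. Treat it as approximate; the exact op names are best checked against Highway's quick_reference.

    #include <cstddef>
    #include <cstdint>
    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    // out[i] = saturating(a[i] + b[i]); vector width is whatever the target offers.
    void SaturatedAddU8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
        const hn::ScalableTag<uint8_t> d;
        const size_t N = hn::Lanes(d);
        size_t i = 0;
        for (; i + N <= n; i += N) {
            const auto va = hn::LoadU(d, a + i);
            const auto vb = hn::LoadU(d, b + i);
            hn::StoreU(hn::SaturatedAdd(va, vb), d, out + i);
        }
        // Tail handling omitted for brevity.
    }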


I 100% agree that there's a lot of room for improvement here.


Compilers will always be terrible at vectorizing code because the required transformation is architectural. It would require the compiler to understand your code well enough to replace the scalar algorithms and data structures with new ones that are semantically equivalent in all contexts with all necessary invariants preserved (e.g. how memory is organized). The code transformation would be very non-local.

Compilers can't generally rewrite your scalar code as vector code for the same reason they can't rewrite your b-tree as a skip list.


hm, that seems optimistic for this use-case. I heard from a compiler engineer that autovectorizing a sort (which is full of permutes/shuffles) is much harder and is likely to remain infeasible for quite some time.


GPUs have a crossbar that allows for high speed lane-to-lane permutes and bpermutes, but it's still slow compared to butterfly shuffles.

I do believe that compilers can optimize any movement pattern into the right butterfly shuffles (not today in the general case; modern CUDA compilers are impressive, but this is a hard problem), but I'm convinced that the programmer needs to be aware of the low-level, difficult nature of many-to-many data movements on a 16-wide AVX512 register, or a 32-wide GPU block / warp / wavefront.

--------

EDIT: I'm like 90% sure some dude at Bell Labs in the 1950s working on Clos or Beneš network design probably has an efficient representation for many-to-many data shuffles on a parallel architecture. But I'm not Ph.D. enough to have read all those papers or keep up with those old designs.

Many-to-many data movement is traditionally a networking and routing problem. But SIMD programmers and SIMD chip designers are starting to run up against it... because a ton of parallel programming is about efficient movement of data between conceptual lanes and/or threads.
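On the CPU side, the same distinction shows up inside AVX-512 itself. A crude illustration (my own, not from the sorting code): an arbitrary any-to-any permute versus one fixed butterfly-style exchange of the kind a bitonic network is built from.

    #include <immintrin.h>

    // Arbitrary "crossbar" permute: any 64-bit lane to any lane, driven by idx.
    static inline __m512i any_to_any(__m512i v, __m512i idx) {
        return _mm512_permutexvar_epi64(idx, v);
    }

    // Fixed butterfly-style exchange at distance 4: swap the two 256-bit halves.
    static inline __m512i swap_halves(__m512i v) {
        return _mm512_shuffle_i64x2(v, v, _MM_SHUFFLE(1, 0, 3, 2));
    }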


But this is exactly that: Source code for a compiler integrating SIMD instructions into an intrinsic function.


It doesn't have to be as verbose and cryptic as intrinsics; wrapper functions can help a lot :) Is this more readable? https://github.com/google/highway/blob/master/hwy/contrib/so...


Erm... No it isn't.

Have you ever written bitonic sort? It's not an easy algorithm to do quickly.

This seems to have implemented bitonic sort in just a few short primitives you can look up at https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

--------

How would you write a fixed 16-element bitonic sort? In Python or whatever?

I dunno, the recursive template seems brilliant to me. Bitonic sort is innately a recursive process, but here we get compile-time recursion and optimal assembly code at the end...
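For reference, this is roughly the scalar, textbook form of the network (generic code, not the library's). Each inner pass is a set of compare-exchanges at some fixed distance; the AVX-512 version replaces each pass with a permute plus min/max over whole registers, which is exactly what the snippet at the top is doing.

    #include <algorithm>
    #include <cstddef>

    // Textbook bitonic sorting network; n must be a power of two (e.g. 16).
    void bitonic_sort(int *a, std::size_t n) {
        for (std::size_t k = 2; k <= n; k *= 2) {         // size of bitonic runs
            for (std::size_t j = k / 2; j > 0; j /= 2) {  // half-cleaner distance
                for (std::size_t i = 0; i < n; ++i) {
                    std::size_t l = i ^ j;                // compare-exchange partner
                    if (l > i) {
                        bool asc = (i & k) == 0;          // direction of this run
                        if ((asc && a[i] > a[l]) || (!asc && a[i] < a[l]))
                            std::swap(a[i], a[l]);
                    }
                }
            }
        }
    }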


Damn if only someone could write a fancy helper function so that everyone doesn't have to wrestle with AVX... Oh, wait...


Writing a virtual machine or compiler is always hard! If you write code for the JVM, it will now use these instructions under the hood while you keep writing nice clean Java…


> To me, this is why AVX didn't get widespread use.

This is flat out false. The rest of the comment is therefore superfluous.



