Why would there be any difference?

amluto · on June 28, 2020

Because every x86_64 CPU supports SSE2, so compilers can assume it exists and the ABI passes floating point arguments in SSE2’s FP registers. This means that every sane compiler for x86_64 will use SSE2 and won’t use x87, and the problem won’t occur.

In contrast, the 32-bit C ABI returns floating point values in an x87 register (arguments go on the stack), and there are 32-bit CPUs without SSE2, so x87 is used by default.

x87 is, in quite a few respects, awful.
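A rough way to see which mode a given build ended up in is the standard C99 macro FLT_EVAL_METHOD from <float.h>. This is only a sketch: the helper name fp_eval_description is made up for illustration, and the "typical" notes in the strings are my assumption about common GCC/Clang defaults, not a guarantee.

```c
#include <float.h>

/* Hypothetical helper: describes how this build evaluates FP expressions,
 * based on the standard C99 macro FLT_EVAL_METHOD. */
const char *fp_eval_description(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "each operation is rounded to its own type (typical with SSE2)";
    case 1:
        return "float and double are both evaluated as double";
    case 2:
        return "everything is evaluated as long double (typical with x87)";
    default:
        return "indeterminate or implementation-defined";
    }
}
```

Printing the result from a small main() and building once with -m32 and once with -m64 should show the difference, assuming a 32-bit toolchain is installed. Adding -msse2 -mfpmath=sse to the 32-bit build is the usual way to opt out of x87 there.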

ndesaulniers · on June 28, 2020

For the Linux kernel, things get slightly more complicated with -mgeneral-regs-only. (https://bugs.llvm.org/show_bug.cgi?id=30792 / https://reviews.llvm.org/D38479).

A few drivers use floating point values/calculations. Which way are they rounded when expressions are folded at compile time? Does that match what would happen at runtime?

The x86 kernel also generally disables SSE via -mno-{x87|sse|sse2|...}, so any FP code has to be wrapped in kernel_fpu_{begin|end}. This lowers the overhead of context switching, since there are fewer registers to save and restore; nowadays the SIMD vectors on x86 are huge.

Further, it uses an 8B stack alignment, which makes it so that when FP is used, the compiler cannot select instructions that require 16B aligned operands to be loaded/stored from/to the stack.

Finally, -Ofast implies -ffast-math, which in turn enables -funsafe-math-optimizations. And that's not "fun safe math." I've actually had folks at work use "-Ofast" because "why wouldn't I want my code to be fast?" and then ask why their reciprocals were wrong.
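To make the reciprocal complaint concrete, here is a small sketch of the kind of rewrite that reciprocal math permits: replacing a / b with a * (1 / b). The function names are hypothetical, and the second function just spells out the rewrite by hand so the effect is visible without any fast-math flags. It assumes IEEE double arithmetic done in SSE2, not x87 extended precision.

```c
/* Plain IEEE division: the result is correctly rounded. */
double divide(double a, double b)
{
    return a / b;
}

/* Hand-written version of what a reciprocal-math rewrite may produce:
 * 1.0 / b is rounded once, then the multiply rounds again. */
double divide_via_reciprocal(double a, double b)
{
    double r = 1.0 / b;
    return a * r;
}
```

With a = b = 49.0, divide() gives exactly 1.0, while divide_via_reciprocal() gives the double just below 1.0, because the rounding error in 1.0 / 49.0 gets multiplied back up.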

amluto · on June 28, 2020

> For the Linux kernel, things get slightly more complicated with -mgeneral-regs-only. (https://bugs.llvm.org/show_bug.cgi?id=30792 / https://reviews.llvm.org/D38479).

That looks like an ARM issue.

> Further, it uses an 8B stack alignment, which makes it so that when FP is used, the compiler cannot select instructions that require 16B aligned operands to be loaded/stored from/to the stack.

On x86_64, on a recent GCC, this all works correctly. If you end up with a 16-byte-aligned variable (like an SSE vector) on the stack, GCC will correctly align the stack in the prologue. On older GCCs, there was a series of obnoxious I-know-better-than-you-isms in the command line option handling that made this all malfunction.

I seem to be missing the Share button on godbolt.org, so no link. But you can test with:

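    /* 16-byte vector type, so locals of this type need 16-byte alignment. */
    typedef int v4si __attribute__ ((vector_size (16)));
    v4si *v;

    int func(void)
    {
        v4si z;
        v = &z;  /* let the address escape so z has to live on the stack */

        return 0;
    }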

Add -mpreferred-stack-boundary=3 to the options.
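If you would rather check at runtime than read the prologue, a variant along these lines should work. It is only a sketch: the names escaped_addr and local_vector_is_aligned are made up, and the address goes through a volatile so the compiler can't just assume the alignment and fold the check to a constant.

```c
#include <stdint.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* Volatile so the alignment check below is done on the real address. */
static volatile uintptr_t escaped_addr;

/* Returns 1 if a v4si local ended up 16-byte aligned on the stack. */
int local_vector_is_aligned(void)
{
    v4si z = {1, 2, 3, 4};

    escaped_addr = (uintptr_t)&z;
    return (escaped_addr % 16) == 0;
}
```

On a compiler that realigns the stack correctly, this should return 1 even when built with -mpreferred-stack-boundary=3.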

olliej · on June 27, 2020

My guess is that they’re using long double, and the ABI definition of long double may be different on x86_64 (e.g. on Windows, long double is effectively just double).