There are two ways to approach x87: either saying to heck with it and just using doubles for everything (this is essentially what Qemu does) or creating a software fp80 implementation. Both approaches get burned by the giant amount of state, and state weirdness, that x87 brings to the table. It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

But even on native hardware, using x87 is vastly slower than fp64, and it's just a shame that only win64 had the good sense to define long double as fp64 instead of fp80, as every other x86_64 platform did :-/




> It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

I've been meaning to look into this. Certainly you can't blindly optimise all x87 code sequences to fp32 or fp64. But some sequences are safe.

For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case; I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further: if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87).

Same goes for multiplication of two numbers (and of N numbers that are all provably >= 1.0).

The question is whether such code sequences are common enough to be worth identifying at compile time and optimising.


> For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case; I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further: if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87).

No, you cannot. The operations you can optimise are negation, NaN checks, and infinity checks (ignoring pseudo-NaNs and pseudo-infinities of course).

fp80 has a 15-bit exponent and, functionally, a 63-bit significand, vs fp64's 11-bit exponent and 52-bit significand. When setting x87 to a reduced precision mode you aren't switching to fp64, you're getting a mix with a 15-bit exponent and a 53-bit significand. The effect is that you retain 53 bits of precision for values where fp64 would have entered subnormals, and conversely you maintain 53 bits of precision after fp64 would have overflowed. There were perf benefits to reducing precision in x87 (at least in the 90s), but the main advantage is consistent rounding with fp64 while in the range of normalised fp64 values.
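
A quick C sketch of that exponent-range point (my own illustration; assumes glibc's <fpu_control.h> and an x86 target where long double is fp80): even with precision control dropped to 24 bits, a value far outside fp64's range stays finite in the register format.

    #include <stdio.h>
    #include <fpu_control.h>   /* glibc, x86 only */

    int main(void) {
        fpu_control_t cw, oldcw;
        _FPU_GETCW(oldcw);
        cw = (oldcw & ~_FPU_EXTENDED) | _FPU_SINGLE;  /* PC = 24-bit significand */
        _FPU_SETCW(cw);

        volatile long double a = 1e300L;
        volatile long double b = a * a;   /* ~1e600: overflows fp64, not fp80 */

        _FPU_SETCW(oldcw);                /* put the control word back before calling libc */
        printf("%Lg\n", b);               /* still finite, despite "single" precision */
        printf("%g\n", (double)b);        /* inf once narrowed to fp64 */
        return 0;
    }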


The key to this optimisation idea is that the exponent gets truncated back to 8 bits when being written back to memory.

Take the example of two fp32s adding to infinity.

With a 15-bit exponent: The add results in a non-infinite value with an exponent outside the -126 to 127 range. Then when writing back to memory, the FPU notices the exponent is outside the valid range, clamps it, and writes infinity to memory.

With an 8-bit exponent: The add immediately clamps to infinity in the register, and then that infinity is written to memory.

In both cases you get the same result in memory, so the result is valid as long as the in-register version is killed. And the same should apply to the subnormal case (I have not double-checked the x87 spec). If you start with two subnormals that are valid f32s, add them, get a subnormal result and then write back to memory as an f32, it should be guaranteed to produce the same result with both a 15-bit exponent and an 8-bit exponent. It doesn't matter whether the subnormal mantissa was truncated before writing back to the register or when writing back to memory; it was still truncated.

You only start getting accuracy issues once you do multiple additions in a row without truncating the exponent. If you add 3 floats, the result of A+B might be infinite with an 8-bit exponent, yet A+B+C could still come out as a normal f32 with a 15-bit exponent (when A+B overflows the f32 range and C is negative, pulling the sum back into range).
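
A quick way to convince yourself (my own toy values; fp64 here is just standing in for "same add, wider exponent range", and the narrow path assumes SSE float math as on default x86-64 builds):

    #include <stdio.h>

    int main(void) {
        volatile float a = 3e38f, b = 3e38f, c = -3e38f;  /* FLT_MAX is ~3.4e38 */

        /* Add two numbers, then store: both paths agree, inf lands in memory. */
        float two_narrow = a + b;                   /* saturates to +inf immediately */
        float two_wide   = (float)((double)a + b);  /* finite 6e38 internally, +inf on store */
        printf("%g %g\n", two_narrow, two_wide);    /* inf inf */

        /* Add a third number: the wide-exponent path comes back into range, the narrow one is stuck. */
        float three_narrow = a + b + c;                   /* inf + -3e38 = inf */
        float three_wide   = (float)((double)a + b + c);  /* 6e38 - 3e38 = 3e38 */
        printf("%g %g\n", three_narrow, three_wide);      /* inf 3e+38 */
        return 0;
    }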

This line of thought could potentially be pushed further. If you can prove (or guard) at compile time that all N floats in a sequence of N adds will be positive (and not subnormal), then you can't have a case where one of the intermediary exponents exceeds 127 but the final exponent is less than 128. If there is an infinity anywhere along the chain, it will saturate to infinity. With a 15-bit exponent, the saturation might not be applied until the f32 value is written to memory at the end, but because of the preconditions the optimiser can guarantee the same result in memory (either infinity or a normal) while only using 8-bit exponent operations.

Most of the above should also apply to other operations like multiply. I've only done some preliminary thinking about this idea, enough to be sure that some operations could be optimised. I'm only fully confident about the clamping-to-infinity case, and I'm going to be really bummed if, when I get around to double-checking, there is something about how the x87 deals with subnormals that I'm not aware of. Or some other x87 weirdness.


> The key to this optimisation idea is that the exponent gets truncated back to 8 bits when being written back to memory.

That's incorrect - the clamping of the exponent only occurs if you were to use FST/m32 or FST/m64, but if you're using x87 you're presumably doing FSTP/m80fp, so there is no truncation or rounding on store, regardless of the precision flag in the control word.

It sounds like what you're trying to arrange is an optimization such that a rosetta-like translator/emulator can optimize this highly awesome function to be performed entirely using hardware fp32 or fp64 support:

    fp32 f(fp32 *fs, size_t count) {
      // pseudo code obviously :D
      ControlWord cw = fstcw();
      cw.precision = Precision32;
      fldcw(cw);
      fp80 result = 0;
      for (unsigned j = 0; j < count; j++) {
        result += fs[j];
      }
      return (fp32)result;
      // pretend we restored state before returning :)
    }
The problem you run into, though, is that an optimization pass can't make decisions based on anything other than the code it is presented with. So your optimizer can't assume sign or magnitude here, so that += has to be able to under- or overflow the range that fp32 offers.

Things get really miserable once you go beyond +/-, because you can end up in a position where an optimization to do everything in the 32/64-bit units means that you won't reproduce the double rounding that is observable on real x87.
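
For anyone who wants to see one, here's a toy double-rounding case (my own values; assumes long double is x87 fp80 as on x86_64 linux, and SSE math for plain double): rounding the product to fp80's 64-bit significand first and then to fp64 lands one ulp away from rounding the exact product straight to fp64.

    #include <stdio.h>

    int main(void) {
        volatile double x = 0x1.8000000000001p+0;  /* 1.5 + 2^-52 */
        volatile double y = 0x1.0000000000001p+0;  /* 1   + 2^-52 */

        volatile double direct  = x * y;   /* one rounding, straight to fp64 */
        volatile double twostep = (double)((long double)x * (long double)y);

        printf("direct : %a\n", direct);   /* 0x1.8000000000003p+0 */
        printf("twostep: %a\n", twostep);  /* 0x1.8000000000002p+0 */
        return 0;
    }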

This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

More fun are the transcendentals - x87 specifies them as using range reduction and so they are incredibly inaccurate (in the context of maths functions), especially around multiples of pi/4, and if you go test it you'll find rosetta will produce the same degree of terrible output :D
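
If anyone wants to poke at that, something like this works (x86 with GCC-style inline asm assumed; the test value is mine): fsin does its range reduction against a ~66-bit internal pi, so arguments near multiples of pi lose a lot of relative accuracy compared to libm.

    #include <math.h>
    #include <stdio.h>

    static double x87_sin(double x) {
        double r;
        __asm__ volatile ("fsin" : "=t"(r) : "0"(x));  /* st(0) in, st(0) out */
        return r;
    }

    int main(void) {  /* link with -lm */
        double x = 3.141592653589793;        /* the double closest to pi */
        printf("fsin: %.17g\n", x87_sin(x));
        printf("libm: %.17g\n", sin(x));     /* ~1.2246467991473532e-16 */
        return 0;
    }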


I was thinking more about functions along the lines of this vertex transform function that you might theoretically find as hot code in a late-90s or early-2000s Windows game (before hardware transform and lighting).

    void transform_verts(fp32 *m, fp32 *verts, size_t vert_count) {
      // it's a game, decent chance it applies Precision32 across the whole process
      // Especially since DirectX < 10 automatically sets it when a 3d context is created
      while (vert_count--) {
        fp32 x = verts[0], y = verts[1], z = verts[2];
        verts[0] = x * m[0] + y * m[1] + z * m[2]  + m[3];
        verts[1] = x * m[4] + y * m[5] + z * m[6]  + m[7];
        verts[2] = x * m[8] + y * m[9] + z * m[10] + m[11];
        verts += 3;
      }
    }
Would be nice if we could optimise it all to pure hardware fp32 without any issues. But that's not really possible with those six-operation-long chains. And you are right, we can't really assume anything about the data.

But we can go for guards and fallbacks instead. Implement that loop body as something like

    loop:
        // Attempt calculation with hardware fp32
        $x = verts[0]; $y = verts[1]; $z = verts[2]
        $1 = hwmul($x, m[0])
        $2 = hwmul($y, m[1])
        $3 = hwmul($z, m[2])
        $4 = hwadd($1, $2)
        $5 = hwadd($3, $4)
        $6 = hwadd($5, m[3]) // any infs from the above sub-equations will saturate through to here
        if any(is_subnormal_or_zero([$1, $2, $3, $4, $5])) || $6 is inf: // guard
           // one of the above sub-calculations became either inf or subnormal, so our
           // hw result might not be accurate. recalculate with safe softfloat
           $8 = swadd(swmul($x, m[0]), swmul($y, m[1]))
           $6 = swadd(swadd($8, swmul($z, m[2])), m[3])

        verts[0] = $6

        // repeat above pattern for verts[1] and verts[2], then verts += 3

        goto loop

I think that produces bit-accurate results?

Sure, it might seem complicated to calculate twice. But the resulting code might end up faster than just pure softfloat code across average data. Maybe this is the type of optimisation that you only attempt at the highest level on a multi-tier JIT for really hot code. You could perhaps even instrument the function first to get an idea what the common shape of the data is.
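
For the guard itself I'm imagining something along these lines on the host side (names are mine, nothing fancier than fpclassify/isinf from <math.h>):

    #include <math.h>
    #include <stdbool.h>

    /* True if any intermediate was zero/subnormal or the final sum overflowed:
       in those cases the hardware-fp32 fast path may not match real x87,
       so take the softfloat fallback instead. */
    static bool needs_softfloat_fallback(const float partial[5], float sum) {
        for (int i = 0; i < 5; i++) {
            int c = fpclassify(partial[i]);
            if (c == FP_ZERO || c == FP_SUBNORMAL)
                return true;
        }
        return isinf(sum);
    }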

> This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

I guess for the rosetta usecase, Intel Macs didn't arrive until 2006, and so most of the install base of x86 programs will be compiled with SSE2 support, and commonly 64-bit.

Probably the most common usecase for x87 support in rosetta will be 64-bit code that used long doubles, where compilers/ABIs annoyingly implemented them as x87.


Sorry for delay (surgery funsies)

> So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

:D

But in practice the only reason for changing the x87 precision was performance, and the feature was then simply retained in hardware for backwards compatibility. Modern code (as in >= SSE era) simply uses fp32 or fp64, which is faster, more memory compact, has vector units, has a much more sane ISA, etc. Anyone who does try to toggle x87 mode in general is in for a world of hurt, because the system libraries all assume the unit is operating in its default state.

You are correct that the only reason x86_64 needs x87 is that the unix x86_64 ABI decided to specify the already clearly deprecated format as the implementation of long double. I often looked wistfully at win64, where long double == double.



