Virtually all architectures have floats implemented in hardware - including NaN behavior. This means that an IEEE 754 float comparison is done in a single instruction, typically executed in a single cycle.
Explicitly checking for NaNs would (at least) double the cycle count even when the NaN branch is not taken, and a fully custom comparison is even worse. This is not helped by the fact that there isn't a single NaN value: there are both positive and negative NaNs, and (in a single-precision float) there are 23 mantissa bits which are usually ignored, but for which some patterns do have specific meaning, such as the quiet/signaling bit.
To make it even worse, modern CPUs include vector extensions, allowing you to operate on multiple values at once. With AVX-512, you can compare 32 floats per clock cycle. I would not be surprised if switching to a custom comparison made some edge cases over 100x slower.
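To make the trade-off concrete, here is a minimal Rust sketch (the function names are mine, not from the thread): the plain `<` compiles down to the hardware comparison, while a NaN-checking wrapper pays for extra checks and a branch on every call, even when neither argument is a NaN.

```rust
/// Plain IEEE 754 comparison: a single hardware compare instruction.
/// A NaN simply makes every ordered comparison return false.
fn hw_less(a: f64, b: f64) -> bool {
    a < b
}

/// A custom comparison that branches on NaN first. Each call now pays
/// for the NaN checks even when no NaN is present, and the branching
/// also gets in the way of vectorization.
fn nan_checked_less(a: f64, b: f64) -> Option<bool> {
    if a.is_nan() || b.is_nan() {
        None // the caller has to decide what a NaN should mean
    } else {
        Some(a < b)
    }
}

fn main() {
    assert!(!hw_less(f64::NAN, 1.0)); // NaN compares as "not less"
    assert_eq!(nan_checked_less(f64::NAN, 1.0), None);
    assert_eq!(nan_checked_less(0.5, 1.0), Some(true));
}
```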
NaN is not a single value, it's a multitude of them: a NaN is any pattern with all exponent bits set to 1 and a non-zero mantissa, so a double has 2^53 − 2 possible NaN bit patterns (only 2^52 − 1 distinct values if you don't count the sign bit as part of the "NaN value", and roughly half that again if you interpret the signaling/quiet bit as a property of the value rather than part of it)
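A small illustrative sketch of that definition, using Rust's `f64::from_bits` (the specific payload values are arbitrary examples): any all-ones-exponent pattern with a non-zero mantissa is a NaN, regardless of the sign or which payload bits are set, while a zero mantissa gives infinity instead.

```rust
fn main() {
    let quiet_nan = f64::from_bits(0x7FF8_0000_0000_0000); // the "default" quiet NaN
    let payload_nan = f64::from_bits(0x7FF0_0000_0000_BEEF); // arbitrary non-zero payload
    let negative_nan = f64::from_bits(0xFFF8_0000_0000_0001); // sign bit set

    // All of these count as NaN, despite having different bit patterns.
    for x in [quiet_nan, payload_nan, negative_nan] {
        assert!(x.is_nan());
    }

    // All-ones exponent with a zero mantissa is infinity, not NaN.
    assert_eq!(f64::from_bits(0x7FF0_0000_0000_0000), f64::INFINITY);
}
```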
Yes. Many applications don't need a guarantee that each collection contains at most one NaN and at most one zero, but they do run into trouble when some items are not equal to themselves.
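For instance (an assumed illustration in Rust, not taken from the thread), an element that isn't equal to itself silently breaks equality-based lookups:

```rust
fn main() {
    let v = vec![1.0_f64, f64::NAN, 3.0];

    // The NaN is in the vector, but equality-based search can't find it,
    // because NaN != NaN.
    assert!(!v.contains(&f64::NAN));
    assert_eq!(v.iter().position(|&x| x == f64::NAN), None);

    // A check that doesn't rely on self-equality does find it.
    assert!(v.iter().any(|x| x.is_nan()));
}
```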