A bit of background on compilers exploiting signed overflow

CJefferson · on May 8, 2016

One other advantage of signed overflow not listed here -- loop termination. Consider the following loop:

    for(int i = A; i <= B; i+=4) ...

This loop can be infinite if overflow is defined, and B is less than 4 from the largest value. Checking for this is a pain (and extra code), and in practice (almost) no-one would ever mean to write this and get an infinite loop when B is close to the top of the integer range.

Special case checking for this case is a pain for various kinds of loop unrolling / peeling, and also removing empty loops if they don't contain any code (which often comes up in practice with C++ templates).

cokernel_hacker · on May 8, 2016

Speaking as a professional compiler engineer, this is the biggest reason why signed overflow is treated as UB.

It is incredibly useful to talk about a loop's trip count or the values an induction variable can have.

It turns out that taking advantage of signed overflow lets us do that in more places, a boon to all the programs which don't dynamically execute signed overflow.

The tradeoff, however, is that the programing model gets way more complicated. I have talked to the engineer who introduced this optimizing power to our compiler and he sorta regrets it.

IMO, we need better, higher level ways of talking about a loop's iteration space. Today, in C or C++, we are forced to represent the iteration space of a loop using the language's algebra. This is severely limiting when each operation on a signed type might introduce UB. With a higher level mechanism, we can separate the behavior that we want for loops from the behavior we want for everything else in a nice, clean way.

pcwalton · on May 8, 2016

> With a higher level mechanism, we can separate the behavior that we want for loops from the behavior we want for everything else in a nice, clean way.

Yes! One decisive advantage that high-level iteration constructs have over for loops is that they're easier to optimize.

One obvious example here is bounds checks: there's no need to have infrastructure to optimize out bounds checks over arrays during loops if you have a construct that eliminates the indexing entirely.

btrask · on May 8, 2016

Genuine question: what do you do when you want to implement your own loop mechanism, iterator, array type, or whatever?

cokernel_hacker · on May 8, 2016

It is the job of the language to ensure that language primitives are reusable and composable.

I think this would apply to whatever mechanism is designed to describe a loop's iteration space.

Sean1708 · on May 9, 2016

Most languages use some sort of interface whereby if you implement certain methods for a type then that type can be iterated over using a for loop.

dsfuoi · on May 8, 2016

It's just one line. Put this before the loop if the code logically shouldn't trigger this.

  assert( B < B_TYPE_MAX - 4 );

Or use an if statement if it could trigger at runtime.

Also the code is clearer on intent.

pcwalton · on May 8, 2016

You misunderstood the problem. The problem isn't that programmers can't write loops that are easy to optimize. It's rather that, in practice, they don't write loops that are easy to optimize. That's ultimately a problem with the C language.

dsfuoi · on May 8, 2016

If the loops can be written and the programmers don't write them, the problem is with the language!?

No, it's pretty clear where the problem is with C, programmers.

pcwalton · on May 8, 2016

The easiest way to write the loop is the way that is hard to optimize. This is a fact that is beyond dispute: empirically, programmers use int to index over arrays in for loops without writing asserts. That is in fact a problem with the language.

dsfuoi · on May 8, 2016

programmers use int to index over arrays in for loops without writing asserts. That is in fact a problem with the language.

Ok let me get this clear, you're saying that the language is at fault because the programmers don't write correct code. I have no words.

stouset · on May 8, 2016

Emphatically yes.

When designing tools for use by humans, you need to accept the fact that humans, with all our failings, will be the ones using the tools. So tools should be made resilient against typical ways in which humans fail.

We've been paying for shortcomings in C's design for decades with bugs and security failures that simply don't happen in other languages. That you refuse to see this is baffling to me.

pcwalton · on May 8, 2016

> Ok let me get this clear, you're saying that the language is at fault because the programmers don't write correct code.

Yes.

A good language makes the correct thing the easy thing.

rdc12 · on May 9, 2016

When a common idiom becomes the wrong way to do something, someone messed up pretty badly (and preexisting idiom at that).

Arguably the designers of amd64 should have caught this before releasing the ABI or the language design could have been specified so this wasn't an issue in the first place.

haberman · on May 8, 2016

Can assert() actually function this way? Can you use assert to tell the optimizer it can assume something is true?

I've noticed that memcpy(x, y, 4) on x86 can generate very efficient code (register move), but on ARM it expands to something much more verbose because the addresses might not be aligned.

Could this effectively function as a way of promising to the compiler that the addresses are aligned?

    void move4_aligned(void *dst, const void *src) {
      assert(((uintptr_t)dst & 0x3) == 0);
      assert(((uintptr_t)src & 0x3) == 0);
      memcpy(dst, src, 4);
    }

quotemstr · on May 8, 2016

> Can assert() actually function this way? Can you use assert to tell the optimizer it can assume something is true?

Trivially.

  #ifdef NDEBUG
  # define assert_and_assume(cond) if(!(cond))     
    __builtin_unreachable((cond))
  #else
  # define assert_and_assume assert
  #endif

haberman · on May 9, 2016

Interesting! This assert_and_assume() seems strictly better than vanilla assert() for any predicates that don't have side effects. But I guess you have to be sure that the compiler is able to deduce that there aren't side effects and feels comfortable optimizing away the predicate in release mode.

nkurz · on May 8, 2016

Can assert() actually function this way? Can you use assert to tell the optimizer it can assume something is true?

Yes, but it would have the potentially surprising behavior that compiling your release version with -DNDEBUG might slow it down. We ran into this a couple years ago: https://plus.google.com/+DanielLemirePhD/posts/BTSams19Ero

Recent compilers support some extensions that can do it better. GCC and CLang use __builtin_assume_aligned(), and ICC uses __assume_aligned() (haven't tested the synonym).

rav · on May 8, 2016

The biggest takeaway is the last paragraph:

> On most of the machines you're likely to use, "size_t" for loop counters is a good idea where signed values aren't required. It's unsigned and generally as wide as addresses, so there's normally no extra code for zero/sign extends, and the type is standard.

Just work with size_t and be happy!

If you need to iterate through a sequence a[0], a[1], ..., a[n-1] in backwards order, the while-condition "i >= 0" of course won't work with size_t (which is unsigned); instead, use the following idiom:

for (size_t i = n; i--;) f(a[i]);

The above code will call f(a[n-1]), f(a[n-2]), ..., f(a[0]) with no issue, thanks to the post-decrement operator.

fdej · on May 8, 2016

Quick: how would you implement for (i = n; i >= 0; i -= k) with unsigned integers?

I just use a signed pointer-size type for all sizes and counters, and stay happy...

detrino · on May 8, 2016

Your loop:

    void TestSigned(int min, int max, int step) {
      for (int i = max; i >= min; i -= step) F(i);
    }

Becomes this with unsigned:

    void TestUnsigned(unsigned min, unsigned max, unsigned step) {
      while (true) {
        F(max);
        if (max < min + step) break;
        max -= step;
      }
    }

Now if I want to transform my loop to operate on pointers or iterators, the transformation is trivial:

    void TestUnsigned(unsigned *min, unsigned *max, unsigned step) {
      while (true) {
        F(*max);
        if (max < min + step) break;
        max -= step;
      }
    }

Quick: Do the same for yours.

dsfuoi · on May 8, 2016

It looks easy but it isn't.

The second example causes unexpected behavior (and probably ub later) if: min + step > MIN_TYPE_MAX

The last example causes undefined behavior if: max - min < step - 1.

pbsd · on May 8, 2016

Something like for(size_t i = 0; i <= n; i += k) { work_with(n - i); }?

This sort of loop always comes up in these discussions, and I was never convinced that signed integers were worth it for that alone.

Sean1708 · on May 9, 2016

To be honest I think there are good arguments to both sides, and I don't think there is really a one size fits all solution here. If you use unsigned you've got to pay attention to certain corner cases, if you use signed you've got to pay attention to different corner cases, it's up to you to decide which is more common in your code.

tzs · on May 8, 2016

If n is not within k of the maximum unsigned integer:

  i = n+k;
  do {
    i -= k;
    ...
  } while (i >= k);

If n is close enough to the maximum unsigned integer for n+k to overflow:

  i = n;
  goto skip_sub;
  do {
    i -= k;
  skip_sub:
    ...
  } while (i >= k);

CJefferson · on May 8, 2016

My bigger problem in the past has been loops of the type:

    for(int i = 0; i < a-b; ++i) f(a[i]);

If a < b, then this works cleanly, with no loops occuring. However:

    for(unsigned i = 0; i < a-b; ++i) f(a[i]);

Instead casts a-b to a large unsigned integer, and horrible things happen. Of course, I could add a pre-check to my loop, but I don't really want to (and of course, I have to remember to do that).

tacos · on May 8, 2016

If you up the warnings, you'll even get "signed/unsigned" mismatch if you fail to use size_t with STL containers with gcc.

ape4 · on May 8, 2016

If its a standard loop, the compiler could replace `int` with `size_t` as an optimization.

msbarnett · on May 8, 2016

This amounts to exploiting undefined overflow. It's literally the optimization the link describes.

skybrian · on May 8, 2016

It sounds like making the width of ints implementation specific in C was a failure. It was supposed to allow better performance (allowing it to be set to the register width) but instead we get the worst of both worlds - portable code can't depend on its size but compilers can't practically change the size.

cjensen · on May 9, 2016

There's an added complication: in C, there are just five[1] names available for signed values: signed char, short, int, long, and long long. Additional, the number of bits assigned to each of those must be >= to the previous name.

Suppose your C implementation wants to have signed types for 8-bit, 16-bit, and 32-bit values. If int is 64-bits, then you literally don't have names left to label those lesser-sized signed types. So in practice, int can be no more than 32 bits.

The solution is to always use size_t as a count or size of anything. Uses offset_t for a difference between size_t values, or a difference between pointer values.

An alternate practice (one I subscribe to) is to use unsigned as your default type instead of int. In this practice, 'int' means a variable which can be negative. In real programs, the vast majority of variables never contains negative values.

[1] char can be either signed or unsigned, so let's ignore it for this. Also, let's ignore the new char type names introduced in recent C++.

userbinator · on May 8, 2016

The other thing to consider is that "x86-64" or "amd64" is not really 64-bit. Operand sizes are 32 bits by default (except the addressing modes, which leads to this mess), and you can only use 64-bit widths with an extra prefix byte --- on each instruction. There are no 64-bit equivalents of some instructions. Compared to the 16-to-32-bit transition that happened with the 386, the 32-to-64 just does not seem anywhere near as well thought-out.

Then again, it makes sense to leave operand sizes at 32 bits, as otherwise code would be twice as big, with the accompanying consequences for caching etc.; in fact I'd be willing to bet that 4 billion representable values is more than a lot of applications would ever need to put in an int. Extending that to 64 bits is really for addressing and the other cases where that exponentially larger range is truly needed.

ryuuchin · on May 8, 2016

How much of an issue is this really though?

Isn't the worst case a 1 cycle penalty for using a 32-bit int on x86-64 in the manner described in the article?

witty_username · on May 8, 2016

But that's repeated in the loop. So, looks like around a 10-20% performance change. Quite a bit for something not obvious.

ryuuchin · on May 8, 2016

Perhaps in a really tight loop but the counter is still going to be several iterations ahead thanks to the out of order capabilities of the CPU. You may also have other things in play as well such as the micro-op cache depending on the code size of the loop.

brigade · on May 8, 2016

The worst case is extending the counter each time it's changed, which is once per loop. It's not going to be a measurable difference unless the loop is tight enough to take like 3 cycles per iteration or less.

askee · on May 8, 2016

I find the wording somewhat peculiar: 'the resulting address will change dramatically, by (4 - 2^(32+2)). In 32-bit mode (all addresses modulo 2^32), this is just an increase by 4 as usual; the "wraparound" part is a multiple of 2^32 and hence invisible'

Why would it matter that the address will change "dramatically"? Even if it didn't (and only change by say 16 bytes or whatever) you'd have to deal with it. The problem is that int overflows at 2^31 while long/a larger type doesn't. In order to cope with the overflow you'll have to "simulate" an overflow in the larger type as well (or at least do what the smaller type does) which is achieved via a sign-extend.

nkurz · on May 8, 2016

I've recently been diving into low-level loop optimization on current Intel x64 processors. It's a rabbit hole, but makes me think there are considerably better optimizations that compilers could be making. Here are some thoughts on this "naive" loop example as they would execute on Skylake, with reference to sections in the (excellent) Intel Architectures Reference Manual: http://www.intel.com/content/dam/www/public/us/en/documents/...

First, since the µops have already been decoded, instruction length doesn't really matter. Also, the limiting factor in this loop is not the execution ports, but the "renamer". While we can supply 6 µops per cycle [2.1.1] and dispatch and execute up to 8 µops per cycle [2.2.2] (limited by instruction mix) only 4 (possibly fused) µops can be moved from the Decoded Instruction Queue to the Scheduler per cycle. [2.3.3.1]

Using "xor %reg -> %reg" for zeroing a register is considered a special "zero idiom" (along with several other forms), and does not need an execution port [2.3.3.1]. It is handled in the renamer with zero latency. It does however take up one of the 4 available renamer slots, but since it's outside the loop here this doesn't matter.

The "test %ecx -> %ecx; jng done" sequence is "macro-fused" to a single µop. It's a single both for the renamer and for execution. These instructions need to be consecutive for this to happen. Again, outside the loop so no real difference.

At the top of the loop, "movsxd %ebx -> %rdx" is a single µop, 1 cycle latency, and can execute on any of the 4 arithmetic ports. While it can't be substituted here, it's interesting to note that if sign extension is not needed, "mov %ebx -> %edx" and "movzx %bl -> %edx" are specially treated as a "zero latency move": 0 latency, no execution port, but do use a renamer slot.

The add from memory "add [%rsi+%rdx*4] -> %eax" is internally treated as a single "micro-fused load op" when stored in the Decoded Instruction Queue. But because it uses an addressing mode with an index register, it is "un-laminated" as it is transferred to the scheduler [2.3.2.4], and thus uses two renamer slots. Since we only have 4 available slots per cycle, this can be a major reason to prefer a non-indexed address like "add [reg+offset] -> eax" as in his unrolled example.

"inc %ebx" can perform poorly as an instruction, since it updates only part of the flag register (leaves "carry" unmodified). While it doesn't stall post-Sandy Bridge, in some cases it may require an extra µop to merge the flag register. Intel suggests avoiding it, and using "sub $1 -> %ebx" instead [3.5.2.6].

The "cmp %ecx, %ebx; jl lp" pair are fused and treated as a single µop. This is good, and loop termination should almost always keep these instructions consecutive for this reason. Even better would be to arrange the loop so that one can update the loop counter and then check the flags resulting from the arithmetic. The usual way to do this is to count down to zero instead of up to a maximum: "sub $1 -> %ebx; jnz lp". Possibly worth noting is that fusion with add/sub does not occur with jno (jump not overflow) or jns (jump not sign), so terminating at zero is usually the best choice.

(Note that I intend this as supplementary information to the article, and not a criticism. While I agree with CJefferson and cokernel_hacker that loop termination is a more important reason for undefined overflow, I think the technical details in the article are correct.)

akkartik · on May 8, 2016

There's been a lot of discussion about undefined behavior on HN recently, and I'm gradually starting to question if the conventional approach to compilers might be a giant dead-end. In this case, our conventional languages provide no way for a programmer to say "I care about overflow". Because face it, the vast majority of programs a compiler sees are written extremely sloppily. Most of the time there's this huge disconnect between the amount of effort the programmer and the ghost of the compiler writer are putting into considering such situations.

Now, most of the time this disconnect is a good thing. Division of labor ensures that the work of the compiler writer is amortized across many many programmers. But the discussion around undefined behavior is a sign that the load on the compiler writer has gotten too onerous, to the point that everybody is paying a cost (in reduced performance, in increased compiler complexity leading to reduced compiler stability, in compilers antagonistically interpreting standards like lawyers) for a level of correctness along one narrow dimension that only a tiny fraction of users actually care about.

What if we instead said, "compilers will never ever consider overflow" and let those who care about overflow fend for themselves in the source language? Among other things that would allow integers to always be the word size of the computer they're running on.

(My interest in these questions stems from trying to program in an idealized assembly language for the past year: http://github.com/akkartik/mu. After writing 10kLoC of assembly to build a simple text editor (http://akkartik.name/post/mu) I'm surprised to find it quite ergonomic. High level languages today seem to provide three kinds of benefits over traditional/ugly assembly: expressiveness (e.g. nested expressions, classes), safety (e.g. type checking) and automation (e.g. garbage collection). An idealized assembly language gives up some expressiveness, but doesn't seem to affect the other benefits. As such, it's at an extreme end of the trade-off I describe. I'm hoping my experience will largely translate even when Mu eventually implements a high-level language.)

Edit 21 minutes later: For reference, here's how I've been thinking about undefined behavior while building Mu: https://github.com/akkartik/mu/blob/5dd82a8a70/001help.cc#L4.... Comments most welcome.

mempko · on May 8, 2016

With every additional compiler check, you reduce the set of possible programs. It would be interesting if the checks could be easily programatically defined and inserted by those who care. I think Perl6 is probably the best modern compiler that allows this.

I agree, it should not be the compilers job to decide the set of possible programs. The hardware guys figured this out long ago. Afterall, CPU registers are untyped.

akkartik · on May 8, 2016

Can you elaborate on what checks Perl6 permits programmers to ask for?

raiph · on May 9, 2016

Aiui you can have any checks you want.

Aiui, simplifying, every operation is a function call so you just write a function body that corresponds to what you want the operation to be with the checks you want.

jmgao · on May 8, 2016

> In this case, our conventional languages provide no way for a programmer to say "I care about overflow".

GCC and clang have builtin operations that return whether they overflowed (bool __builtin_add_overflow(T a, T b, T* out), etc.). Similarly, Rust has wrapping, checked and saturating versions of all of the operations as methods on the integer types.

nkurz · on May 8, 2016

In this case, our conventional languages provide no way for a programmer to say "I care about overflow".

What if we instead said, "compilers will never ever consider overflow" and let those who care about overflow fend for themselves in the source language?

Does specifying an an unsigned type like size_t (as the author suggests in the last paragraph) achieve most of this? It's slightly backward, in that it's telling the compiler that overflow is not explicitly prohibited, but I think it usually has the same effect of preventing optimizations based on the assumption that signed overflow is not allowed.

akkartik · on May 8, 2016

Yeah, it's probably as reasonable a hack as you can hope to find in the context of C. The thing to watch out for is that you'd end up with different semantics in arithmetic operations. Right-shift wouldn't sign-extend, for example, that's the first one that comes to mind.

Weirdly enough, in my project above I ended up eschewing unsigned ints and just relying on signed numbers everywhere. I have to take a big hit on performance by adding a couple of asserts that consistently enable this without undefined behavior (and I might still be missing something), but performance hasn't been an issue so far. I just don't want to think about whether I might want to perform subtraction down the road when I create a new int. "Unsigned integer" is an oxymoron since integers are defined to include negative numbers.

nkurz · on May 8, 2016

"Unsigned integer" is an oxymoron since integers are defined to include negative numbers.

That's an odd definition of an oxymoron. While "integer" if unmodified does include negative numbers, isn't "unsigned" just an adjective, restricting the general category to the subset to the non-negative subset? For example, is "white swan" an oxymoron because swans can be both black and white?

Interestingly, my first reaction reading your linked code was that not all "integer overflow" is undefined, only signed integer overflow. Regardless of technical correctness, I thought about suggesting that it would read more clearly if you specified "signed" and "unsigned" rather than using "integer" to refer to what programmers refer to as "signed".

akkartik · on May 8, 2016

Yes, both are fair points. "Oxymoron" was a lazy jab I just came up with on the fly.

You're right that only signed ints admit undefined behavior. Unfortunately you can't get by using just unsigned ints. So for me the choice was between mixing signed/unsigned and just using signed everywhere. If use of signed ints is unavoidable and also leads to undefined behavior, I'd like to exercise it more often in hopes of learning faster what things I shouldn't be doing. This seems like the more antifragile choice (http://www.amazon.com/Antifragile-Things-That-Disorder-Incer...).

pepijndevos · on May 8, 2016

Are there any example of how big projects deal with this? Like, using int64_t and friends.

I'm also surprised that there does not seem to be an option to tell GCC to use 64 bit ints if this really produces better code.

https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_00...

GlitchMr · on May 8, 2016

If you could change meaning of int to be 64-bit type, then your code ABI (https://en.wikipedia.org/wiki/Application_binary_interface) would be essentially incompatible, and you wouldn't be able to use code (such as standard library, or other headers provided by operating system libraries) that was compiled on a compiler where int was 32-bit, as for instance wrong structure fields would be read.

If you would try to "solve" that by assuming headers use 32-bit int definition, then you would break calling your own code, as you would use 32-bit int definition provided in headers, while your code was called with 64-bit int convention.

It's probably just easier to define your own int meaning type, say, a shorter alias of int_fast32_t.