As I learned from Haskell, the correct answer really ought to depend on the domain. I ought to be able to use an unsigned integer to represent things like length, for instance.
However, since apparently we've collectively decided that we're always operating in ring 256/65,536/etc. instead of the real world where we only operate there in rare exceptional circumstances [1], instead of an exception being generated when we under or overflow, which is almost always what we actually want to have happen, the numbers just happily march along, underflowing or overflowing their way to complete gibberish with nary a care in the world.
Consequently the answer is "signed unless you have a good reason to need unsigned", because at least then you are more likely to be able to detect an error condition. I'd like to be able to say "always" detect an error condition, but, alas, we threw that away decades ago. "Hooray" for efficiency over correctness!
[1]: Since this seems to come up whenever I fail to mention this, bear in mind that being able to name the exceptional circumstances does not make those circumstances any less exceptional. Grep for all arithmetic in your choice of program and the vast, vast majority are not deliberately using overflow behavior for some effect. Even in those programs that do use it for hashing or encryption or something you'll find the vast, vast bulk of arithmetic is not. The exceptions leap to mind precisely because, as exceptions, they are memorable.
If your domain is 0..99 for example, using unsigned and <100 is nicer than using signed and two conditionals. Detecting overflows is easier, not harder.
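A minimal illustration (function names invented for the example):

#include <stdbool.h>

/* Domain 0..99: one comparison suffices for unsigned,
   while the signed version needs both bounds checked. */
static bool valid_u(unsigned v) { return v < 100; }
static bool valid_s(int v)      { return v >= 0 && v < 100; }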
In C, unsigned arithmetic also involves less UB.
And while the enforcement isn't there, it is still better to communicate your intent with the more precise type.
> the correct answer really ought to depend on the domain
This is correct. However, if you use C, your domain is bits and bytes, not abstract things — because this is what C is created to work with. I love using appropriate abstraction levels for different domains, but if your actual problem has nothing to do with systems programming, you probably shouldn't use C in the first place.
In theory, I agree. I mean that seriously, not as a rhetorical sop. However, in practice, it has wrought endless havoc even in C, suggesting, well, exactly my current feelings, which is that it's really insane in 2015 to use C for much of anything. C is the modern-day programmer's SCP-572 [1], except the sharp blow to the head doesn't usually work on a C programmer.
We have not "decided that we're always operating in ring 256/65,536/etc". We decided that being a little more efficient can save a room in the computer's building. If correctness becomes harder, well, there's an acceptable price. The programmer is always in the building anyway.
An obsolete decision, sure, but I wouldn't call it incorrect.
This raises the question, though: why don't we revert it in new architectures that are already incompatible with everything? Tradition, I guess?
> A) exception/panic on overflow: this is preferable to overflow but also undesirable behavior.
I don't find this undesirable at all. I have worked in languages where this is the default (Standard ML) and it's quite nice. You can always explicitly use a word-type if you actually want overflow. With hardware support, it is even fairly cheap to implement, as you just connect the overflow bit of your integer ALUs to something interrupt-triggering.
> With hardware support, it is even fairly cheap to implement, as you just connect the overflow bit of your integer ALUs to something interrupt-triggering.
You can exploit the IEEE-754 inexactness exception with doubles if you're okay with 53 bits of integer precision. Tim Sweeney showed me this trick a long time ago, which he combined with run-time code patching so that an arithmetic operation would be speculatively compiled as either fixnum or bignum based on observed run-time behavior. It'd compile it to fixnum by default with an overflow trap using the FP inexact exception. The trap would patch the fixnum site with the equivalent bignum code and resume. With hindsight inspired by modern trace compilers like LuaJIT, the better approach is probably to compile entire traces speculated on either case, so you can hoist most of the guard predicates and not speculate based solely on the proximate call site but on longer traces.
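A rough sketch of the flag mechanism described above (not Sweeney's actual code): do the arithmetic in doubles and use the IEEE-754 inexact flag to notice when the exact integer result no longer fits in the 53-bit significand. A real fixnum scheme would also bound the magnitude, and strictly needs #pragma STDC FENV_ACCESS ON.

#include <fenv.h>
#include <stdbool.h>

/* Returns true if a + b was computed exactly (still a safe "fixnum"),
   false if the inexact flag fired and bignum code should take over. */
static bool add_is_exact(double a, double b, double *out)
{
    feclearexcept(FE_INEXACT);
    *out = a + b;
    return !fetestexcept(FE_INEXACT);
}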
I can't get past the hubris involved with this post. First, did you truly need Haskell to learn that number representations have pitfalls? Look back in history. Problems with numbers did not begin with computers. Nor will they end with them.
Second, you imply that this would have been an easily solvable problem. If only those that came before you hadn't been so willing to shout "Hooray" and make a choice you disagree with.
I don't think you have to have respect for those that came before you. But blithely ignoring their context when they made these choices is... rude. At best. (It also ignores that many early computers did not use 2's complement numbers, so it was possible to keep adding up to bigger and bigger numbers; you would just run out of space and cause other problems.)
Finally, overflow and underflow are pretty much endemic to all physical things. Ever accidentally pour too much water into a cup? Why didn't you pick the correct cup for the domain of liquid you were dealing with? Overfill a laundry machine? Run out of ink in the pen you were using? Reach for a sheet of paper to find you had none left? ...
Unsigned integers have a big "cliff" immediately to the left of zero.
Its behavior is not undefined, but it is not well-defined either. For instance, subtracting one from 0U produces the value UINT_MAX. This value is implementation-defined. It's safe in that the machine won't catch on fire, but what good is that if the program isn't prepared to deal with the sudden jump to a large value?
Suppose that x and y are small values, in a small range confined reasonably close to zero. (Say, their decimal representation is at most three or four digits.) And suppose you know that x < y.
If x and y are signed, then you know that, for instance, x - 1 < y. If you have an expression like x < y + b in the program, you can happily change it algebraically to x - b < y if you know that overflow isn't taking place, which you often do if you have assurance that these are smallish values.
If they are unsigned, you cannot do this.
In the absence of overflow, which happens away from zero, signed integers behave like ordinary mathematical integers. Unsigned integers do not.
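A tiny demonstration of the difference (values invented):

#include <stdio.h>

int main(void)
{
    unsigned x = 0, y = 5, b = 1;
    printf("%d\n", x < y + b);   /* 1: the original comparison is true */
    printf("%d\n", x - b < y);   /* 0: x - b wrapped around to UINT_MAX */
    return 0;
}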
Check this out: downward counting loop:
for (unsigned i = n - 1; i >= 0; i--)
{ /* Oops! Infinite! */ }
Change to signed, fixed! Hopefully as a result of a compiler warning that the loop guard expression is always true due to the type.
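For reference, a sketch of both fixes: the signed version, and the unsigned post-decrement idiom that also handles n == 0.

#include <stddef.h>

static void walk_down_signed(int n)
{
    for (int i = n - 1; i >= 0; i--) { /* use i */ }
}

static void walk_down_unsigned(size_t n)
{
    for (size_t i = n; i-- > 0; ) { /* use i; also correct for n == 0 */ }
}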
Even worse are mixtures of signed and unsigned operands in expressions; luckily, C compilers tend to have reasonably decent warnings about that.
Unsigned integers are a tool. They handle specific jobs. They are not suitable as the reach-for all purpose integer.
> Suppose that x and y are small values, in a small range confined reasonably close to zero. (Say, their decimal representation is at most three or four digits.)
There's the rub! You need to justify characterizing your values in that way, which means either explicit range checks and assertions, or otherwise deriving them from something that applies that guarantee in turn. And that justification is more work than just making your code correct for every value. I mean, if x and y are both smallish, is x*y also smallish?
The "nearby cliff" is a good thing, in that it makes errors come out during testing rather than a month after you ship. Handwaving about "reasonably close to zero" is begging for trouble.
In the absence of automatic bigints, unsigned integers are easier to make correct.
The most dramatic example is the INT_MIN/-1 case, since that causes an outright crash.
For example, Windows has the ScaleWindowExtEx function, which scales a window by some rational number, expressed as the ratio of two ints. Using signed arithmetic is already suspicious: what does it mean to scale a window by a negative fraction? But of course they forgot about the INT_MIN/-1 case, and the result is a BSOD. http://sysmagazine.com/posts/179543/
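The guard itself is short once you know to write it; a sketch (not the actual Windows fix):

#include <limits.h>

/* Divide, reporting failure instead of trapping on the two
   cases that are undefined for signed division in C. */
static int checked_div(int a, int b, int *quot)
{
    if (b == 0 || (a == INT_MIN && b == -1))
        return 0;            /* division by zero or INT_MIN / -1 */
    *quot = a / b;
    return 1;
}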
Every single one of those cases involves integers that necessarily must be signed in order for the interface to work. They're implementing an interpreted language with signed integers, or, in the one other case (ScaleWindowExtEx), manipulating values that are signed and that have meaning when negative. The one place I saw an INT_MIN / -1 bug (in code review, iirc) was also for an interpreted language implementing a modulo operator. These are bugs with signed integer usage, but they're not bugs caused by a decision to use signed integers, because in these cases, there was no choice. They aren't representative of what you see when you do have a choice, say, using signed integers for array indices and sizes.
The question of what bugs you actually see was meant to be a personal one, not one of some dramatic bug you've read about on the internet. The answer to which is the better choice is determined by how much damage is caused by one choice versus the other, and you get that answer by noting how frequently you get bugs in practice as a result of such decisions. (Not that thought experiments and imagination have no place, but this is clearly a question where you can talk yourself into any direction.) For example, I've never had problems with C/C++ undefined behavior of signed integer overflow, while you're spending a lot of time talking about that. I have seen bugs caused by unsigned and signed integer usage that fit into other categories, though.
Without signed integers, the scaling function will turn arguments like (-4, -3) to garbage.
In order to disallow negatives, you need signed arguments, or else to reduce the range. (Say the numerator and denominator cannot exceed 255 or whatever). Otherwise the function has no way of knowing whether argument values (UINT_MAX-3, UINT_MAX-2) are a mistaken aliasing of -4, -3 or deliberately chosen positive values.
> but what good is that if the program isn't prepared to deal with the sudden jump to a large value
Aaaand you only need to care about this case.
For an unsigned int just check: x < (your max value)
For signed ints: x < (your max value) AND x > 0
Oh, the downward counting loop example. Because that's done very frequently, no? I really don't remember when the last time was that I did that (I'd much rather have an upwards loop and then y = MAX_VALUE - x, adding 1 if needed).
Quite funnily, if you write a loop in x86 assembly it is naturally downward counting: the loop ends when ECX reaches zero.
Don't use a for, just: i = n; while (i) { i--; /* i runs from n - 1 down to 0, and n == 0 is handled */ }
A key difference is whether the type is meant to be an index (counting) or for arithmetic. Indexing with unsigned integers certainly has its pitfalls:
while (idx >= 0) { foo(arr[idx]); idx--; }
But this is outweighed by the enormous and under-appreciated dangers of signed arithmetic!
Try writing C functions add, subtract, multiply, and divide, that do anything on overflow except UB. It's trivial with unsigned, but wretched and horrible with signed types. And real software like PostgreSQL gets it wrong, with crashy consequences: http://kqueue.org/blog/2012/12/31/idiv-dos/#sql
Well, there are builtin functions that do this safely for signed integers.
Of course that means you need to be aware which calculations are potentially dangerous, you need to know about signed integer overflow problems in the first place, etc. ..
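For example, with the GCC/Clang builtin versus a portable pre-check (a sketch; the builtin needs a reasonably recent compiler):

#include <limits.h>
#include <stdbool.h>

/* GCC/Clang: returns true on overflow, *sum receives the wrapped value. */
static bool add_overflows_builtin(int a, int b, int *sum)
{
    return __builtin_add_overflow(a, b, sum);
}

/* Portable version: the checks must run *before* the addition,
   because the overflowing addition itself would already be UB. */
static bool add_overflows_portable(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return true;
    *sum = a + b;
    return false;
}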
Unsigned overflow is defined but that doesn't necessarily make it any more expected when it happens. It can still ruin your day, and in fact it happens much more often than signed overflow, because it can happen with seemingly innocent subtraction of small values. After personally fixing several unsigned overflow bugs in the past few months I'm going to have to side with the Google style guide on this one.
If you underflow an unsigned you get a huge value that's going to be invalid by any range check unless you assume entire unsigned range is available.
But if you use signed, you have to check for values under zero AND the top of the range. So I don't understand how the bugs were caused by unsigned overflow. Wouldn't they still be bugs with negative results in signed numbers?
Mostly the errors are expecting that a calculation is using signed arithmetic when it's not, and getting a large positive value where you expected a small negative value. It's easy to forget the signedness of a variable, but the biggest problem is probably C's unfortunate tendency to silently convert signed to unsigned when at least one unsigned value is involved in a calculation. This causes unsigned arithmetic to contaminate parts of your code that you thought were signed.
I think I have a handful of negative numbers in hundreds of thousands of lines of C code. It's quite rare, IME.
By the way, an underflowed/large unsigned number used when a signed number is expected will, in most contexts (e.g. when adding it as an offset), behave correctly. It will fail when you try to compare it using < or >. If you enable gcc warnings (which you ought to do for every conceivably useful warning :-) ), that will be caught.
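A small demonstration of both halves of that claim (values invented):

#include <stdio.h>

int main(void)
{
    size_t i = 5;
    size_t delta = (size_t)-2;         /* the wrapped representation of -2 */

    /* Addition "just works": modular arithmetic brings it back. */
    printf("%zu\n", i + delta);        /* prints 3 */

    /* Comparison does not: */
    printf("%d\n", delta < (size_t)1); /* prints 0, though "-2 < 1" was meant */
    return 0;
}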
Despite the fact that signed overflow/underflow is undefined behavior I'm pretty sure that many more bugs have resulted from unsigned underflow than signed overflow or underflow. When working with integers you're usually working with numbers relatively close to zero so it's very easy to unintentionally cross that 0 barrier. With signed integers it is much more difficult to reach those overflow/underflow limits.
But you have to balance the probability of a problem with its severity. I do a lot of work in C with pointers and arrays, and there are many situations where I select unsigned ints because, although problems are more likely, they are less likely to cause a memory read or write error (for my application, though not necessarily for others). I also often use unsigned when I am doing math that includes logs, because I am less likely to cause a math error.
I would like to be clear that I agree with your assessment for most situations, with some exceptions.
There are a lot of corner cases involved here. The corner cases for signed integers involve undefined behavior, the corner cases for unsigned integers involve overflow. Any time you mix them you get unsigned integers, which can give you big surprises.
For example:
if (x < y * z) { ... }
If z is unsigned, then both x and y will get converted to unsigned as well. This can cause surprises when you expected the LHS to be negative, but it's not, because the right side is unsigned, and that contaminates the left side.
This particular case gets caught by -Wall, but there are plenty of cases where unintended unsigned contagion isn't caught by the compiler. Of course, if you make x long, then y * z will be unsigned, then widened to long, which gives you different results if you are on 32-bit or if you are on Windows. Using signed integers everywhere reduces the cognitive load here, although if you are paranoid, you need to do overflow checking, which is going to be a bear, and you might want to switch to a language with bigints or checked arithmetic.
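Concretely, with made-up values, the surprise looks like this:

#include <stdio.h>

int main(void)
{
    int x = -1;
    unsigned y = 2, z = 3;

    /* x is converted to unsigned (UINT_MAX) for the comparison, so the
       test is false, even though "-1 < 6" is what was meant. */
    if (x < y * z)
        printf("expected\n");
    else
        printf("surprise\n");   /* this branch runs */
    return 0;
}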
As another point, in the following statement, the compiler is allowed to assume that the loop terminates:
for (i = 0; i <= N; i++) { ... }
Yes, even if N is INT_MAX. The way I think of it, your use of "signed" means that you are communicating (better yet, promising) to the compiler that you believe overflow will not occur. In these cases, the compiler will usually "do the right thing" when optimizing loops like this, where it sometimes can't do that optimization for unsigned loop variables.
So I'm going to disagree as a point of style. Signed arithmetic avoids unintended consequences for comparisons and arithmetic, and enables better loop optimizations by the compiler. In my experience, this is usually correct, and it is rare that I actually want numbers to overflow and do something with them afterwards.
All said, if you don't quite get the subtleties of arithmetic in C (yes, it is subtle) then your C code is fairly likely to have errors, and no style guide is going to be a panacea.
In fact, prior to Java 8, you could not even declare unsigned integers.
In Java 8 you still can't declare unsigned integers. It's just that the API has been extended to provide operations that treat an int or a long as having an unsigned value.
This seems to imply that Java leaves the behaviour of signed integer overflow up to the underlying hardware, but guarantees a two's complement representation even if the architecture does not. It seems reasonable to expect that it would simply wrap to 0 (I wasn't able to find a more conclusive reference for this).
In fact the Java Specification does specify what happens in case of overflow very precisely. For example, for multiplication (section 15.17.1):
> If an integer multiplication overflows, then the result is the low-order bits of the mathematical product as represented in some sufficiently large two's-complement format.
And for addition (section 15.18.2):
> If an integer addition overflows, then the result is the low-order bits of the mathematical sum as represented in some sufficiently large two's-complement format.
The Java approach, without undefined behaviour, would seem to be quite convenient for a programmer, but that does not seem to be the case in practice. Of course you gain the ability to check for overflow after performing the operation, which is nice, but at the same time you lose the straightforward way to detect bugs provided by undefined behaviour (UB).
If an application executes UB, then it is obviously wrong, thus a simple instrumentation of arithmetic operations followed by a check for overflow gives a way to detect such problems without false positives (see for example -fsanitize=undefined and similar options available in modern C/C++ compilers).
IMHO only a very small fraction of overflow bugs would have been fixed by just wrapping the result around.
> Of course you gain ability to check for overflow after performing the operation
Using exceptions?
> IMHO only very small fraction of overflow bugs would have been fixed by just wrapping result around.
Yes, switching to wrapping integers is not the solution to overflow. Rather, check the operation beforehand. I still think using unsigned integers is always preferred if you don't need negative values.
What I had in mind is the fact that with defined overflow you could unconditionally perform the operation, and then observe the result to decide whether overflow has in fact happened. For example, the following pattern to check whether adding 100 to an integer 'a' overflows would be valid:
int a = ...;
if (a + 100 < a) overflow
So this is one of the cases where wrapping behaviour could potentially fix bugs in existing applications. Obviously, rewriting this to handle overflow without UB is trivial if you are already aware of the undefined behaviour that could occur.
Suppose somebody has written the above code. Under the current C++ standard it would invoke UB on overflow. But if the standard were to change and require wrapping behaviour for int, then it would be "fixed", i.e., it would do what the programmer intended. Similarly if you use a compiler-specific option to ensure wrapping behaviour for signed integers, such as GCC's -fwrapv.
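For reference, a sketch of the UB-free rewrite of that check (test before adding, so the overflowing addition is never performed):

#include <limits.h>
#include <stdbool.h>

/* True if a + 100 would overflow; the addition itself is never done. */
static bool would_overflow_add_100(int a)
{
    return a > INT_MAX - 100;
}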
While the signed vs unsigned int question doesn't concern me much[1], I really appreciate this post because I discovered the One Page CPU. The author nerd-sniped me three paragraphs in. Now I want to try to implement this on an FPGA. I think my next few weekends will be taken up with getting a real world implementation working and getting some examples working. Thanks.
[1] If I was designing a system that could only have one type of int (but why?) I'd use unsigned ints and if I needed to represent negative numbers, I'd use two's complement which is fairly well behaved, much as the author points out. This is fairly common in the embedded world.
That seems like a false sense of safety: when doing arithmetic, your program is most likely not going to be handling overflows correctly anyway. Also, when using unsigned ints, any code involving subtractions can lead to subtle bugs that only occur when one value is larger than another. I'd recommend sticking with signed ints for all arithmetic code.
It might be pretty safe if you're only ever adding, multiplying or dividing with other unsigned numbers. Once you start doing subtraction it could get ugly.
Use signed because pointer differences in C are signed:
char *a, *b;
size_t foo = a - b; /* generates a warning with -Wconversion */
This tells me that C's native integer type is ptrdiff_t, not size_t. (And I agree this is crazy: unsigned modular arithmetic would be fine, but they chose a signed result for pointer subtraction).
Why care about this? You should try to get a clean compile with -Wconversion, but you should also avoid adding casts all over the place (they hide potential problems). It's cleaner to wrap all instances of size_t with ptrdiff_t; you can even check for signed overflow in these cases if you are worried about it.
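One way to do that wrapping at the boundary (a sketch, not a prescribed API):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a size_t to the signed ptrdiff_t, checking that it fits. */
static ptrdiff_t to_ssize(size_t n)
{
    assert(n <= PTRDIFF_MAX);
    return (ptrdiff_t)n;
}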
There is another reason: loop index code written to work properly for unsigned will work for signed, but the reverse is not true. This means you have to think about every loop if you intend to do some kind of global conversion to unsigned.
> g << h: Well defined because 2147483648 can be represented in a 32 bit unsigned int
Even under those assumptions, it is implementation defined if unsigned int can hold the value 2^31. It is perfectly valid to have an UINT_MAX value of 2^31-1. In that case the code will cause undefined behavior.
The only guarantee for unsigned int is that its UINT_MAX value is at least 2^16-1, regardless of its bit size, and that it has at least as many value bits as a signed int.
Because properly handling overflow if you're using signed values involves UB, it's super hard to get it right. Even Apple has struggled with it: http://www.slashslash.info/2014/02/undefined-behavior-and-ap...
If you get it wrong you can also see your program's behavior change with the compiler optimization level, which is incredibly frustrating (and causes a lot of "the compiler is buggy!" beliefs).
For what it's worth, all the bugs I've seen related to integer representation (not just over/under, e.g. I've seen code casting a 32bit pointer to a 64bit signed integer) could have been fixed by changing int to unsigned and never the opposite. Of course, this could be just the effect of the vast majority of programmers choosing int as the default integer type.
I've seen people 10 years my senior cast a 32-bit pointer to an int32_t. It's funny because it's not difficult to get right, but nobody ever seems to bother.
Use the most appropriate type. If you don't know what that is, use `int`. My general rule of thumb is "use unsigned integers when interfacing with hardware, otherwise use int.". Sometimes you need int64_t. You usually don't.
The problem with unsigned integers is that most people, past me included, think that it prevents nonsense negative values. It doesn't. The compiler will quite gladly let you call a `void func(unsigned x)` with -1 and let you have "fun" later debugging why, how, and by whom this function got called with an enormous value.
And, of course, that's not the only source of negative numbers that get passed in by mistake. More often it's the result of a subtraction that can "never" be negative.
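A small illustration of both sources of surprise (names and values invented):

#include <stdio.h>

static void func(unsigned x)
{
    printf("%u\n", x);
}

int main(void)
{
    func(-1);                 /* compiles without complaint by default;
                                 typically prints 4294967295 */

    size_t used = 10, capacity = 8;
    func(capacity - used);    /* the "can never be negative" subtraction:
                                 wraps to a huge value */
    return 0;
}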
All of the mathematical operations are defined on unsigned int in C. This is not true of int.
Compilers are now starting to do all manner of nasty optimizations on undefined behavior. It is now only a matter of time before a signed int burns you.
No matter if you use signed or unsigned types, be sure you handle both underflow and overflow. From my experience, signed integers' main usage is integer arithmetic (e.g. accounting for things that can be negative or positive), pointer arithmetic, error codes, or multiplexing two different kinds of elements on the same variable (so a negative value has one meaning and a positive value a different one). Unsigned is for handling countable elements, with 0 as the minimum. The good point of using unsigned integers is that you have twice the available codes, because of the extra bit.
Example of counting backwards using unsigned types (underflow check: -1 is 111...111 in binary in 2's complement representation, so (size_t)-1 is equivalent to the biggest unsigned value):
size_t i = ss - 1, j = sso;
for (; i != (size_t)-1; i--) {
    switch (s[i]) {
    case '"':  j -= 6; s_memcpy6(o + j, "&quot;"); continue;
    case '&':  j -= 5; s_memcpy5(o + j, "&amp;");  continue;
    case '\'': j -= 6; s_memcpy6(o + j, "&apos;"); continue;
    case '<':  j -= 4; s_memcpy4(o + j, "&lt;");   continue;
    case '>':  j -= 4; s_memcpy4(o + j, "&gt;");   continue;
    default:   o[--j] = s[i]; continue;
    }
}
Example of computing a percentage on unsigned integers (the same idea can be applied to signed) without losing precision or requiring a bigger data container (overflow checks):
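The example itself doesn't appear above, so here is a sketch of one common way to do it (my illustration, not the original commenter's code), assuming total is nonzero and no larger than UINT32_MAX / 100, and that the resulting percentage itself fits in 32 bits:

#include <stdint.h>

/* Compute part * 100 / total without widening to 64 bits: split part
   into whole multiples of total plus a remainder so the intermediate
   products stay small, even when the naive part * 100 would overflow. */
static uint32_t percent_u32(uint32_t part, uint32_t total)
{
    uint32_t q = part / total;
    uint32_t r = part % total;            /* r < total <= UINT32_MAX / 100 */
    return q * 100u + (r * 100u) / total; /* neither product can overflow
                                             under the stated assumptions */
}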