Not a question really, but a comment for anyone involved: please push to add integer overflow traps. No processor adds support because C doesn't require them, and as a result no language detects integer overflow by default because the processors make it slow. We need to break this cycle and it's not often that a new processor architecture comes around.
This has been discussed before in the RISC-V community. See https://lists.riscv.org/lists/arc/hw-dev/2014-09/msg00007.ht... (sorry, you'll probably have to click the "I'm not a spammer link" first time and then load the link again - the riscv mailing lists seem to use the most painful mailing list software and Gmane wasn't archiving the riscv lists at that time).
One reason for push-back on implicit overflow checking is that it complicates superscalar designs by adding another exception source. The good news is that with an open ISA like RISC-V and high-quality reference implementations, we can finally perform meaningful experiments to test these assumptions: adding different overflow-checking semantics to a realistic implementation, quantifying the difference when putting it through the ASIC flow for a real process, and making the matching changes to the compiler. It seems ridiculous that we in the computer architecture community haven't had this ability before.
There is no chicken and egg problem for most workloads. Processors are quite good at handling correctly predicted branches, and overflow checks will be correctly predicted for basically all reasonable code. In the case where the branch is incorrectly predicted (because of an overflow), you likely don't care about performance anyway.
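For concreteness, here's a minimal Rust sketch of what such a check looks like in software today (the function name and the use of `checked_add` are just for illustration): the hot path is an add plus a conditional branch that is essentially never taken, so the predictor handles it well.

    // Sums a slice with an explicit overflow check on every addition.
    // The check compiles to an add plus a branch on the overflow result;
    // for non-overflowing data the branch is never taken and therefore
    // correctly predicted.
    fn checked_sum(xs: &[i32]) -> i32 {
        let mut acc: i32 = 0;
        for &x in xs {
            // checked_add returns None on overflow; the slow path only runs
            // in the case where we presumably no longer care about speed.
            acc = acc.checked_add(x).expect("integer overflow in checked_sum");
        }
        acc
    }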
See http://danluu.com/integer-overflow/ for a quick and dirty benchmark (which shows a penalty of less than 1% for a randomly selected integer heavy workload, when using proper compiler support -- unfortunately, most people implement this incorrectly), or try it yourself.
People often overestimate the cost of overflow checking by running a microbenchmark that consists of a loop over some additions. You'll see a noticeable slowdown in that case, but it turns out there aren't many real workloads that closely resemble doing nothing but looping over addition, and the workloads with similar characteristics are mostly in code where people don't care about security anyway.
People who are actually implementing new languages disagree. Look at the hoops Rust is jumping through (partially) because they don't feel comfortable with the performance penalty of default integer overflow checks: https://github.com/rust-lang/rfcs/pull/146
TL;DR: There exists a compiler flag that controls whether or not arithmetic operations are dynamically checked, and if this flag is present then overflow will result in a panic. This flag is typically present in "debug mode" binaries and typically absent in "release mode" binaries. In the absence of this flag overflow is defined to wrap (there exist types that are guaranteed to wrap regardless of whether this compiler flag is set), and the language spec reserves the right to make arithmetic operations unconditionally checked in the future if the performance cost can be ameliorated.
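A hedged sketch of those semantics using today's standard-library names (which postdate parts of that RFC discussion):

    use std::num::Wrapping;

    fn main() {
        let x: u8 = 255;

        // Plain arithmetic: panics when the overflow-check flag is on
        // (typically debug builds) and wraps when it is off (typically release).
        // let y = x + 1;

        // Explicitly wrapping types/operations behave the same under either setting.
        assert_eq!(x.wrapping_add(1), 0);
        assert_eq!(Wrapping(x) + Wrapping(1), Wrapping(0));

        // Explicitly checked operations are likewise always available.
        assert_eq!(x.checked_add(1), None);
    }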
Yeah, I think Rust has probably made the right decision here, but it's frustratingly imperfect. This introduces extra divergence in behavior between debug and release mode, which is never good.
Note that there's even pushback in this thread about enabling overflow checks in debug mode due to performance concerns...
I'm hopeful that as an industry we're making baby steps forward. Rust clearly wants to use checked arithmetic in the future; Swift uses checked arithmetic by default; C++ should have better support for checked arithmetic in the next language revision. All of these languages make heavy use of LLVM so at the very least we should see effort on behalf of the backend to reduce the cost of checked arithmetic in the future, which should hopefully provide additional momentum even in the potential absence of dedicated hardware support.
If you read the thread, you'll see that the person who actually benchmarked things agrees: someone implemented integer overflow checks and found that the performance penalty was low, except for microbenchmarks.
If you click through to the RISC-V mailing list linked to elsewhere in this discussion, you'll see that the C++17 standard library is planning on doing checked integer operations by default. If that's not a "performance focused language", I don't know what is.
> the C++17 standard library is planning on doing checked
> integer operations by default
In C++, wrapping due to overflow can trivially cause memory-unsafe behavior, so it's a pragmatic decision to trade off runtime performance for improved security. However, Rust already has enough safety mechanisms in place that integer overflow isn't a memory safety concern, so the tradeoff is less clear-cut.
Note that the Rust developers want arithmetic to be checked, they're just waiting for hardware to catch up to their liking. The Rust "specification" at the moment reserves the right to dynamically check for overflow in lieu of wrapping (Rust has long since provided types that are guaranteed to wrap for those occasions where you need that behavior).
> someone implemented integer overflow checks and found
> that the performance penalty was low, except for
> microbenchmarks.
I was part of that conversation back then, and the results that I saw showed the opposite: the overhead was only something like 1% in microbenchmarks, but around 10% in larger programs. (I don't have a link on hand, you'll have to take this as hearsay for the moment.)
The benchmark I see says up to 5% in non-microbenchmarks. A 5% performance penalty is not low enough to be acceptable as the default for a performance-focused language. If you could make your processor 5% faster with a simple change, why wouldn't you do it?
Even if the performance penalty were nonexistent in reality, the fact is that people are making decisions which are bad for security because they perceive a performance problem, and adding integer overflow traps would fix that.
As someone who's spent the majority of their working life designing CPUs (and the rest designing hardware accelerators for applications where CPUs and GPUs aren't fast enough), I find that when people say something like "If you could make your processor 5% faster with a simple change, why wouldn't you do it?", what's really meant is "if, on certain 90%-ile or 99%-ile best case real-world workloads, you could get a 5% performance improvement for a significant expenditure of effort and your choice of a legacy penalty in the ISA for eternity or a fragmented ISA, why wouldn't you do it?"
And the answer is that there's a tradeoff. All of the no-brainer tradeoffs were picked clean decades ago, so all we're left with are the ones that aren't obvious wins. In general, if you look at a field and wonder why almost no one has done this super obvious thing for decades, maybe consider that it might not be so obvious after all. As zurn mentioned, there are actually a lot of places where you could get 5% and it doesn't seem worth it. I've worked at two software companies that are large enough to politely ask Intel for new features and instructions; checked overflow isn't even in the top 10 list of priorities, and possibly not even in the top 100.
In the thread you linked to, the penalty is observed to be between 1% and 5%, and even on integer heavy workloads, the penalty can be less than 1%, as demonstrated by the benchmark linked to above. Somehow, this has resulted in the question "If you could make your processor 5% faster ...". But you're not making your processor 5% faster across the board! That's a completely different question, even if you totally ignore the cost of adding the check, which you are.
To turn the question around, if people aren't willing to pay between 0% and 5% for the extra security provided, why should hardware manufacturers implement the feature? When I look at most code, there's not just a 5% penalty, but a one-to-two order of magnitude penalty over what could be done in the limit with proper optimization. People pay those penalties all the time because they think it's worth the tradeoff. And here, we're talking about a penalty that might be 1% or 2% on average (keep in mind that many workloads aren't integer heavy) that you don't think is worth paying. What makes you think that people who don't care enough about security to pay that kind of performance penalty would pay extra for a microprocessor that has this fancy feature you want?
> people aren't willing to pay between 0% and 5% for the extra security provided
This is not true. One problem is that language implementations are imperfect and may have much higher overhead than necessary. An even bigger problem is that defaults matter. Most users of a language don't consider integer overflow at all; they trust the language designers to make the default decision for them. I believe that most people would certainly choose overflow checks if they had a perfect implementation available, had perfect knowledge of the security and reliability implications (i.e. knowledge of all the future bugs that would result from overflow in their code), and carefully considered and weighed all the options. But they don't even think about it. And they shouldn't have to!
For a language designer, considerations are different. Default integer overflow checks will hurt their benchmark scores (especially early in development when these things are set in stone while the implementation is still unoptimized), and benchmarks influence language adoption. So they choose the fast way. Similarly with hardware designers like you. Everyone is locally making decisions which are good for them, but the overall outcome is bad.
> if people aren't willing to pay between 0% and 5% for
> the extra security provided
In the context of Rust, integer overflow checks provide much less utility because Rust already has to perform static and dynamic checks to ensure that integers are used properly, regardless of whether they've ever overflowed (e.g. indexing into an array is a checked operation in Rust). So as you say, there's a tradeoff. :) And as I say elsewhere in here, the Rust devs are eagerly waiting for checked overflow in hardware to prove itself so that they can make it the default and do away with the current compromise solution (which is checked ops in debug builds, unchecked ops in release builds).
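To make the "indexing is checked" point concrete, a small sketch (the function names are mine):

    // Indexing is bounds-checked at runtime in safe Rust, so an overflowed
    // index can produce a panic or a wrong answer, but not an out-of-bounds read.
    fn nth(v: &[u32], i: usize) -> u32 {
        v[i] // panics if i >= v.len()
    }

    // The non-panicking form surfaces the same check in the return type instead.
    fn nth_opt(v: &[u32], i: usize) -> Option<u32> {
        v.get(i).copied()
    }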
There are areas where you could make a typical current processor "up to 5%" faster in exchange for dumping various determinism features provided in hardware that are conducive to software robustness in the same way as checked arithmetic. For example, the Alpha had imprecise exceptions and weak memory ordering. The consensus seems to be against this kind of tradeoff.
This RFC was the result of a long discussion that took place in many forums over the course of several years, so it's tricky to summarize. Here's my attempt:
1. Memory safety is Rust's number one priority, and if this were a memory safety concern then Rust's hands would be tied and it would be forced to use checked arithmetic just as it is forced to use checked indexing. However, due to a combination of all of Rust's other safety mechanisms, integer overflow can't result in memory unsafety (because if it could, then that would mean that there exists some integer value that can be used directly to cause memory unsafety, and that would be considered a bug that needs to be fixed anyway).
2. However, integer overflow is still obviously a significant cause of semantic errors, so checked ops are desirable due to helping assure the correctness of your programs. All else equal, having checked ops by default would be a good idea.
3. However however, performance is Rust's next highest priority after safety, and the results of using checked operations by default are maddeningly inconclusive. For some workloads they are no more than timing noise; for other workloads they can effectively halve performance due to causing cascading optimization failures in the backend. Accusations of faulty methodology are thrown around and the phrase "unrepresentative workload" has its day in the sun.
4. So ultimately a compromise is required, a new knob to fiddle with, as is so often the case with systems programming languages where there's nobody left to pass the buck to (and you at last empathize with how C++ got to be the way it is today). And there's a million different ways to design the knob (check only within this scope, check only when using this operator, check only when using this type, check only when using this compiler flag). In Rust's case, it already had a feature called "debug assertions" which are special assertions that can be toggled on and off with a compiler flag (and typically only enabled while debugging), so in lieu of adding any new features to the language it simply made arithmetic ops use debug assertions to check for overflow.
So in today's Rust, if you compile using Cargo, by default you will build a "debug" binary which enables checked arithmetic. If you pass Cargo the `--release` flag, in addition to turning on optimizations it will disable debug assertions and hence disable checked arithmetic. (Though as I say repeatedly elsewhere, Rust reserves the right to make arithmetic unconditionally checked in the future if someone can convincingly prove that their performance impact is small enough to tolerate.)
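A quick illustration of that split; the helper function is just to keep the overflow from being folded away at compile time:

    fn bump(x: u8) -> u8 {
        x + 1
    }

    fn main() {
        // `cargo run`           -> panics on overflow (debug assertions enable the check)
        // `cargo run --release` -> prints 0 (check compiled out, the value wraps)
        println!("{}", bump(255));
    }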
There isn't as strong a need for ASan in Rust because so little code is unsafe. Most of the time, the only reason you drop down to unsafe code is because you're trying to do something compilers are bad at tracking (or that is a pain in the neck to encode to a compiler). It's usually quite well-contained, as well.
You can work with uninitialized memory, allocate and free memory, and index into arrays in Safe Rust without concern already (with everything but indexing statically validated).
IMHO the kind of stuff `unsafe` is used for is very conducive to aggressive automated testing.
I don't know: both compilers are trying to use undefined behaviour to optimize code (and too bad for you if this creates problems for your application), so your explanation isn't coherent.
My explanation is: lack of interest/money. Security is the thing that is always ignored; see the lack of funding of OpenSSL until recently.
Has or had? I think that some MIPS variant deprecated the trap on overflow. If that's the case then gcc & clang's behaviour is logical; if not, it's incomprehensible (especially since at the same time they justify f..ing up your executable for 'optimisation' purposes if you have undefined behaviour in your code).
"If you try to insert a number into an integer constant or variable that cannot hold that value, by default Swift reports an error rather than allowing an invalid value to be created. This behavior gives extra safety when you work with numbers that are too large or too small."
Also consider instructions for efficient atomic reference counting, with traps on both inc (overflow) and dec.
In particular, they can have weaker ordering semantics and they can be buffered and elided among themselves (obviously with some sort of inter-core snooping).
And possibly support for "tagged" numbers, e.g. add integers if high bit is not set, call function otherwise, same for floats if not NaN, with a predictor for them.
> Also consider instructions for efficient atomic reference counting, with traps on both inc (overflow) and dec.
Atomic reference counting is slow and gets even worse the more CPU cores and especially CPU sockets you have. If you can afford an operation as expensive as an atomic add, you can definitely afford to add overflow checks. An atomic add is 50-1000+ clock cycles depending on contention, core/socket count and "moon phase"; the latency is somewhat unpredictable.
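For what it's worth, this is roughly the pattern Rust's `Arc` already uses in software; the threshold and abort-on-overflow policy below are illustrative, not a spec:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // A checked atomic refcount increment. The comparison against a sanity
    // limit costs a handful of cycles; the atomic read-modify-write dominates.
    fn retain(count: &AtomicUsize) {
        let old = count.fetch_add(1, Ordering::Relaxed);
        if old > (isize::MAX as usize) {
            // Overflow (or a runaway leak): abort rather than wrap the count
            // and risk a premature free later.
            std::process::abort();
        }
    }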
> In particular, they can have weaker ordering semantics and they can be buffered and elided among themselves (obviously with some sort of inter-core snooping).
I'm not sure how weak ordering semantics and fetch-and-add (atomic add) could mix. Aren't atomics about strong ordering by definition? Maybe there's something I don't understand.
> And possibly support for "tagged" numbers, e.g. add integers if high bit is not set, call function otherwise, same for floats if not NaN, with a predictor for them.
You'd still get a branch mispredict, which I guess is what you're trying to avoid, so there'd be no performance improvement.
If you want to get better performance out of dynamic language implementations that use NaN-tagging, you'll likely get better performance by adding one instruction that performs an indirect 64-bit load using 52-bit or 51-bit NaN-tagged addresses. The instruction should probably contain an immediate value for a PC-relative branch if the value isn't a properly formatted NaN-tagged address.
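For readers unfamiliar with the technique, here is a rough sketch of the software NaN-tagging that such an instruction would accelerate; the constants and 48-bit layout are illustrative, not any particular VM's format:

    // Boxed values live in the payload bits of a quiet NaN; any other bit
    // pattern is an ordinary f64. (A real VM also has to canonicalize genuine
    // NaNs so they can't be mistaken for boxed payloads.)
    const QNAN: u64 = 0x7ff8_0000_0000_0000;
    const PAYLOAD_MASK: u64 = 0x0000_ffff_ffff_ffff; // 48-bit payload

    fn box_payload(addr: u64) -> u64 {
        debug_assert_eq!(addr & !PAYLOAD_MASK, 0, "payload must fit in 48 bits");
        QNAN | addr
    }

    fn unbox(bits: u64) -> Result<u64, f64> {
        // This mask-and-compare plus branch runs before every tagged access
        // today; the proposed instruction would fold it into the load itself.
        if bits & QNAN == QNAN {
            Ok(bits & PAYLOAD_MASK)
        } else {
            Err(f64::from_bits(bits))
        }
    }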
All languages would benefit from instructions to more efficiently support tracing of native code. A pair of special purpose registers (trace stack and trace limit registers) to push all indirect and conditional branch and call targets would really speed up tracing of native code a la HP's Project Dynamo. Presumably upon trace stack overflow the processor would trap to the kernel or call to userspace interrupt vector entry.
A small pseudorandom number generator and another pair of special purpose registers (stack and limit register) for probabilistically sampling the PC would make profiling lighter weight, both for purposes of human analysis of code and also for runtime optimization in JITs or HP Dynamo-like native code re-optimization.
Tagging is done to carry information about the data type, e.g. to mark that a float64 is actually a 32-bit integer.
Traps (CPU exceptions, such as traditional FPU exceptions like division by zero) usually involve kernel mode context switch. So if you trap on tag, the performance for tagged values will probably be 3-5 orders of magnitude slower. That's a lot.
> Traps (CPU exceptions, such as traditional FPU exceptions like division by zero) usually involve kernel mode context switch.
Could you explain why?
I thought that trapping was more like a 'slow branch': slow due to flushing the pipeline, but why should the kernel be involved(1)?
1: except if you need to swap in a page, but that's just like any other memory reference.
When we're talking about x86, that's true in ring 0. Otherwise, the first thing the CPU does is enter privileged ring 0 mode, save registers, jump through the interrupt vector table and process the trap in kernel code. The trap handler will probably need to check the usermode program counter and take a look at the instruction that caused the trap. No hard data, but I think we're talking about 1-5 microseconds.
Runtime/language exceptions have different mechanisms that don't require kernel context switches (but might involve slow steps like stack walk).
Lots of languages transparently promote to bignums of various sorts when integer overflow occurs, clang and gcc expose builtins for checking overflow, etc. The only part missing here is the implementation of traps.
http://blog.regehr.org/archives/1154