A Guide to Undefined Behavior in C and C++ (2010)

jbandela1 · on Aug 17, 2023

If you want some nice examples of how undefined behavior results in weirdness, see https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

An interesting example from there is how the compiler can turn

    int table[4];
    bool exists_in_table(int v)
    {
        for (int i = 0; i <= 4; i++) {
            if (table[i] == v) return true;
        }
        return false;
    }

Into:

    bool exists_in_table(int v)
    {
        return true;
    }

JonChesterfield · on Aug 17, 2023

The amazing part about examples like that is people read them, check that the compiler really does work on that basis, and then continue writing things in C++ anyway. Wild.

Suppose I should expand on this. The idea seems to be either 1/disbelief - compilers wouldn't really do this or 2/ infallibility - my code contains no UB.

Neither of those positions bears up well under reality. Programming C++ is working with an adversary that will make your code faster wherever it can, regardless of whether you like the resulting behaviour of the binary.

I suspect rust has inherited this perspective in the compiler and guards against it with more aggressive semantic checks in the front end.

peppermint_gum · on Aug 17, 2023

>The amazing part about examples like that is people read them, check that the compiler really does work on that basis, and then continue writing things in C++ anyway. Wild.

Well, in modern C++ this code would look like this:

    std::array<int, 4> table;
    bool exists_in_table(int v)
    {
        for (auto &elem : table) {
            if (elem == v) return true;
        }
        return false;
    }

Or even simpler:

    std::array<int, 4> table;
    bool exists_in_table(int v)
    {
        return std::ranges::contains(table, v);
    }

There's no shortage of footguns in C++, but nonetheless, modern C++ is safer than C.

mike_hock · on Aug 17, 2023

Weirdly, GCC fails to optimize this, but Clang does (if you make the table static as in the original example).

gizmo686 · on Aug 17, 2023

I actually would prefer to get the second output. The result is wrong, but consistantly and deterministically so. The naive implementation of the broken code is a heisenbug. Sometimes it will work, and sometimes it won't, and any attempt to debug it would likely perterb the system enough to make the issue not surface.

It wouldn't suprise me if I have run into the latter situation without relizing it. When I got the the problem, I would have just (incorrectly) assumed that the memory right after the array happened to have the relevent value. I would be counting my blessings that it happened consistantly enough to be debuggable.

jll29 · on Aug 17, 2023

I agree that it is better to get deterministic and predictable behavior.

Reminds me of when for a while, I worked on HP 9000s under HP-UX and in parallel on an Intel 80486-based Linux box, and what I noticed is that the Unix workstations crashed sooner and more predictably with segmentation faults than Linux on the PC (not sure if this has changed since the early 1990s - probably had to do with the MMU); so developing on HP under Unix and then finally compiling under Linux led to better code quality.

jandrewrogers · on Aug 17, 2023

> The amazing part about examples like that is people read them, check that the compiler really does work on that basis, and then continue writing things in C++ anyway.

That isn't idiomatic C++ and hasn't been for a long time. Sure, it's possible to do it retro C-style, because backward compatibility, but you generally don't see that in a modern code base.

JonChesterfield · on Aug 17, 2023

The modern codebase has grown from a legacy one. The legacy one with parts of the codebase that were C, then got partially turned into object oriented C++, then partially turned into template abstractions. The parts least likely to have comprehensive test coverage. That place is indeed where a compiler upgrade is most likely to change the behaviour of your application.

jcelerier · on Aug 19, 2023

every day new greenfield projects start in C++ - nowadays, 20, 23...

uecker · on Aug 18, 2023

There are plenty of similar things in C++. I do not think C++ is safer than C. std::vector does not do bounds by default checking last time I checked.

Arech · on Aug 19, 2023

And thank G-d it doesn't do it!

TwentyPosts · on Aug 20, 2023

Remind me, how is this a good thing again? Especially considering that (if you write modern C++) the compiler should optimize away bound checks most of the time (and in all critical places) either way.

Arech · on Aug 20, 2023

This is excellent thing and here is why:

> the compiler should optimize away bound checks most of the time (and in all critical places)

Unfortunately, this is true much rarer than you might think. In order to optimize it away, a compiler has to prove that the bounds check result is always true and that happens surprisingly not all the time to say the least. When it can't optimize it away, bounds checking will slow down the code significantly, especially in tight loops. And that slowdown will be very hard to debug unless you know exactly where to look for, - you'll basically assume that "that's how it works at the highest speed and it can't be improved (a.k.a. "buy a better hardware!")

Second, with a proper programming hygiene, in many cases bounds checking are just redundant. There are at least 2 methods for direct iteration over a vector, that doesn't require it: ranged `for (auto& e: vector){}` and by utilizing iterators. There are also `<algorithms>` library with implementation of many useful container iteration functions that require you at most to only specify a functor that do some operation on vector element.

And third - if you think that you really must have bounds checking, it's about as trivial to implement as:

  template<typename T, typename A>
  class vectorbc : public ::std::vector<T,A>{
    public:
    using std::vector::vector;

    T& operator[](std::size_t idx){
         return this->at(idx);
    }
  }

6345dhjdsf · on Aug 19, 2023

Genuninely curious, why do you say g_d not god?

_eojb · on Aug 17, 2023

It's just as "amazing" to read these takes from techno purists. You use software written in C++ daily, and it can be a pragmatic choice regardless of your sensibilities.

erik_seaberg · on Aug 17, 2023

And we have the core dumps to prove it.

When any Costco sells a desktop ten thousand times faster than the one I started on, we can afford runtime sanity checks. We don’t have to keep living like this, with stacks that randomly explode.

_eojb · on Aug 18, 2023

I don't know what line of work you're in, but I use a desktop orders of magnitude faster than my first computer also, and image processing, compilation, rendering, and plenty of other tasks aren't suddenly thousands of times faster. Not to mention that memory safety is just one type of failure in a cornucopia of potential logical bugs. In addition, I like core dumps because the failure is immediate, obvious, and fatal. Finally, stacks don't "randomly explode." You can overflow a stack in other languages also, I really just don't see what you're getting at.

dureuill · on Aug 18, 2023

> Not to mention that memory safety is just one type of failure in a cornucopia of potential logical bugs.

You can die of multiple illnesses so there's no point in developing treatment for any particular one of them.

> I like core dumps because the failure is immediate, obvious, and fatal.

Core dumps provide a terrible debugging experience, as the failure root cause is often disjoint from the dump itself. Not to mention that core dumps are but one outcome of memory errors, with other funnier outcomes such as data corruption and exploitable vulnerabilities as likely.

Lastly, memory safe languages throw an exception or panic on out of bound access, which can be made as immediate and fatal as core dumps. And much more obvious to debug, since you can trust that the cause starts indeed at the point of failure

erik_seaberg · on Aug 18, 2023

I don’t mean a call stack, I mean “stack” in the LAMP sense—the kernel, drivers, shared libraries, datastores, applications, and display servers we try to depend on.

patrick451 · on Aug 18, 2023

I dunno, my computers seems to keep running slower and slower despite being faster and faster. I blame programmers increasingly using languages with more and more guardrails which are slower. I'd rather have a few core dumps and my fast computer back.

JonChesterfield · on Aug 17, 2023

Definitely. There's loads of value delivered by C++ implementations, including implementations of C++ and other languages. The language design of speed over safety mostly imposes a cost in developer / debugging time and fear of upgrading the compiler toolchain. Occasionally it shows up in real world disasters.

I think we've got the balance wrong, partly because some engineering considerations derive directly from separate compilation. ODR no diagnostic required doesn't have to be a thing any more.

johnbellone · on Aug 17, 2023

But it isn’t Rust.

jacquesm · on Aug 17, 2023

Lots of things 'aren't Rust'. In fact almost everything isn't Rust. For now. That may change in due course but right now I would guestimate the amount of Rust code running on my daily drivers to pretty close to zero%. The bulk is C or C++.

angiosperm · on Aug 17, 2023

Hardly anything is. Literally none of the programs on my machine are coded in Rust. (Firefox is reputed to have a bit in it.)

dralley · on Aug 18, 2023

If you're running Windows 11, you have Rust running in your kernel. And also some userspace system libraries.

https://www.bleepingcomputer.com/news/microsoft/new-windows-...

Someone else posted statistics that show Firefox being 10% Rust, but I'm not sure it makes sense to include HTML and Python and JavaScript in the comparison. If you compare Rust against C/C++, it's 20%

angiosperm · on Aug 18, 2023

Who runs Windows?

There is no such language as C/C++. I presume you mean the sum of C and C++, and that you omit external libraries from the tally.

dralley · on Aug 18, 2023

> I presume you mean the sum of C and C++

"I'd just like to interject for a moment. What you're refering to as Linux, is in fact, GNU/Linux, or as I've recently taken to calling it, GNU plus Linux. Linux is not an operating system unto itself, but rather another free component of a fully functioning GNU system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX."

Isn't this a tad pedantic? You obviously understood what I was saying.

>and that you omit external libraries from the tally.

Mozilla vendors their dependencies. They're counted.

jacquesm · on Aug 17, 2023

About FF and Rust:

https://news.ycombinator.com/item?id=30743577

_gabe_ · on Aug 17, 2023

> check that the compiler really does work on that basis, and then continue writing things in C++ anyway. Wild.

My compiler (MSVC) doesn't do that[0]. Clang also doesn't do this[1]. It's wild to me that GCC does this optimization[2]. It's very subtle, but Raymond Chen and OP both say a compiler can create this optimization, not that it will.

[0]: https://godbolt.org/z/bdx4EMzxe

[1]: https://godbolt.org/z/z833Wa391

[2]: https://godbolt.org/z/6b8aq59M9

Gibbon1 · on Aug 17, 2023

What's amazing is programmers haven't tared and feathered the standards committee and compiler writers for allowing crap like that.

saagarjha · on Aug 18, 2023

They've tried. Language lawyers are very good at their jobs, though.

uecker · on Aug 18, 2023

Well, the argument brought up is that users want it this way, so this is existing practice which is implemented and should be standardized. So please complain and file bugs.

Also use the compiler and language feature that help, such as variably-modified types such as bare pointers, attributes, compiler flags, etc.

Gibbon1 · on Aug 18, 2023

I think at this point unfortunately regulators needs to start getting involved.

fluoridation · on Aug 17, 2023

What's odd about that example is that the optimization is only valid if the loop in fact overflows the array every time. So the compiler is proving that the array is being overflowed and rather than emitting a warning to that effect, it generates absurd code.

kllrnohj · on Aug 17, 2023

> So the compiler is proving that the array is being overflowed and rather than emitting a warning to that effect

   <source>:5:13: warning: unsafe buffer access [-Wunsafe-buffer-usage]
      5 |         if (table[i] == v) return true;
        |             ^~~~~

https://godbolt.org/z/zGxnKxvz6

This one is weirdly hard to get a compiler warning out of which is a fair critque, but so many of the "Look what the compiler did to my UB silently!" issues are not at all silent and would have been stopped dead with "-Wall -Wextra -Werror"

iainmerrick · on Aug 17, 2023

As noted elsewhere in this thread, GCC by default does the "optimization" and doesn't warn. No doubt there are other examples where Clang is the one that misbehaves.

How are we supposed to know whether our code is being compiled sensibly or not, without poring over the disassembly? Just set all the warning flags and hope for the best?

UncleMeat · on Aug 17, 2023

I think that a big problem is that for every compile that seems "not sensible" and is actually not sensible, there are 100s or 1000s of compiles that would look absolutely insane to a human but are actually exactly what you want when you sit down and think about it for a long time.

Almost all of the "don't do the overly clever stuff!" proposals would throw away a huge amount of actually productive clever stuff.

fluoridation · on Aug 17, 2023

I think what the GP means by "not sensible" is that proving that the code is broken in order to silently optimize it more aggressively is not sensible. If your theorem proven can find a class of bugs then have it emit diagnostics. Don't only use those bugs to make the code run faster. Yes, make the code run faster, but let me know I may be doing something nonsensical, since chances are that it is nonsensical and it doesn't cost anything at run time.

mike_hock · on Aug 17, 2023

A warning is only useful if it prescribes a code transformation that affirms the programmer's intent and silences the warning (unless the warning was a true positive and caught a bug). You cannot simply emit a warning every time you optimize based on UB.

There is no `if(obvious out-of-bound access) silently emit nonsense har har har` in the compiler's source code. The compiler doesn't understand intent or the program as a whole. It applies micro transformations that all make sense in isolation. And while the compiler also tries to detect erroneous programming patterns and warn about those, that's exceedingly more difficult.

fluoridation · on Aug 17, 2023

>You cannot simply emit a warning every time you optimize based on UB.

And I'm not saying it should do that. I'm saying if the compiler is able to detect erroneous code, then it should emit a warning when it does so. An out of bounds access is an example of code that is basically always erroneous.

>There is no `if(obvious out-of-bound access) silently emit nonsense har har har` in the compiler's source code. The compiler doesn't understand intent or the program as a whole. It applies micro transformations that all make sense in isolation.

Yes, I understand that. However, like I said in my first response, this optimization in particular is only valid if the array is definitely accessed incorrectly. If the compiler is able to perform this optimization, there are only two possibilities: either the compiler can determine in some cases (and in this one in particular) that an array is accessed incorrectly and doesn't warn about it; or it can't determine that condition and this optimization is caused by a compiler bug and there are cases where the compiler incorrectly performs it, breaking the code. If the former is the case, then someone wrote the code to check whether an array is always accessed correctly. Either that, or nobody wrote it and the compiler deduces from even more basic principles that arrays must always be accessed by indices less than their lengths; which, I mean, that might be the case, but I seriously doubt it.

tialaramex · on Aug 18, 2023

> if the compiler is able to detect erroneous code

Today in most cases nobody is writing this code. Neither C nor C++ have any mandate for such detection.

There is a proposal, which could perhaps make it into C++ 26, to formally specify "erroneous behaviour" and have the compiler do something particular and warn you that what you're doing is a bad idea for the specified cases†, but it's easily possible that doesn't end up in the IS, or that compiler vendors aren't interested in implementing it. Meanwhile, if it happens at all it's up to the vendor.

† "Erroneous behaviour" is one possible approach to the uninitialized locals problem in C++. Once upon a time C says local variables can be declared and used without initializing them, this actually has Undefined Behaviour, which is very surprising for C and C++ programmers who tend to imagine that they're getting the much milder Unspecified Behaviour, but they are not. Many outfits use compiler flags to say look, when I do this, and I know sometimes I'll screw up, just give me zeroes, so that's Defined Behaviour, it's not Intended Behaviour but at least it's not Undefined. This includes all major OS vendors (Microsoft, Apple, Red Hat etc.)

Some people brought this approach to WG21, but there was pushback, if uninitialized variables are zero, then they're not really uninitialized are they? This has two consequences, 1. Performance optimisations from not initializing data evaporate; and 2. It is now "correct" to use this zero initialization behaviour because it's specified by the language standard, so maybe you can't lint on it.

Erroneous Behaviour solves (2) by saying no, it's still wrong, it's just safely wrong, the compiler can report this is wrong and it must ensure your results are zero.

Another proposal offers a syntax to solve (1) by saying explicitly in your program, "No, I'm a smart C++ programmer, do not initialize these values", akin to the markers like ~~~ you may have seen to mean "Don't initialize this" in some other languages.

muldvarp · on Aug 18, 2023

> However, like I said in my first response, this optimization in particular is only valid if the array is definitely accessed incorrectly.

No. The code does not show any undefined behavior if any of the elements of `table` is equal to `v`, because then the loop is ended by an early return. The compiler certainly did not prove that this code always has undefined behavior.

UncleMeat · on Aug 17, 2023

Right and the next part is the hard part: defining this clearly. What I'm saying is that there is a surprising amount of "wait, actually I do want that" when you dig into this proposal.

fluoridation · on Aug 17, 2023

A reasonable compiler would let you turn off a specific warning for a section of code.

torstenvl · on Aug 18, 2023

They pretty much all do.

#pragma clang diagnostic push

#pragma clang diagnostic ignored "-Wwhatever"

// code

#pragma clang diagnostic pop

fluoridation · on Aug 18, 2023

I was going to comment that GCC doesn't, but it seems it was added as some point since the last time I checked. I know at one time GCC had as a policy not to allow doing that.

torstenvl · on Aug 18, 2023

Here's an example:

https://pastebin.com/raw/fH0Lj2Zb

uecker · on Aug 18, 2023

I agree there should be a warning. But it is not trivial to teach a compiler when to warn or not to not generate too many false positives.

Not as good as warning, but UBSan catches this at run-time: https://godbolt.org/z/Mdjn7h8dj

moefh · on Aug 17, 2023

> whether our code is being compiled sensibly or not

I'm failing to see what's not sensible about how that code is compiled.

The only possible way that function could return false is if you read past the end of the array and the value there happens to be different from `v`. Is it really the more sensible to rely on that, rather than fixing a known behavior in case of array overflow?

robinsonb5 · on Aug 17, 2023

If the compiler's going to interpret undefined behaviour as license to do something that runs counter to the programmer's expectations, the most sensible course of action is for the compiler to yell very loudly about it instead of near-silently producing (differently!) broken code.

Currently that piece of code doesn't trigger a warning with -Wall. It's not even flagged with -Wextra - it needs -Weverything.

moefh · on Aug 17, 2023

One man's "broken code produced by the compiler" is another man's "excellently optimized code by the compiler".

Where to draw the line is not always clear, but here's a very clear-cut example[1] where emitting a warning would be bad. If you don't want to watch the video, it's basically this:

- the code technically contains undefined behavior, but it will never be actually triggered by the program

- changing the code to remove undefined behavior forces the compiler to emit terrible code

Making the compiler yell at the programmer in this case would be terrible, but it's clearly a consequence of what you're asking.

[1] https://youtu.be/yG1OZ69H_-o?t=2358

jeffbee · on Aug 18, 2023

Exactly. I think a lot of this noise is by non-practitioners of the language. The compiler is steel-manning this loop. It is generously interpreting the 4 as irrelevant, and deducing that the loop must always exit early. The author can’t possibly have meant to access beyond the end, because that’s not defined. QED. It seems altogether sensible to me.

Joker_vD · on Aug 18, 2023

Wow, I must congratulate you because this reads equally well both as a serious argument and as a parody of that argument.

So let me reply to your comment as if it were serious: yes, if the programmer by supernatural means knows that the "v" is always presented somewhere in the array, then this function works exactly as intended: it would always return true, and the compiler optimises it to do so as quickly as possible! But... perhaps there is some other way to pass such programmer's knowledge ("the arguments are guaranteed to be such that this loop is guaranteed to finish early") to the compiler in a more explicit way? Some sort of explicitly written assertion? A pre-condition? A contract, if you like?

See, it's very difficuly to maintain such unspoken contracts and invariants during the codebases' life because they're unspoken and unwritten. Comments barely count since compilers generally ignore them.

jeffbee · on Aug 18, 2023

Thanks! I think anyone would have to be nuts to write a loop like this in C++ or tolerate C as a language. C++'s `ranges::find` does what it says, and communicates between the author and the reader as well as the author and the compiler.

robinsonb5 · on Aug 18, 2023

> One man's "broken code produced by the compiler" is another man's "excellently optimized code by the compiler".

To be fair it's not the compiler's fault that the source program is broken - the argument is over whether the compiler is being helpful or being obtuse, and this particular case I'd argue the latter.

Thanks for the video link - it's an interesting example, but the crucial difference there, I think, is that in that case the compiler isn't doing something counter to the programmer's intent. The code isn't incorrect (assuming a non-pathological buffer size) - it's merely more convenient for the compiler when expressed with int32_t indices rather than uint32_t indices.

I do appreciate, though, that deciding what to yell about and what not to yell about is an extremely non-trivial problem.

bondant · on Aug 18, 2023

I am not sure this warning is proving any overflow, you can get the same warning by just accessing table[i].

https://godbolt.org/z/Gxd3rK9Ts

fanf2 · on Aug 17, 2023

No, the logic for the optimization is:

- a correct program does not access table[4]

- therefore the loop must always exit early

- the only way to exit early is to return true

toxicdevil · on Aug 18, 2023

    int table[4];
    bool exists_in_table(int v)
    {
        for (int i = 0; i <= 4; i++) {
            if (table[i] == v) return true;
        }
        return false;
    }

i -> 0 goto return true or next iter i -> 1 goto return true or next iter i -> 2 goto return true or next iter i -> 3 goto return true or next iter

i -> 4 goto return true or exit loop and return false Since the branch in on undef behavior it is okay for the compiler to choose any branch destination or none at all (i.e. remove all further code). The compiler in this case likely chose to just remove the branch and any destinations. All that the prior code does is return true, since there is no next iter, so thats all what is left.

tedunangst · on Aug 17, 2023

No, the compiler knows the array isn't overflowed, because C programs don't contain overflows. Therefore the loop must exit via one of the return true statements.

Joker_vD · on Aug 18, 2023

> What's odd about that example is that the optimization is only valid if the loop in fact overflows the array every time.

No, the optimization is valid if the function is always called with "v" that is actually exists in the table; in this case the function should always return true so it's only proper for the compiler to throw out the extraneous code.

And writing the loop in such way is programmer's promise/guarantee to the compiler that the function will indeed be called only in that manner. That's the essence of the UB: it's the programmer who promises to the compiler that she will perform all the necessary checks (or formal proofs of impossibilty) herself; the compiler may go forward and rely on the implied invariants and preconditions.

And this is, of course, the main problem of the UB because 95% of the time the programmer does not actually intend to make such a gurantee: she simply is unaware (for whatever reason) that there is a pre-condition (checked by nobody!) that's required for the program's correctness to hold... or it's even just a typo she made.

gpderetta · on Aug 18, 2023

No, the compiler is proving that the return true statement must be executed given the axiom that the loop cannot overflow.

This is tricky, because the code is perfectly valid if it always early-exit (and I have written code like this myself that avoids bound checking by guaranteeing an early exit, when micro-optimizing), so it is hard to statically reject it. On the other hand it seems a very obvious thing to warn on.

uecker · on Aug 18, 2023

Note that not all claims you find about UB on the internet are true. For example, in C, UB can not time-travel before observable behavior. And in general UB can not time-travel before any function call when the compiler can not show that the function returns to the caller (MSVC got this wrong though).

vkazanov · on Aug 18, 2023

This is a great example.

There's an obvious UB, the compiler sees it at compile time and should just stop the programmer from doing the mistake at all times.

Instead, it's totally possible to just let the programmer tear his leg off for no clear reason.

Animats · on Aug 17, 2023

The three big questions:

1. How big is it?

2. Who owns it?

3. Who locks it?

Most undefined behavior in C/C++ involves those three questions.

#1 is historically the most troublesome. And the most inexcusable. Pascal, which predates C, didn't have that problem, because arrays carried size info. Nor did Algol, Modula I, Modula II, and Modula III. Modula I was a very low level language - device registers were a language concept.

Something I wrote on this back in 2012.[1] There was some consensus at the time that this would work and would be backwards compatible with C. But it would be a tough sell, and I didn't want to spend my life selling it.

[1] http://animats.com/papers/languages/safearraysforc43.pdf

ndesaulniers · on Aug 17, 2023

-fbounds-safety

https://discourse.llvm.org/t/rfc-enforcing-bounds-safety-in-...

Part of the proposal is to make fat pointers. Has ABI implications, but there are many pressure relief valves.

chongli · on Aug 18, 2023

I read most of that proposal and I really like it! I hope they get it into the next standard.

In a way, it’s kind of a very narrow form of dependent types. It solves this one issue very cleanly.

JonChesterfield · on Aug 17, 2023

I think a C implementation with overhead instead of UB is implementable. I'd like to know what the fundamental performance delta we get from UB is. Likewise not sure it's the right choice for my life's work.

Quekid5 · on Aug 17, 2023

The MINIMUM baseline is probably somewhere around ASAN/UBSAN/etc. and those aren't exactly cheap... and they don't even promise to catch all the problems. The problem is that almost every single little thing you can do in C has potential for UB, even just the + operator.

So it would absolutely come at a HUGE performance cost, unfortunately.

More esoteric stuff is: If you do pointer arithmetic that technically goes out of bounds and then in bounds again... that's technically UB (can't remember if this is C++ only or both), so you can't rely on knowing where everything is + bounds checks.

CryZe · on Aug 17, 2023

I'd say that those are not actually the minimum. You don't need to detect undefined behavior, you just need to "make it implementation defined", so integers could just wrap around without needing to check for overflow, unitialized memory would just be random bytes, and so on. At least that would be one way to do it. Of course the checks would ensure the program shuts down more cleanly.

celrod · on Aug 18, 2023

The nice thing about integer overflow being UB is that UBsan (or even compiler flags without it) can make it throw, so it's easy to catch. Defining it as wrapping means you still have bugs, they're just harder to catch.

Define it as wrapping, and there will be programs relying on it (that are currently using unsigned).

twoodfin · on Aug 18, 2023

Aliasing is a real problem. If any pointer of any type can write though to any value of any other type with some defined behavior, many key optimizations—moving a value into a register!—become impossible.

JonChesterfield · on Aug 18, 2023

That's the general consensus.

However, much the same key optimisations are useful when both pointers have the same type. Given multiple float* pointers, you still want to know whether they alias for load/store forwarding and the like.

Once you've written the analysis to partition values of the same type into separate alias sets, that same analysis runs fine on values of different types.

Distinct types implying distinct alias sets does tolerate separate compilation well and is cheap to compute. The belief that it is the only way to achieve said key optimisations is worth some scepticism now that link time optimisation and interprocedural analysis are fairly common.

twoodfin · on Aug 18, 2023

You can have the most sophisticated analysis imaginable, but then I cast rand() to a pointer and …

Dylan16807 · on Aug 18, 2023

In general I don't know how you make currently-UB aliasing safe. But in that specific case you can just say that the value not showing up through the bad alias is a possible defined behavior.

twoodfin · on Aug 18, 2023

But whether it shows up or not is a function of the optimization decision, which might have gone the other way for any number of hard-to-define reasons.

Dylan16807 · on Aug 18, 2023

Is that a problem, though?

I would have thought the best case scenario for a bad alias is that you safely corrupt some other piece of data. (Assuming you let those writes go through.) So "it might not do that" seems like an improvement to me.

If you don't let those writes go through, then we're in a different happier situation, and the register optimization causes no change in behavior.

twoodfin · on Aug 18, 2023

I guess I'm not sure which definition you're proposing:

a) Writes through "bad" aliases never take effect.

or

b) Writes through "bad" aliases take effect sometimes, no guarantees.

I don't think either makes the point you're hoping to.

b) is just bog standard undefined behavior. I guess you could be intending to constrain the range of what effects compiler-emitted code can have in bad-aliasing situations (no nasal daemons!) but it's not clear how much additional optimization latitude constraining only nasal daemons provides. Compilers often save local stack space by reusing stack entries for multiple local variables. A bad alias to one could corrupt an unrelated value, and you've got nasal daemons again.

a) requires massive compile-time and run-time effort to dynamically distinguish "bad" aliases from "good" ones. This is along the lines of what valgrind does at great cost.

Dylan16807 · on Aug 18, 2023

I thought the premise was that Someone Else already took care of those nasal demons somehow, and we're just worried about putting the register optimization back into place. So the range of effects has already been constrained to something safe. We're just adding in "sometimes it doesn't have those bad effects" to enable some optimizations, and that should have very little downside.

Options a and b are just the different ways Someone Else could have implemented their solution. I'm not suggesting how they did that, I'm taking it as the premise. The problem of "how do we make bad aliasing safe" is much much much harder than "how do we still enable normal optimizations like this after we make bad aliasing safe".

Dylan16807 · on Aug 18, 2023

> More esoteric stuff is: If you do pointer arithmetic that technically goes out of bounds and then in bounds again... that's technically UB (can't remember if this is C++ only or both), so you can't rely on knowing where everything is + bounds checks.

When the goal is preventing UB, you can simply define things like that as not-UB. At least on the vast majority of architectures where the numbers work fine at runtime, and only the compiler sees it as invalid behavior.

saagarjha · on Aug 18, 2023

Not really. Address Sanitizer must work with the existing ABI, which means it has a lot of overhead for an imperfect result.

jenadine · on Aug 18, 2023

What could be done better if it could break the ABI?

saagarjha · on Aug 18, 2023

You could pass around fat pointers everywhere. This is more efficient than probabilistically setting up redzones.

Quekid5 · on Aug 24, 2023

True, but that doesn't address (heh) the briefly-goes-out-of-bounds-and-then-in-bounds scenario. Again, I'm not 100% that that's technically UB, but I think so?

(It would have be to be for very esoteric addressing modes, maybe relevant on ancient pre-linear address space architectures?)

saagarjha · on Aug 26, 2023

That would be a check on each arithmetic operation.

jll29 · on Aug 17, 2023

...and Ada, too. I like the idea of attributes of data objects, to access the size of x simple write x'Size (also for types e.g. Natural'Size).

The Wirth languages (from which Ada is also a descendant) were so much more readable than C, yet relatively capable for systems programming, as demonstrated by systems like TeX, MacOS, Wirth's Modula compilers and the OS for the Lilith workstation he co-designed from scratch.

Gibbon1 · on Aug 17, 2023

Never used Ada but I think you can define range types so int range 0...11. Which I feel is something that you really want in embedded and applications level programming.

tialaramex · on Aug 17, 2023

In the medium-long term I want to do this for Rust as "Pattern types" because the thing I actually want (custom types with niches) is gated on Pattern types, as the way to explain to the type system where the niche goes is a Pattern. I was persuaded that we can't/ shouldn't just say we'll half ass it, we must do it properly if we're doing it.

e.g. I don't necessarily have a use for an integer from 0 to 11, but I do see a use for BalancedI8, a one byte type with values -127 to +127 via 0, thus omitting -128. I reckon lots of people don't need -128, whereas a niche is very useful. Rust provides NonZeroI8, which has -128 through +127 but no zero, but I find that's less often what you want, and it's not today possible to make your own in stable Rust (and in nightly Rust you need a not-for-mortals perma-unstable attribute today).

veber-alex · on Aug 17, 2023

You can make it on stable Rust by using an enum with a missing variant. The compiler will give you a niche automatically.

Of course that doesn't scale to larger types as even with a macro the compile speed will be horrible.

tialaramex · on Aug 18, 2023

Yeah, "enum with a missing variant" is like, awful but maybe viable for BalancedI8, but it's clearly insane for BalancedI32 let alone BalancedI64 and those seem at least as useful.

Today, in practice, if you want BalancedI32 what you'd do is swizzle NonZeroI32, so that a trivial CPU operation converts between the two types and you can deliver almost the same optimisations in practice. But, I think that's also pretty ugly, and Rust clearly could do the nicer thing here, indeed it does in my crate that only builds on nightly and is using not-for-public-use compiler internal attributes. So that's what I want, just apparently not enough to spend my vacation time last month working on it. Maybe next month.

thesuperbigfrog · on Aug 17, 2023

>> Never used Ada but I think you can define range types so int range 0...11.

Yes. Ada supports integral types with custom ranges:

https://learn.adacore.com/courses/intro-to-ada/chapters/stro...

wahern · on Aug 18, 2023

From a tutorial at https://www.modula2.org/tutor/chapter11.php:

> The ALLOCATE procedure requires 2 arguments, the first of which is a pointer which will be used to point to the desired new block of dynamically allocated menory, and the second which gives the size of the block in bytes. The supplied function TSIZE will return the size of the block of data required by the TYPE supplied to it as an argument. Be sure to use the TYPE of the data and not the TYPE of the pointer to the data for the argument.

So what happens if you pass the wrong value as the second argument to ALLOCATE? Does the compiler throw an error, or are you now risking a buffer overflow?

C has arrays. Arrays carry size. The problems in C are much more nuanced than people seem to suggest, and people exaggerate the extent to which other languages do better in this regard, though to be sure C has significant deficiencies.

Joker_vD · on Aug 18, 2023

Hm, the proposal is rather sensible but... didn't C99 introduce static-dimensions in array-typed function parameters? I'm pretty sure

    void copybyref(size_t n, int a[static n], const int b[static n]) {
        for (int i = 0; i < n; i++) {
            a[i] = b[i];
        }
    }

is valid C and has exactly the same semantics as the example from the proposal — except that in this case, "no diagnostic is required" to ensure that there are indeed (at least) n elements in both a and b arrays.

saagarjha · on Aug 18, 2023

Checking this in general has ABI considerations. It allows the compiler to do some things at compile-time but beyond that it doesn't really work.

Joker_vD · on Aug 18, 2023

> Checking this in general has ABI considerations

Does it? Inserting a sanity check at every call site of e.g. memcpy (that neither of dst/src are NULL) is already kinda required for correctness even if people skip it and boldly go.

winrid · on Aug 17, 2023

#4 which is partly #2 - what thread is this callback being invoked in? The calling thread? A thread pool in the library?

Mostly a problem I have in java libraries, though.

Joker_vD · on Aug 17, 2023

> Case 2: (b == 0) || ((a == INT32_MIN) && (b == -1))

> A Java compiler, in contrast, has obligations in Case 2 and must deal with it (though in this particular case, it is likely that there won’t be runtime overhead since processors can usually provide trapping behavior for integer divide by zero).

Actually, there will be runtime overhead on x86/x64: Java mandates that Integer.MinValue / (-1) evaluates to Integer.MinValue (see 15.17.2. "Division Operator /" of the Java Language Specification) but IDIV instruction raises #DE in such circumstance. So the JITter actually emits

        cmp  eax, 0x80000000
        jne  .normalCase
        xor  edx, edx
        cmp  $reg, -1 
        je   .specialCase
    .normalCase:
        cdq
        idiv $reg
    .specialCase:

code sequence as you can see in its source ([0][1]) instead of simplistic "cdq; idiv $reg": because it does not want trapping behaviour in this particular case; but e.g. AArch64 doesn't trap neither division by zero nor INT_MIN / -1. That's why accurately implementing your language's semantics on different platforms is so annoying and why C standard left itself a nice shortcut.

[0] https://github.com/openjdk/jdk/blob/d27daf01d6361513a815e783...

[1] https://github.com/openjdk/jdk/blob/d27daf01d6361513a815e783...

fluoridation · on Aug 17, 2023

On the other hand, C left the burden of implementing portable semantics to its users.

Joker_vD · on Aug 17, 2023

Yes, but when C was being made, the application-level programmers knew the quirks of the platforms they used just as well as the compiler writers because they were almost precisely the same people.

zer8k · on Aug 17, 2023

> In the long run, unsafe programming languages will not be used by mainstream developers, but rather reserved for situations where high performance and a low resource footprint are critical.

I see no world where so-called "unsafe" languages would not be used. Most graduates of Computer Science programs can, perhaps with some trouble, implement a half decent C compiler in a weekend or two. This is not a footnote. This fact alone means that for any given piece of hardware you're more likely to find a random C compiler you can use than anything else. Rust, being the most likely contender to replace it, still cannot self-host and the grammar is exponentially more complicated than C. It is more like C + <whatever> will co-exist peacefully than something like C being replaced (even ignoring the millions of lines of code that already exist). Not for performance reasons but more that you can churn out a C compiler quickly for almost anything given a spec of the hardware.

On topic, I find a desk reference for this is very useful. The CERT C standard is pretty good to thumb through even if you don't adhere to every suggestion.

ladberg · on Aug 17, 2023

Eh, I don't disagree that unsafe languages will continue to be used, but I disagree with ease of compiler design as the reason.

You are comparing one of the easier languages to write a compiler for (C) with one of the hardest (Rust), and that's not due to UB but due to other facets of the languages. I could make up a new language that's equivalent to C in every way except replace all UB with defined behavior and it wouldn't make the naive compiler any different.

Additionally, writing a compiler for a language should really be a thing that happens only a handful of times while executing the code happens trillions of times so I hope we don't sacrifice safety to save compiler authors some work.

jenadine · on Aug 18, 2023

> except replace all UB with defined behavior and it wouldn't make the naive compiler any different.

I wonder how you will define use after free

adgjlsfhk1 · on Aug 18, 2023

returns a random value or errors. the main problem with UB isn't that it has unpredictable behavior, it's that it also inserts an unreachable that allows the compiler to assume it doesn't happen.

colejohnson66 · on Aug 18, 2023

Clarifying UB as “inserting a ‘__builtin_unreachable()’” actually just cleared up a lot of confusions I’ve had!

patrec · on Aug 17, 2023

> Most graduates of Computer Science programs can, perhaps with some trouble, implement a half decent C compiler in a weekend or two.

Where "most" of course means < 0.1%.

TillE · on Aug 17, 2023

Two weekends is ludicrously optimistic even if you're leaning heavily on existing parser libraries, but if you've taken a programming languages class, you can write a simple compiler for a large subset of C.

dralley · on Aug 17, 2023

> Rust, being the most likely contender to replace it, still cannot self-host

What do you mean, "still cannot self-host?"

You say that like it's a critical failure of the Rust project that they need and are attempting to address rather than a trivia item. Rust is perfectly happy relying on LLVM just like (checks notes) half the other languages in existence.

Libraries like LLVM are precisely what the comment you quote is talking about.

I'm not even sure that's true, anyway, with the cranelift backend. Someone can chime in on whether it's good enough for bootstrapping.

merlincorey · on Aug 17, 2023

Self Hosting your own compiler traditionally was the "end-game" of making a compile-able language. It's a sort of proof of fitness that the language can literally stand on its own.

This article about Zig achieving self-hosted status in 2022[0] points out that they gained many advantages at the cost of a lot of time and effort through this process. Incidentally, they decided to self-host while also supporting LLVM because of deficiencies in LLVM (mainly speed and target limitations). This flexibility includes a separate "C" backend to compile Zig to C in order to target for example game consoles that require a specific C compiler be used.

> You say that like it's a critical goal of the Rust project rather than a trivia item.

In my opinion, you are overly minimizing the potential benefits to Rust and the Rust community for Rust to be self-hosted.

Of course, practically, right now it doesn't matter because most people are more than happy to use the already working system.

[0] https://kristoff.it/blog/zig-self-hosted-now-what/

dralley · on Aug 17, 2023

As I said, the cranelift backend exists, and it provides many of the same benefits such as improved compilation speed. And it's written in Rust.

But it still feels like a trivia item. C compilers written in C exist, but almost nobody actually uses them. They use GCC, Clang, and MSVC, written in C++. Everybody knows that it's possible to self-host C, so the benefit of actually doing so in practice is minimal.

It's obviously possible to write a Rust compiler in Rust end-to-end. Acting like it's a second tier language because actively doing so not a top focus of the community is gatekeep-y and ridiculous.

zer8k · on Aug 17, 2023

Except gluing yourself to LLVM has it's own problems. Like, for example, any platform that LLVM doesn't support you can't support either. LLVM is great. The monoculture and smug elitism it produces is not.

> Acting like it's a second tier language because actively doing so not a top focus of the community is gatekeep-y and ridiculous.

It is probably one of the major reasons we won't see a Rust compiler shipped with an operating system for a very long time. That doesn't make it second tier. However, Rust fans seem to want to stick their head in the sand when their baby is criticized. I am a Rust (language) fan myself. I am just willing to criticize the language. I do not understand why the Rust community has such a volatile response to honest, valid, criticism.

dralley · on Aug 17, 2023

>It is probably one of the major reasons we won't see a Rust compiler shipped with an operating system for a very long time.

Even most linux distros don't ship with GCC out of the box... much less MacOS and Windows with their respective compilers.

If your standard is "Gentoo and FreeBSD will never ship it out of the box" then I'm going to 100% stand by my statement that this is weird and gatekeep-y.

Especially when the Windows kernel and userspace system libraries both have Rust in them.

https://www.bleepingcomputer.com/news/microsoft/new-windows-...

https://www.thurrott.com/windows/282471/microsoft-is-rewriti...

tialaramex · on Aug 17, 2023

> we won't see a Rust compiler shipped with an operating system for a very long time.

I can't figure out what this constraint means.

My Windows laptop doesn't seem to have provided a C compiler, so, maybe that's a problem for Windows?

Huh, well I guess I can buy or download a third party compiler, that's easy enough, but then, I can do that for Rust too, so, doesn't seem like a difference.

Meanwhile on this Fedora machine, the Rust compiler came with the OS. So, is this not an operating system? Maybe the stuff it comes with isn't "shipped with" it somehow? And so there's no C compiler "shipped with" this operating system either, although GCC was installed too ? I just don't know what to make of such a criticism.

learn-forever · on Aug 17, 2023

it's a ridiculous criticism, and the insult doesn't make it less ridiculous

merlincorey · on Aug 17, 2023

> Acting like it's a second tier language because actively doing so not a top focus of the community is gatekeep-y and ridiculous.

Here's where I think you are quite a bit off target, personally.

I certainly was not and I don't believe the GP you originally responded to was saying that "Rust is a second tier language due to [lack of self-hosted compiler]", so hopefully we can set that statement aside and ignore it now.

Let's instead focus on your first statement, which is directly related to what GP and I were arguing:

> It's obviously possible to write a Rust compiler in Rust end-to-end.

It is certainly possible but actually doing so is completely non-obvious because the grammar for Rust is much more complicated than C, and Rust has no formal language specification (let alone an international standard).

While Python does not have an international standard, it does have a formal language specification, which is what allows for things like PyPy to exist.

Meanwhile, to truly understand Rust, one must be an expert in C and learn the `rustc` code base.

It seems like, practically, knowing C and being able to write compilers in C is quite useful if you want to make an impact in Rust or maybe try your hand at making some future Rust replacement (hopefully with a language specification that others can follow).

dralley · on Aug 17, 2023

> It is certainly possible but actually doing so is completely non-obvious because the grammar for Rust is much more complicated than C, and Rust has no formal language specification (let alone an international standard).

The Rust compiler frontend is written in Rust. It doesn't matter how non-trivial writing a Rust frontend is if you can restrict the problem domain to writing a new backend for the existing compiler frontend.

And you can. As it stands there is the LLVM backend that everyone is familiar with, the GCC backend which is nearing completion, and the Cranelift backend which is written in Rust.

Zig is similar. Yes, they are going to replace LLVM by default, but they're not getting rid of their LLVM backend entirely. The main difference between Rust and Zig here is a matter of defaults, where Rust defaults to using LLVM while Zig will default to their self-hosted compiler.

> Meanwhile, to truly understand Rust, one must be an expert in C and learn the `rustc` code base.

Are you under the impression that the "rustc" codebase is written in C/C++? It is not... It uses LLVM, yes, but it's written in Rust.

> I certainly was not and I don't believe the GP you originally responded to was saying that "Rust is a second tier language due to [lack of self-hosted compiler]", so hopefully we can set that statement aside and ignore it now.

The discussion started with the statement that Rust will never replace unsafe languages without the ability to self-host, and then continued with the statement that "Self Hosting your own compiler traditionally was the "end-game" of making a compile-able language. It's a sort of proof of fitness that the language can literally stand on its own."

I don't think that was a completely unfair reading of these statements. The implication is that Rust is "not a fit language" because it "cannot stand on its own" and therefore "will never replace unsafe languages".

zer8k · on Aug 17, 2023

> I don't think that was a completely unfair reading of these statements. The implication is that Rust is "not a fit language" because it "cannot stand on its own" and therefore "will never replace unsafe languages".

I didn't intend this. The primary gripe I had was the grammar being complicated (and to be fair...not really available in an easy way). That means the places we are most likely see such bare metal shenanigans may not adopt it because they can't draft a XYZ Co. Compiler. This is a semi-common pattern with chip manufacturers.

The conversation diverged after that. Self-hosting is simply a signal that a language is "strong enough to stand on its own". That doesn't mean non-self hosted languages are bad. It just means you still need something else to bootstrap it. In the land of bare metal stuff like this matters.

kibwen · on Aug 18, 2023

> The primary gripe I had was the grammar being complicated

But C's grammar is so weird that it requires the lexer hack in order to parse (https://en.wikipedia.org/wiki/Lexer_hack). Rust's grammar is simple by comparison. Yes, Rust has more syntax than C, but the syntax that it has is more regular.

merlincorey · on Aug 17, 2023

> Zig is similar. Yes, they are going to replace LLVM by default, but they're not getting rid of their LLVM backend entirely.

In the article I linked, they did not say they were replacing LLVM by default, but they did say it would become the default for DEBUG builds due to the faster speed of compilation, to be clear.

> > Meanwhile, to truly understand Rust, one must be an expert in C and learn the `rustc` code base.

> Are you under the impression that the "rustc" codebase is written in C/C++? It is not... It uses LLVM, yes, but it's written in Rust.

I am not under that impression, but I can see how my phrasing leads to that conclusion.

After reviewing Rust's Bootstrap on Github[0] I can now more precisely state that one's understanding of low-level Rust will be enhanced by knowing C/C++ (for the LLVM portions) as well as Python (for the Rust does not exist on this system downloading of the stage0 binary Cargo and Rust compilers from somewhere else).

> Cranelift backend which is written in Rust

When this happens, it seems like it'll be possible to get the LLVM bits out of the bootstrap process and lead to a fully self-hosted Rust.

So while you may not personally value that, it seems like some people in the Rust community do.

[0] https://github.com/rust-lang/rust/tree/master/src/bootstrap

LegionMammal978 · on Aug 17, 2023

> When this happens, it seems like it'll be possible to get the LLVM bits out of the bootstrap process and lead to a fully self-hosted Rust.

What do you mean by "when this happens"? GP's point is that this has already happened: the Cranelift backend is feature-complete from the perspective of the language [0], except for inline assembly and unwinding on panic. It was merged into the upstream compiler in 2020 [1], and a Cranelift-based Rust compiler is perfectly capable of bootstrapping another Rust compiler (with some config changes) [2].

[0] https://github.com/bjorn3/rustc_codegen_cranelift

[1] https://github.com/rust-lang/rust/pull/77975

[2] https://github.com/bjorn3/rustc_codegen_cranelift/actions/ru...

merlincorey · on Aug 18, 2023

So then why wasn't the initial response "Rust is self hostable"?

Perhaps there's something we're both missing here.

dralley · on Aug 18, 2023

I mean, it kinda was. It was the last sentence in the first post.

It wasn't my only response because

1) I couldn't remember if it had any significant limitations and couldn't be bothered to check

2) I still objected to the premise

dale_glass · on Aug 17, 2023

I don't see how that reasoning is supposed to work in modern times.

Who out there is seriously using a compiler churned out in a weekend? The fact that you can do it doesn't mean anybody seriously would use that.

We're also not really creating architectures anymore. There's RISC-V, and Rust already supports that.

zer8k · on Aug 17, 2023

> Who out there is seriously using a compiler churned out in a weekend?

Someone at a chip manufacturer writing something for a brand new chipset, for example. It takes a long time to get stuff shoved into GCC. It's only in recent history has life settled on one or two "big" compilers. There are still plenty of other places where you will find bespoke compilers. Perhaps not commonly, but they do exist (especially in embedded).

tempest_ · on Aug 17, 2023

Are they actually bespoke compilers, or just some decade old fork of gcc that works for that chipset and is never updated again?

( I am not being facetious )

sigsev_251 · on Aug 18, 2023

From my experience, while many MCUs have settled for the big compilers (GCC and Clang), DSPs and some FPGAs (not Intel and Xilinx, those have lately settled for Clang and a combination of Clang and GCC respectively) use some pretty bespoke compilers (just running ./<compiler> --version is enough to verify this, if the compiler even offers that option). That's not necessarily bad, since many of them offer some really useful features, but error messages can be really cryptic in some cases. Also some industries require use of verified compilers, like CompCert[1], and in such cases GCC and Clang just don't cut it.

[1]: https://compcert.org/

marcosdumay · on Aug 17, 2023

Well, nowadays you are supposed to port a backend, and not write an entire compiler.

pjmlp · on Aug 17, 2023

Just wait until CVE become a liability like handling hazardous chemicals.

duped · on Aug 17, 2023

C++ shares ubiquity with C and it's not because it's easy to parse.

The reason is the same though: ISA authors know how to write backends for GCC, which is what they've done for 30 years when they want to get people to use their chip.

titzer · on Aug 17, 2023

> Most graduates of Computer Science programs can, perhaps with some trouble, implement a half decent C compiler in a weekend or two.

I see you have not met most graduates of Computer Science programs.

> for any given piece of hardware you're more likely to find a random C compiler

This might have been true 10 or 20 years ago, but these days, C has grown so complex that the random C compiler is likely gcc or LLVM with not much else.

> you can churn out a C compiler quickly for almost anything given a spec of the hardware.

Who is doing this? There aren't a lot of new ISAs coming out now, and when they do come out, they usually put effort into porting either LLVM or gcc.

deaddodo · on Aug 17, 2023

> Most graduates of Computer Science programs can, perhaps with some trouble, implement a half decent C compiler in a weekend or two.

Probably not. Compiler design (lexical+syntactical analysis and code generation) isn't covered by most basic CS bachelor's programs.

Theoretically, a CompSci graduate should be able to reference the necessary materials, ingest the theory and make a basic multipass, non-optimizing C compiler. But I would still be skeptical of a good chunk of recent (as in, lack of experience; not year of graduation) graduates; and would assume that it would take >4 days ("two weekends").

armchairhacker · on Aug 17, 2023

I took a compilers class in undergraduate. We read Appel's Modern Compiler Implementation in ML (https://www.cs.princeton.edu/~appel/modern/ml/) and built a compiler for the Tiger language (basically https://github.com/FlexW/tiger-compiler but in SML).

We covered the main chapters but not the advanced topics. If you look at the TOC (https://www.cs.princeton.edu/~appel/modern/toc.html) you can get an idea of the necessary steps:

- Lexing

- Parsing

- Symbol table generation

- Type checking

- IR

- Some static analyses (e.g. liveness)

- Some optimizations (e.g. constant folding)

- Assembly generation

- Register allocation

Me and my 1 group-mate had a working compiler at the end. It took a whole semester, sure I was taking other classes and doing other things, but it certainly took more than 2 weekends worth of work (maybe if you count raw hours, 96 in total...).

badsectoracula · on Aug 17, 2023

> Most graduates of Computer Science programs can, perhaps with some trouble, implement a half decent C compiler in a weekend or two. This is not a footnote. This fact alone means that for any given piece of hardware you're more likely to find a random C compiler you can use than anything else.

I think C being a (relatively) very simple is indeed a feature it has - however not so much because you can make a compiler for it easily (not that it isn't a pro, but it isn't that important in practice) but because it means it is easier to learn and easier to write tools for.

glouwbug · on Aug 17, 2023

Took me 6 years to implement my own in C. It ended up being python with braces. Recursive descent parser are incredibly brittle and take extreme concentration and patience

dang · on Aug 17, 2023

A Guide to Undefined Behavior in C and C++ (2010) - https://news.ycombinator.com/item?id=9884074 - July 2015 (10 comments)

A Guide to Undefined Behavior in C and C++, Part 1 - https://news.ycombinator.com/item?id=2544159 - May 2011 (2 comments)

zabzonk · on Aug 17, 2023

perhaps it is just me, but i have never experienced any of the problems outlined in the comments here, despite of writing a shedload of C and C++ code (and fortran, assembler and other stuff). and i don't think i am a coding god.

kazinator · on Aug 18, 2023

Article misses the angle that "undefined behavior" is a formal term which means that ISO C doesn't define behavior, not that nobody whatsoever defines the behavior.

The following is undefined behavior:

  #include <unistd.h>

A conforming implementation (one that obviously isn't POSIX) can behave such that the hard drive is wiped out either at compile time, or at the run-time of the translated and linked translation unit.

ISO C defines the behavior up until specifying that the header is searched for somehow and that a diagnostic is required if it is not found. Since the header is nonstandard, it's possible that it is found. And then, the standard says nothing about it.

spacechild1 · on Aug 18, 2023

Could it be that you're confusing "undefined behavior" with "implemented-defined behavior"?

kazinator · on Aug 18, 2023

No, it coudln't.

If you think so, find the chapter and verse in ISO C which says that implementors must choose a behavior for #include <unistd.h> and document it ("implementation-defined") or without being required to document it ("unspecified").

Can you prove that a conforming implementation of any version of ISO C, from the standard alone, may not segfault or wipe storage device as a result of #include <unistd.h>? (I.e. that if that were to happen, the implementation would be nonconforming?)

When a POSIX implementation gives you #include <unistd.h>, it is "behaving in a documented manner characteristic of the implementation": one of the possible ways that undefined behavior plays out!

spacechild1 · on Aug 18, 2023

I have to admit that your original post confused me.

Why exactly do think that #include <unistd.h> is undefined behavior? Here's what the C11 standard draft says in 6.10.2 (source file inclusion):

A preprocessing directive of the form # include <h-char-sequence> new-line searches a sequence of implementation-defined places for a header identified uniquely by the specified sequence between the < and > delimiters, and causes the replacement of that directive by the entire contents of the header. How the places are specified or the header identified is implementation-defined.

Seems pretty well-defined to me.

EDITED to include C standard reference.

kazinator · on Aug 18, 2023

> Why exactly do think that #include <unistd.h> is undefined behavior?

Because ISO C doesn't say what happens if the implementation finds something by that name, and replaces the #include directive with that content.

A given system could have a unistd.h file which contains

#pragma __self_destruct

which causes the machine to melt down.

spacechild1 · on Aug 18, 2023

That's a bit silly. By that logic, every single macro - and every single non-library function call - would be undefined behavior as well.

But in general you are right that undefined behavior does not necessarily have to be explicitly stated as such:

> If a ‘‘shall’’ or ‘‘shall not’’ requirement that appears outside of a constraint or runtime-constraint is violated, the behavior is undefined. Undefined behavior is otherwise indicated in this International Standard by the words ‘‘undefined behavior’’ or by the omission of any explicit definition of behavior. There is no difference in emphasis among these three; they all describe ‘‘behavior that is undefined’’.

I was not aware of this.

kazinator · on Aug 18, 2023

Why yes. This is undefined behavior:

  int main(void)
  {
     char buf[256];
     read(0, buf, sizeof buf);
  }

The translation unit doesn't define a read function anywhere, and it's not in ISO C.

Suppose that we translate this unit and link it to make a program.

There are two possibilities: (1) the program fails to link due to the unresolved reference. (2) it actually links, producing an executable.

Under (2) the behavior is undefined.

The POSIX read is a documented extension that stands in the place of undefined behavior.

Libraries have to allow the ISO C program to define a function called read because the language doesn't reserve the identifier.

The GNU C library makes read a weak symbol, aliasing to __libc_read.

Internally, it calls __libc_read, so that it doesn't break if the application redefines read.

So ISO C undefined behavior is a big area. It contains documented extensions like <unistd.h> and read, well as null pointer dereferences, divisions by zero, out-of-bounds access, double free, ...

> I was not aware of this.

You might want to keep it to yourself though, if you don't want downvotes.

spacechild1 · on Aug 19, 2023

> A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3.

With your interpretation of undefined behavior, there cannot be any correct program containing #define, #include or (non-library) function calls. This is obviously absurd.

The C standard, as any other written text, is ambiguous in certain places and requires interpretation. I am not a language lawyer, but I think your reading of "by the omission of any explicit definition of behavior" is too broad to be useful and it's very likely not what the C standard people intended. It's funny, though!

> You might want to keep it to yourself though, if you don't want downvotes.

Why? I'm happy to admit when I am wrong or learned something.

EDIT unspecified -> undefined

kazinator · on Aug 19, 2023

I'm not aware that I gave an interpretation of unspecified behavior. That is different from undefined behavior.

spacechild1 · on Aug 19, 2023

That was obviously a typo. C'mon.

kazinator · on Aug 19, 2023

I don't follow. What is a typo? Whose typo? I see that exact text as far back as C99 (in 4 Conformance) "A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3."

spacechild1 · on Aug 19, 2023

> I don't follow. What is a typo? Whose typo?

My second paragraph did contain a typo ("unspecified" instead of "undefined") which I have fixed (see EDIT).

> "A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3."

In conjuction with the preceding passage about undefined behavior, I read this as follows: Any program that does not violate ‘‘shall’’ or ‘‘shall not’’ requirements, does not contain undefined behavior and operates on correct data is a correct program; it may, however, contain unspecified behavior. How would you read this instead?

Back to 4 ("Conformance"):

> [...] Undefined behavior is otherwise indicated in this International Standard by the words ‘‘undefined behavior’’ or by the omission of any explicit definition of behavior. [...]

What is an explicit definition? It just isn't possible to completely define any phenomenon down to the smallest detail, so one can always come up with a case that isn't "explicitly defined". IMHO this passage is a mistake in the standard.

kazinator · on Aug 19, 2023

The purpose of that sentence is not entirely clear to me. It seems to be defining the term "correct program", but hasn't given the term in italics, and the term is not used anywhere. Most C programs contain unspecified behavior because, for example, the order of evaluation of function arguments is unspecified. If we call any function of two or more arguments, unspecified behavior has occurred. Usually, there is no visible difference, so it doesn't matter. It's not clear to me why that sentence includes unspecified behavior, but not implementation-defined. I think it may be intended to include implementation-defined, since implementation-defined behavior is "unspecified behavior where each implementation documents how the choice is made".

In any case, it's saying that nonportable programs can be correct. Later in another paragraph, a strictly-conforming program concept is defined which cannot produce output dependent on implementation-defined or unspecified behavior. Those behaviors themselves cannot be eliminated (like unspecified evaluation orders) but the program doesn't produce any output dependent on them.

Note that although #include <unistd.h> is defined by implementations, it is not "implementation-defined behavior". It is a documented extension. The same section talks about extensions a little down "A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program."

An explicit definition is where the standard gives a requirement in words. I think the word "explicit" is there just for emphasis. Definitions are explicit. Explicit is the opposite of implicit; implicit means understood without words. You can't define something without words. Maybe the purpose is to discourage implementors and users from making tenuous inferences of unwritten definitions of behavior.

matt3210 · on Aug 17, 2023

What behaviors are undefined in rust? Oh wait nobody knows, since it has no standard or language spec.

jcranmer · on Aug 17, 2023

* Reading uninitialized memory

* Violating pointer provenance

* Out-of-bounds pointer accesses (though unlike C, I think, it's legal to make a pointer go out-of-bounds and bring it back in-bounds and use it)

* Use-after-lifetime

* Storing trap representations in variables

* Having two mutable references to the same memory location

* Data races

Not an exhaustive list, and C has most of these (even the last one, although change "two mutable references" to "two restrict pointers"). Of course, C itself doesn't have an exhaustive list (J.2 is not, in fact, an exhaustive list).

JonChesterfield · on Aug 17, 2023

Pointer provenance is a nice example. A block of memory cannot be read as an array of simd types sometimes and scalar types otherwise. It can't contain atomic values which are operated on using non-atomic operations during program startup before you spawn any threads.

There were proposals to let one mmap existing structures but I don't know if any landed. Usually done with reinterpret cast and hoping that rule violation doesn't break you.

Pointer provenance does make most application code faster but other times it opens a performance gap that you have to step outside of C++ to close. Compiler extensions, switching off the analysis, changing language.

agalunar · on Aug 17, 2023

> A block of memory cannot be read as an array of simd types sometimes and scalar types otherwise.

As far as I can tell, it is currently the case that, using raw pointers, this is not actually undefined behavior (but I never entirely trust my conclusions on these matters).

"&mut T and &T follow LLVM’s scoped noalias model" [1][referring to 2 and 3] but I am fairly sure this does not currently apply to raw pointers, and "provenance is implicitly shared with all pointers transitively derived from the original pointer through operations like offset, borrowing, and pointer casts." [4]

[1] https://doc.rust-lang.org/reference/behavior-considered-unde...

[2] https://llvm.org/docs/LangRef.html#pointeraliasing

[3] "noalias" under https://llvm.org/docs/LangRef.html#parameter-attributes

[4] https://doc.rust-lang.org/core/ptr/index.html

Also excellent are

https://faultlore.com/blah/fix-rust-pointers

https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html

https://www.ralfj.de/blog/2020/12/14/provenance.html

https://www.ralfj.de/blog/2022/04/11/provenance-exposed.html

It seems likely you'd already be familiar with these; I'm just putting them out there for anyone interested.

JonChesterfield · on Aug 17, 2023

LLVM can represent various aliasing relationships, modulo some risk of C++ inspired bugs in some passes. They might all be stamped out now. I remember a bug report about one that was open for many years.

I'm happy to hear rust can (probably) represent the same relationships LLVM can. C++ cannot, at least as of about two years ago when I last looked through the corresponding papers. All it can do is different types do not alias, where atomic_int and int are different types.

tialaramex · on Aug 18, 2023

No, LLVM definitely still has big problems. https://github.com/llvm/llvm-project/issues/45725 is an example, the symptom in Rust is that you can write what is in effect a pointer comparison in which LLVM ends up claiming that two things are different, although they are also identical...

angiosperm · on Aug 17, 2023

Use of mmap itself is undefined in the language.

Posix provides a definition that programs rely on, instead. Implementers are allowed to define literally anything the union of all standards leaves undefined.

JonChesterfield · on Aug 17, 2023

Mmap itself is alright. You've got a void* from somewhere, that's OK. You can placement new into it to make objects.

What isn't allowed is casting it to a hashtable type and then using it as such. Because there is no hashtable instance anywhere, and specifically not there, so you've violated the pointer aliasing rules.

The obvious fix is to guarantee that placement new doesn't change the bytes, perhaps only for trivially copyable types or similar constraint. I didn't see the proposals in that direction land but also didn't see them fail, so maybe the newer standard permits it.

LegionMammal978 · on Aug 17, 2023

As I understand it, that's precisely what std::start_lifetime_as<T>() does: it effectively performs a placement new to create a T object, except that it retains the existing bytes at the address. It only works with implicit-lifetime types (i.e., scalars, or classes with a trivial constructor), though, so it probably wouldn't work with your hash table example, except perhaps for an inline hash table.

JonChesterfield · on Aug 17, 2023

Superb! Looking through https://en.cppreference.com/w/cpp/memory/start_lifetime_as, this appears to be the right thing. It also has volatile overloads (which it looks like placement new still does not). This doesn't appear to be implemented in libc++ yet but that seems fixable, it'll go down the same object construction logic placement new does. Thank you for the reference, that'll fix some ugly edge cases in one of my libraries.

angiosperm · on Aug 18, 2023

To call mmap, you are calling a function that is not in the collection of translation units that makes the program. Libraries are beyond the Standard. Include files not listed in the Standard, likewise. So you rely on Posix, there.

For objects got by casting void* to a known type, you rely on the compiler being unable to prove that the objects didn't exist already, somewhere in the program. Pray the compiler doesn't get smart enough to notice no constructor for that type is linked, meaning you that you couldn't have made that object.

klankerzz · on Aug 18, 2023

Can't you also just say that mmap() is a magical function that you don't know what it does?

For all the compiler knows, mmap() could just be a:

  static Hash_Table h; return (void *)&h;

And make that the rule for all externally defined functions.

gpderetta · on Aug 18, 2023

Indeed. That's the way to reason about correctness of opaque functions.

proto_lambda · on Aug 17, 2023

There is no undefined behaviour in Safe Rust. You're right about Unsafe Rust of course.

lionkor · on Aug 17, 2023

The ultimate "the code is the documentation" is "the compiler is the language spec".

thesuperbigfrog · on Aug 17, 2023

>> The ultimate "the code is the documentation" is "the compiler is the language spec".

Rust has a great potential to become a replacement for C and C++, but the lack of a language specification is a shortcoming that needs to be addressed for it to see wider adoption, especially for safety-critical systems.

If the Rust compiler does something surprising, people will ask, "Is this a bug?" and without a spec the answer becomes the language developers or the community asking, "What should the compiler do in this situation?".

It makes sense because the correct behavior (whatever that is) has not been defined, but it has a feeling of "we are making this up as we go along" because there is no formalized answer defined. While this approach is fine for running your website or building a command line tool, it is not acceptable for safety-critical software. If the software breaks and people die, the "we are making this up as we go along" approach is not acceptable because it has too much risk.

lionkor · on Aug 17, 2023

I fully agree, and its definitely a strange feeling coming from C++ to not have a single, complete and extensive spec to read up on if all else fails.

I want to like Rust, but its already a kitchen sink on par with C++ in complexity and misused quirks, not to mention macros which hide complexity just like C macros did, that the lack of a committee and spec makes it very difficult to trust that it won't get more and more features as time goes on (becoming like C++, in only the bad ways).

I understand they have an RFC process, but thats not enough for a language which is now so commonplace in discussion (usually in the form of "if you did it in Rust, this problem wouldnt exist", which is often even true).

tialaramex · on Aug 17, 2023

> a single, complete and extensive spec to read up on if all else fails

Did you try using the "single, complete and extensive spec" ? What for and how successful was that ?

The ISO C++ standard was published in 1998, so, about 25 years ago. One of the things it says, even in the C++ 23 standard that's likely to be published later this year, is that some input files have Undefined Behaviour during parsing.

But, wait a minute, Undefined Behaviour is a runtime property. Parsing isn't a runtime activity. This "complete" specification clearly was never even proofread. Which makes a kind of sense, it's an enormous sprawling document, why would anybody properly read it. But, if they actually don't, what's the point ?

The fix for this - hopefully to land in C++ 26 - is P2612, named "UB? In my lexer?" because it's been so long that even "It's more likely than you think" memes https://knowyourmeme.com/memes/its-more-likely-than-you-thin... are now dad jokes. But don't focus on this particular minor bug, which is not a big deal, focus on what it means about the value of the specification.

jcranmer · on Aug 17, 2023

> But, wait a minute, Undefined Behaviour is a runtime property. Parsing isn't a runtime activity.

So it turns out you're wrong here. UB can also be intentional extension points, and these aren't implementation-defined behaviors because honestly I don't want to track down 25-year-old documents to figure out what's going on here. This use of UB in the standard has diminished greatly in the past few decades (although there are still remnants of it kicking around, e.g., the lexer UB), and the extra focus on nasal demon aspects of UB from ~15 years ago really obscures this nature of UB.

One annoying thing about UB is that it is actually several different concepts with the same name. In addition to the aforementioned use, it can also refer to behavior that can go haywire in ways really impossible to constrain (buffer overflow is the classical example here). Or it can refer to intentional optimization instructions (e.g., restrict and strict aliasing). Or it can refer to axioms you need to have hold or else you have no clue how to think about semantics (pointer provenance, data races). Or, incorrectly but depressingly common, lay people can use it to refer to what the specification considers implementation-defined behavior (e.g., size of data types). Working out which kind of UB people are referring to when they use the term is frustrating at best, and frequently people are using one kind of UB to justify how all kinds should be handled. (Annoyingly, some of those people are committee members.)

tialaramex · on Aug 18, 2023

> because honestly I don't want to track down 25-year-old documents to figure out what's going on here

When it was written these weren't 25 year old. So this seems like a poor rationale. The answer is that they should just have written that it's ill-formed and they didn't. That's a completely understandable mistake, but it's telling that it wasn't fixed for so long people grew up, had kids, and the kids are writing the proposal to fix it. As to the idea of multiple "kinds" of UB, the standard defines this term exactly once.

There are a few things that I expect to see from Bjarne, Herb and WG21 generally that will mean they've finally figured out the true nature of the problem. When / if I see those things they may begin work to get C++ to where it'd need to be to stay relevant - not "relevant" the way COBOL is relevant, but relevant the way C++ still is in 2023. Meanwhile they're gliding, losing momentum.

Firstly, and the biggest hurdle, that the problem is Cultural. Yes Rust has some nicer technology, that's not enough, the technology supports a Culture, you could build C++'s culture with Rust's technology but that's worse, so, don't waste your time doing that.

Next though, most important of the technical insights and unwelcome if you spent your life on the C++ language, there are two choices of what to do about Rice's Theorem and C++ chose wrong, it will need to fix that, and the fix isn't cheap because it's a broad change to the entire language standard. If you have no stomach for that fix, it's likely actually better to announce that unsafety is your intent, and wrestle with the consequences as they are than to pretend you don't need the fix to get safety which is false.

What I mean here is, suppose I wrote a program which I say is safe, but the compiler can't see why it's safe. In Rust that's simple, the program doesn't compile. In C++ though the program compiles, and, if I'm correct, it's safe, but, if I'm wrong it has Undefined Behaviour (actually it's a bit worse, but that'll do in context). Henry Rice showed that we have no choice in these rich high level languages (which want non-trivial semantic properties of software), such programs will definitely exist, C++ allows this to happen a lot and Rust works hard to avoid that where possible, because in C++ the consequence is it compiles anyway and in Rust the consequence is it won't compile so that's undesirable.

lionkor · on Aug 18, 2023

> Did you try using the "single, complete and extensive spec" ? What for and how successful was that ?

Yes, I did. I used the ISO standard of the version of C++ I was using (17 or 20, dont remember) to look up how variables are initialized if they arent explicitly initialized, and it turns out the standard has a very clear definition of e.g. variables in a function which are of a class type have their default ctor called, or something like that.

So it was very successful. No idea what your UB rant is about.

Dylan16807 · on Aug 18, 2023

> https://knowyourmeme.com/memes/its-more-likely-than-you-thin...

I like that your post exposes a bug in HN's parsing. 61 character URL gets turned into 60 characters plus 3 periods.

lionkor · on Aug 19, 2023

The href is still correct in their comment, while in yours, since you copied it, its not

Dylan16807 · on Aug 19, 2023

The href is correct but the link shortener made it longer, so I call that a bug.

I can only quote the visible text because of how the site works. Maybe I should have put it in a code block so it wouldn't link.

iknowstuff · on Aug 17, 2023

Rust macros don't hide anything. They're hygienic and clearly annotated when used.

mike_hock · on Aug 17, 2023

Rust macros are a crutch to work around the language's shortcomings. It's just a better crutch than C's.

duped · on Aug 18, 2023

I'm seeing rust adoption accelerate largely because time is spent improving the language and implementation rather than bemoaning a spec.

> without a spec the answer becomes the language developers or the community asking, "What should the compiler do in this situation?".

No, they ask "what does the RFC say"

> If the software breaks and people die, the "we are making this up as we go along" approach is not acceptable because it has too much risk.

The spec does not define the software. The software is as the software does. Having or not having a spec doesn't protect from bugs - people do.

What you're taking about is covering one's ass, not specification.

thesuperbigfrog · on Aug 18, 2023

>> The spec does not define the software. The software is as the software does. Having or not having a spec doesn't protect from bugs - people do.

>> What you're taking about is covering one's ass, not specification.

They are related.

In safety-critical software, bugs can cause people to die. Without a spec, no one will use Rust for safety critical software. It would be too risky and no company would accept that level of risk.

For example if software that controls an airplane is written in Rust and an error occurs during flight, what happens? The software can't just panic and crash or the airplane might crash.

The Ferrocene project (https://ferrous-systems.com/ferrocene/) is working on producing a safety-critical Rust specification (https://github.com/ferrocene/specification) because having a language specification matters for safety-critical work.

duped · on Aug 18, 2023

How does a spec fix bugs?

AnimalMuppet · on Aug 18, 2023

It doesn't fix bugs (in the language/compiler). It documents exactly what the language is supposed to do, and therefore what the language user can count on. Without a spec, all you can count on is what you can experimentally determine that the language does (and even then, you can't be sure that it will do that in all situations).

This actually reduces bugs in applications, because it means that the app writers now know what the language will actually do, and can write their code accordingly. Without a spec, they will too often have a cargo cult understanding of what the language does, and so their code won't do what they intend it to.

If there's a spec, and the compiler/language doesn't do what the spec says, now you can definitively say that it's a bug in the language. That can still cause bugs in your app. But at least you can now definitively say that the language implementation is at fault, and demand that they fix it, and you can agree with the language authors on what "fixed" means.

duped · on Aug 18, 2023

So what you're saying is, a spec is no better than documentation and design documents.

It doesn't sound materially helpful or like it saves lives - it seems like a contrived requirement.

AnimalMuppet · on Aug 18, 2023

You do not have the right to put words in my mouth, or to claim that your twisted version is what I was saying.

A spec is more detailed and more precise than (other) documentation and design documents. ("Other" because the spec is itself part of the documentation, and one of the design documents.) For the safety-critical software itself, you would demand a full, formal spec, not just "documentation". (At least, if you wouldn't, then others would. and they are right to do so.)

But if you demand that for the software, doesn't it make at least some sense to ask it of the compiler? And even if you don't think it makes sense for the compiler, it seems reasonable that the standard libraries of the language should face the same requirements as the subroutines that are part of the safety-critical software.

iknowstuff · on Aug 17, 2023

>a shortcoming that needs to be addressed for it to see wider adoption, especially for safety-critical systems.

This seems like just a hunch of yours that does not seem to be reflected by the real world.

thesuperbigfrog · on Aug 17, 2023

>> This seems like just a hunch of yours that does not seem to be reflected by the real world.

What safety-critical systems are written in Rust?

Where can I buy a validated Rust toolchain for safety-critical work?

Ferrocene is an effort to build a safety-critical Rust, but it is not done yet:

https://ferrous-systems.com/blog/ferrocene-update/

mjw1007 · on Aug 17, 2023

The good news is that the Rust project has recently agreed to write a specification, and has a budget to hire an editor for it.

The less good news is that it's likely to take a long time before anything resembling a complete description gets written.

You can follow its status at https://github.com/rust-lang/rust/issues/113527

thesuperbigfrog · on Aug 17, 2023

>> The good news is that the Rust project has recently agreed to write a specification, and has a budget to hire an editor for it.

This is awesome to hear. Following that issue . . .

AnimalMuppet · on Aug 17, 2023

(2010)

dang · on Aug 17, 2023

Added. Thanks!