Cproc C Compiler

unwind · on Aug 20, 2021

Very cool! The code looked very readable and clear too, from the little I looked at (mainly expr.c).

One minor stylistic observation is that even though it's a C11 compiler, itself written in C11, it holds on to the very Old-Sk00l convention of declaring a function's variables at the top, instead of at point of use.

It's at least very consistent: even when the statement right after the declaration is an assignment to that variable, the variable is still declared separately above. Example (again, from expr.c):

    static struct expr *
    mkexpr(enum exprkind k, struct type *t)
    {
      struct expr *e;
      e = xmalloc(sizeof(*e));

Really glad [1] it doesn't cast the return value of the alloc function, at least. :) I would write the body code as

    {
      struct expr * const e = xmalloc(sizeof *e);

The 'const' there makes it impossible to re-assign e, thus leaking the memory, and is IMO a good thing.
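A minimal sketch of the two styles side by side. The enum values, the struct fields and this `xmalloc` are stand-ins for illustration only, not cproc's actual definitions; the assumption is simply that `xmalloc` aborts instead of returning NULL.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; cproc's real definitions differ. */
enum exprkind { EXPRCONST, EXPRIDENT };
struct expr { enum exprkind kind; int value; };

/* Assumed behaviour: a malloc wrapper that aborts on failure. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (!p)
        abort();
    return p;
}

/* Declaration at the top, assignment as a separate statement. */
static struct expr *mkexpr_top(enum exprkind k)
{
    struct expr *e;

    e = xmalloc(sizeof(*e));
    e->kind = k;
    e->value = 0;
    return e;
}

/* Declaration at point of use; const stops e from being re-pointed. */
static struct expr *mkexpr_const(enum exprkind k)
{
    struct expr *const e = xmalloc(sizeof *e);

    e->kind = k;        /* the pointee is still writable */
    e->value = 0;
    /* e = NULL;           would not compile: e itself is const */
    return e;
}
```

Both functions return the same thing; the only difference is whether the compiler rejects a later reassignment of `e`.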

NOTE: I realize the separate declarations are a stylistic choice and I'm not being critical, I just think it's fun how enduring some theoretically outdated stylistic choices are in C.

[1] https://stackoverflow.com/questions/605845/do-i-cast-the-res...

adrian_b · on Aug 20, 2021

The convention of declaring all the variables at the top originates in Fortran, so it is indeed very old school.

When writing a program, I find it more convenient to declare the variables at the first point of use.

Nevertheless, when reading a program written by someone else, I strongly prefer to see all the variable declarations together.

It might be that this preference comes from the fact that I mainly use C when writing programs for resource-constrained embedded environments, and in such applications it is valuable to be able to estimate a function's memory requirements at first glance.

jes · on Aug 20, 2021

I don’t write in C very often anymore. But when I last did, I found myself establishing local scopes inside functions so that variables I introduced went out of scope when I was done with them.
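A small sketch of what that can look like, using a made-up `mean` helper: each bare block limits a temporary to the one step that needs it.

```c
#include <stddef.h>

/* Hypothetical example: bare blocks keep each temporary local to its step. */
static double mean(const int *v, size_t n)
{
    long total = 0;

    {
        size_t i;                        /* only needed for the summing loop */
        for (i = 0; i < n; i++)
            total += v[i];
    }                                    /* i is out of scope from here on */

    if (n == 0)
        return 0.0;

    {
        const double count = (double)n;  /* only needed for the division */
        return (double)total / count;
    }
}
```

Any attempt to use `i` or `count` outside its block is a compile error rather than a silent reuse.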

I’m also an oddball, so there’s that.

Someone · on Aug 20, 2021

With non-trivial functions and modern compilers, that gives you an upper bound that may differ significantly from the actual stack usage.

There is no requirement for the compiler to keep the memory of all locals distinct.

On more powerful CPUs, chances are also that a large fraction of the locals never make it onto the stack, but stay in registers.

Having said that, compilers for embedded CPUs used to be fairly bad at optimizations (wouldn’t know if that has changed in the past decades).

veltas · on Aug 20, 2021

It's not just an Old-Sk00l convention, but the main convention for C programming, alongside avoiding C++ comments. It's cultural and I think C programmers are now trained to worry when you don't write code like this because they associate it with the C code that newer C++ programmers write when they're forced to write some C code or try to convert, inevitably with a lot of unnecessary paranoid cruft and C++-inspired misconceptions.

So in the purely practical sense of trying to write code that doesn't 'smell' or send off the wrong signals it's worth sticking to the conventions. And there's no problem with C++ having its own conventions as well. They are two different languages with diverged syntax and semantics.

Unlike the culture of C++ that has gotten a lot of benefits from the progression of the standard, in C the standard is not that interesting, does not change as much, and unfortunately compiler support was extremely slow (especially with C99, the first revision following the initial standard). In fact, many C99 features became optional in C11, and the attitude of most C programmers has been "use what you need" with new features, perhaps to make it easier to port to the exotic and outdated compilers that are more frequently found in the major application of most modern C code: embedded software.

flohofwoe · on Aug 20, 2021

Hmm, I don't agree here. Declaring variables close to where they are used is objectively a good practice. "Pure" C89 hasn't really been relevant for more than two decades, and teaching new C programmers such old habits would be actively harmful (sticking to old and outdated C89 features is actually a problem that many C tutorials still have, along with introducing malloc too early and for all the wrong reasons).

veltas · on Aug 20, 2021

I don't see how it is "actively harmful" to use this style.

I'll defend the style and say I think either is fine from an objective standpoint. The advantage of declaring near the top and leaving uninitialised is that all assignments/values are separate and slightly clearer in code; type annotations are clutter. But likewise it's not that much clutter and you can get used to it (and it's conventional in many other languages).

Only in very large functions does it significantly 'hide' the declarations and the type information from where they're used, and so arguably that encourages you to not write very large functions with lots of different variables! But don't pick on this point and say "the limitation becomes a 'feature'", I'm just saying with most good code this doesn't matter, not that it's 'good' that it's harder to write long functions, yes that is a limitation, I just don't think it's a significant one for good code.

So I don't see how either is objectively superior. I recommend this style because since there is not a huge difference between the two, I prefer what is conventional.

jcelerier · on Aug 20, 2021

It's very much not OK: it encourages reusing a single variable for multiple uses (think loop index for two successive loops). I can't count the number of bugs I've seen because of that, of someone forgetting that there was already a 'ret' or 'i' variable defined at the top and reusing it directly when it should have been cleared... Minimizing the scope of variables, and introducing sub-scopes with braces, makes this problem almost entirely disappear.
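One hypothetical instance of the stale-`ret` pattern, next to a scoped variant. The `check_*` helpers and both `process_*` functions are invented for illustration; they return 0 on success and -1 on failure.

```c
/* Hypothetical validators: 0 on success, -1 on failure. */
static int check_positive(int x) { return x > 0 ? 0 : -1; }
static int check_even(int x)     { return x % 2 == 0 ? 0 : -1; }

/* ret is declared once at the top and shared by both steps. */
static int process_shared(int a, int b)
{
    int ret;

    ret = check_positive(a);
    if (ret != 0)
        return ret;

    if (check_even(b) != 0)
        return ret;     /* BUG: ret still holds 0 from the first step */

    return 0;
}

/* Each step gets its own result variable in its own block. */
static int process_scoped(int a, int b)
{
    {
        const int ret = check_positive(a);
        if (ret != 0)
            return ret;
    }
    {
        const int ret = check_even(b);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

With `a = 1, b = 3` the shared version reports success even though the second check failed, while the scoped version returns -1 because the first step's result is simply not reachable any more.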

veltas · on Aug 20, 2021

In my own experience writing and reviewing professional or wild code, a variable called 'i' that gets used for different loops in a function is pretty harmless, if it's reused in a consistent manner, such as only being used as an iterator for top-level loops. Is that re-use with older style declarations really such a sin?

And your rule for safer variables with minimal scoping requires the programmer to understand the lifetime and usage of different variables, and appreciate why it's bad to re-use things in a confusing way across a function. But, if they understand that then surely they can already write good code: with or without minimal scoping? What's the difference?

Yes, if they attempt to use a variable with one name and in a different block where it means something else then they might get a syntax error (or just visually notice it's out of scope). Why do you assume this error will help them correctly determine whether the same-name-variable is unrelated or just happens to start with more nested scope than it should finish? Or, rather, why do you think the scoping specifically is what will help them get this right? Maybe just drawing their attention to the need to give different names to disambiguate, or to keep better attention to the meaning of a variable throughout a function, is what's helping?

jcelerier · on Aug 20, 2021

> And your rule for safer variables with minimal scoping requires the programmer to understand the lifetime and usage of different variables, and appreciate why it's bad to re-use things in a confusing way across a function.

No, there is nothing to understand: the rule to minimize a variable's scope is mechanical.

> But, if they understand that then surely they can already write good code: with or without minimal scoping? What's the difference?

No. It's impossible to assume that even the very best programmer writes good code consistently, thus every possibility for failure must be automatized or mechanized.

veltas · on Aug 21, 2021

> No, there is nothing to understand: the rule to minimize a variable's scope is mechanical.

I tried to explain my answer to this in my third paragraph.

> It's impossible to assume that even the very best programmer writes good code consistently, thus every possibility for failure must be automatized or mechanized.

This is a strawman; if I thought the feature was useful to prevent this I would give it more consideration. I have already attempted to explain why I don't think conservative scoping helps against bad/confusing variable re-use.

Twisol · on Aug 20, 2021

Hey, you write `const` on the right too! There must be, like, dozens of us!

Is there a style guide recommending this, or have I just managed to dodge every example that puts it on the right?

Vogtinator · on Aug 20, 2021

    int * const foo = baz;

and

    int const * bar = baz;

have different meanings.

foo is a constant pointer (you can't change the address) while bar is a pointer to a constant int (you can't change the value it's pointing at).

However,

  int const * bar;

is the same as

  const int * bar;
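A compilable sketch of those three declarations, with the rejected operations left as comments. The function name and the variables `a` and `b` are made up for the example.

```c
static int demo_const_placement(void)
{
    int a = 1, b = 2;

    int *const foo = &a;     /* constant pointer to int */
    *foo = 10;               /* OK: the int is writable through foo */
    /* foo = &b;                error: foo itself is const */

    int const *bar = &a;     /* pointer to constant int */
    bar = &b;                /* OK: bar can be re-pointed */
    /* *bar = 20;               error: the int is read-only through bar */

    const int *baz = bar;    /* same type as bar, const written on the left */

    return a + *baz;         /* a is 10, *baz is b, which is 2 */
}
```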

tialaramex · on Aug 20, 2021

In C++ this is called East Const vs West Const with, as someone explains below, East Const (const on the right) being the only one that actually does something consistent and explicable.

Twisol · on Aug 20, 2021

Thanks for the nomenclature! That definitely helps find more discussion around the two preferences, like [0].

(FWIW, I'm aware of the semantic differences; I've just never come across that much material referencing it, much less acknowledging the schism.)

[0] https://hackingcpp.com/cpp/design/east_vs_west_const.html

SAI_Peregrinus · on Aug 20, 2021

Jens Gustedt's book [Modern C](https://modernc.gforge.inria.fr) recommends this. I follow it as well.

enqk · on Aug 20, 2021

look up eastside const

https://mariusbancila.ro/blog/2018/11/23/join-the-east-const...

I’ve been using this style at home for a long time now, I appreciate how it’s more at home with C and more consistent

mananaysiempre · on Aug 20, 2021

It may look more consistent, but it misleads the reader about the (inconsistent) way the C declaration syntax works: in

  int *const p, *q;

p is constant but q is not, while in either of

  const int i, j;
  int const i, j;

both i and j are constant, and

  int i, const j;

is a syntax error. (This is still true but less relevant in C++, which made such an unintuitive mess of the declaration syntax that now the usual advice is to avoid declaring more than one variable per declaration at all.)
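The same point as a compilable sketch, with the lines that would be rejected left as comments. The function and the variables `x` and `y` are invented for the example.

```c
static int demo_declarators(void)
{
    int x = 1, y = 2;

    int *const p = &x, *q = &x;  /* const belongs to p's declarator only */
    q = &y;                      /* OK: q is a plain int * */
    /* p = &y;                      error: p is const */

    const int i = 3, j = 4;      /* const is a specifier here: covers i and j */
    /* j = 5;                       error: j is const as well */

    return *p + *q + i + j;      /* 1 + 2 + 3 + 4 */
}
```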

tialaramex · on Aug 20, 2021

I don't see an inconsistency? The whole point of East Const is that the type to its left is constant. Clearly the pointer p is const and the pointer q isn't, because there's no way that asterisk next to q is affected by a const over on the far side of a comma.

mananaysiempre · on Aug 21, 2021

Hm. I can see where you’re coming from, but I don’t think pretending the C declaration syntax is postfix or even involves a syntactic entity worthy of being called a “type” is particularly helpful (abstract declarators anyone? now Standard ML does in fact have a mostly-postfix type sublanguage). For example, what is the “type to the left” that the const in the constant function pointer declaration

  int *(*const f)(void);

is supposed to apply to?

tialaramex · on Aug 21, 2021

Maybe you have higher standards for what constitutes a "type" than were prevalent in the 1970s. Obviously C isn't trying to be SML here but those are definitely types.

The type of f is: constant pointer to a function that takes no parameters but returns a pointer to an integer.

The thing to the const keyword's left (thus, the thing we know is constant because East Const is consistent) is an asterisk, indicating a pointer type, inside parentheses, a function, so that's a function pointer, and we can use the rules for figuring out the function declaration to determine what type of function it is.

Reading types this way has to be how you do it, all the time in C, const isn't in fact a weird exception, except in the sense that obviously constant ought to be the default, but it's too late for that in C.
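A small sketch of that declaration in use. `answer` and `get_answer` are hypothetical names added only so that `f` has something to point at.

```c
static int answer = 42;

/* A function taking no parameters and returning a pointer to int. */
static int *get_answer(void) { return &answer; }

static int demo_const_fnptr(void)
{
    /* f: constant pointer to a function (void) returning pointer to int */
    int *(*const f)(void) = get_answer;

    /* f = 0;        error: f itself is const */
    *f() = 7;     /* OK: the int that the result points at is not const */
    return *f();
}
```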

andrewchambers · on Aug 20, 2021

I honestly don't mind going without C const (compared to C++ where it is more useful).

bregma · on Aug 20, 2021

"const" in C is like underwear. Some people are more comfortable without it but it serves a real and practical purpose and there will be times when you really appreciate its appropriate use.

veltas · on Aug 20, 2021

I used to use const everywhere and then one day I realised it clutters code and it's never caught a bug for me. I don't personally find it worth the extra time.

(Of course, I use const with pointers on interfaces for read-only referencing).

coliveira · on Aug 20, 2021

Agreed. I never use const if not required (unfortunately C++ forces you to use it). It is ugly and you're doing work that should be done by the compiler. Moreover, like everything in C++, it doesn't need to be enforced, so it is mostly a moot point.

mhh__ · on Aug 20, 2021

The "don't cast the result of malloc" chanting always seems like C programmers getting envious of C++ legalese to me. Who cares?

unwind · on Aug 20, 2021

I (quite clearly) care quite a lot. :) Of course that doesn't matter, but I do think avoiding pointless casts makes the code better. Fewer tokens to process and read, fewer more or less hidden assumptions on what is going on makes the code more clear and concise.

I can't understand why people are in favor of writing more code for no reason, especially casts which are sometimes mentioned as one of those things that make C annoying/dangerous/hard/bad.

If it's like "I feel it makes it more clear that the pointer is being converted from scary void pointer to cozy foo pointer", then I would argue that's exactly what a line of code like

    foo * const my_foo = malloc(sizeof *my_foo);

is saying, adding a cast on the right hand side of the assignment doesn't make that more clear. To me, it makes the code much more anxious-feeling, as if the programmer doesn't know what will happen without it, and just adds one "for good measure", which I consider a code smell.

GrumpySloth · on Aug 20, 2021

One of the things that make C casts annoying/dangerous/hard/bad is that they are often implicit. If I didn't know malloc, then I wouldn't know that in your line of code there is a cast. And casts between pointers of different types can sometimes cause UB... There is a cast, whether you write it or not; it's just implicit, and that's bug-prone.

unwind · on Aug 20, 2021

Okay, I see what you mean.

One obvious point here is of course that if you don't know malloc() then you don't know C. Considering C is a pretty small language, and that malloc() isn't exactly at the furthest, most dusty and least-trodden part of its standard library, I think this is basic knowledge for anyone writing C.

That said, I'm not sure I agree since implicit pointer conversions are not in general allowed, except to/from void * and then there will always be a properly typed thing close by (like the left-hand side in this example).

Can you provide an example of where an implicit pointer conversion creates danger?

jlokier · on Aug 20, 2021

> Can you provide an example of where an implicit pointer conversion creates danger?

In your malloc example, there's no real danger because you used sizeof. But when it is written with a type like this:

  bar = malloc(sizeof(BarType));

Then there's danger if the type of bar does not match BarType *, perhaps after changes.

Writing it like this:

  bar = (BarType *)malloc(sizeof(BarType));

ensures the type mismatch will cause a compile-time error instead of UB.

That's not a strong argument, obviously, as you can use sizeof(*bar) instead. But it is an example of a situation where an implicit pointer conversion creates danger.

(I would also argue that a malloc(sizeof(type)) is more idiomatic and familiar C, so using it with other people will raise fewer eyebrows even if it's not strictly the safer choice.)
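A sketch of that scenario, assuming `bar` was changed from `BarType *` to a larger, made-up `BigBar *` while the allocation was left alone. Both struct layouts are invented for the example.

```c
#include <stdlib.h>

typedef struct { int id; } BarType;
typedef struct { int id; double samples[16]; } BigBar;  /* bar's type after the change */

static BigBar *alloc_stale(void)
{
    /* void * converts implicitly, so the language requires no diagnostic
       here, yet only sizeof(BarType) bytes are allocated for a BigBar. */
    BigBar *bar = malloc(sizeof(BarType));
    return bar;
}

static BigBar *alloc_by_expr(void)
{
    /* BigBar *bar = (BarType *)malloc(sizeof(BarType));
       would be diagnosed: BarType * does not convert to BigBar * implicitly. */
    BigBar *bar = malloc(sizeof *bar);   /* size follows bar's own type */
    return bar;
}
```

Using the pointer from `alloc_stale` as a full `BigBar` would write past the allocation; `sizeof *bar` avoids the mismatch without needing a cast.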

So when would it occur without sizeof(*bar) being an option? When there's a type-erased container:

  foo = list_get(FooType, foo_list, index);
  bar = hash_get(BarType, bar_hash_table, key);

If those are generic containers whose accessors return void *, you will be in trouble with UB here if the type of *foo or *bar don't match FooType and BarType respectively, perhaps after changes.

When you add the cast, again the type mismatch will cause a compile-time error.

In this case, there's no need to write the cast at the call site. Since list_get and hash_get must be macros, it's better to put the cast inside those macros. But the cast should exist, to catch the type mismatch, rather than those returning void * and relying on implicit conversion rules.
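A minimal sketch of such a macro. The `struct list` layout, `list_get_raw` and `first_foo_n` are hypothetical; only the `list_get(Type, list, index)` call shape comes from the example above.

```c
#include <stddef.h>

/* Hypothetical type-erased list: it only stores void * items. */
struct list { void **items; size_t len; };

static void *list_get_raw(const struct list *l, size_t i)
{
    return i < l->len ? l->items[i] : NULL;
}

/* The cast lives inside the macro, so the result has type Type * and
   assigning it to an unrelated pointer type gets diagnosed. */
#define list_get(Type, l, i) ((Type *)list_get_raw((l), (i)))

typedef struct { int n; } FooType;

static int first_foo_n(const struct list *foo_list)
{
    FooType *foo = list_get(FooType, foo_list, 0);
    /* double *d = list_get(FooType, foo_list, 0);   incompatible pointer types */
    return foo ? foo->n : -1;
}
```

The macro cannot verify that the stored item really is a `FooType`; it only makes the call site's claimed type visible to the compiler.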

(If you want to get fancy, there's a way to avoid having to pass the container item-type to those macros as an argument, but I've never seen it used in practice.)

foxfluff · on Aug 20, 2021

Even if you don't know malloc, you know that it returns void * or foo *. I think most C programmers consider implicit conversion between void * and other object pointers a feature. And yes, it's called an implicit conversion, not a cast -- cast is explicit.

jjnoakes · on Aug 20, 2021

The opposite is more dangerous. If you include the explicit cast and for some reason malloc was not declared (you forgot an include for example) then the implicit int return type that the compiler assumed will cause UB at runtime after you convert the result of malloc to a pointer of a different size.

flohofwoe · on Aug 20, 2021

malloc() returns a void*, that's enough of a hint that the result can be assigned to any other pointer type without requiring a cast.

Nowadays even C compilers warn when trying to assign "incompatible" pointer types.

turminal · on Aug 20, 2021

Note that cproc uses another small but very interesting project as the backend, QBE[1]. It currently has support for x86-64 and arm64, and a risc-v port is in the making.

[1]: http://c9x.me/compile/

muth02446 · on Aug 20, 2021

https://github.com/robertmuth/Cwerg is similar in nature; it supports arm64 and arm32 and can directly generate ELF executables.

turminal · on Aug 20, 2021

Very interesting, thanks for sharing.

xvilka · on Aug 20, 2021

Would be nice to support it in Meson. Currently, there is work-in-progress TinyCC support [1]. But the work is hindered by the lack of recent TCC releases - 0.9.27 is too old to be supported.

[1] https://github.com/mesonbuild/meson/pull/8248

potus_kushner · on Aug 21, 2021

the approach of meson to hardcode the properties (including the name of the main executable!) of each and every compiler in the universe seems pretty short-sighted if not outright dumb and isn't quite how the designers of UNIX, the C language and compiler toolchains envisioned a build system to work.

if meson was properly designed, a user could just run `CC=whatevercc meson ...` and it would work with basically each and every single compiler available on *NIX and every POSIX compatible environment, including gcc, clang, tcc, cproc and even within cygwin. the only compiler i know that works differently is MSVC and it would be easier to just provide a wrapper that transforms POSIX-compatible compile command lines into ones that MSVC understands.

(this reminds me of the head-desk approach CMake takes to adding library dependencies: each library has a hardcoded configuration file, which can go as far as scanning your /usr/lib for libfoo.so rather than just assuming "-lfoo works, and if it doesn't, use $PKG_CONFIG --libs foo"...)

the author of cproc actually filed an issue report in 2019 pointing out this very fallacy, which is still open: https://github.com/mesonbuild/meson/issues/5406

arthur2e5 · on Aug 20, 2021

Cproc+QBE doesn’t seem to be trying to do the linker job itself, so presumably library order (apparently a main point in that PR) does not need to be changed?

outsomnia · on Aug 20, 2021

Probably makes sense to integrate a preprocessor before doing anything else, even somebody else's to get started.

Because that is pretty off-putting for casually trying it on an existing codebase to see what happens.

andrewchambers · on Aug 20, 2021

There is one, see pp.c.

celest1 · on Aug 20, 2021

Sorry for the stupid question, but what is the benefit of building a C compiler nowadays, and what is interesting about this particular one?

veltas · on Aug 20, 2021

Your C compiler is probably a very large C++ program written over many years and with millions of dollars worth of time poured into it. The C compiler I use takes up hundreds of megabytes of space in a minimal build.

So in my opinion it's interesting if someone can write something original with 80% of the useful functionality on their own, or achieve it in a different way. And their code might be more approachable/educational to people than the behemoth.

andrewchambers · on Aug 20, 2021

The benefit is mainly that it is interesting and easy to learn/experiment because it has a very small and tidy implementation.

Just an example, building cproc from source takes a few seconds - building gcc from source takes like 15 minutes at least.

jjice · on Aug 20, 2021

> very small and tidy implementation

Especially impressive for implementing the C11 spec and some GNU extensions.