Cproc C Compiler (github.com/michaelforney)
130 points by oenetan on Aug 20, 2021 | 45 comments


Very cool! The code looked very readable and clear too, from the little I looked at (mainly expr.c).

One minor stylistic observation is that even though it's a C11 compiler, itself written in C11, it holds on to the very Old-Sk00l convention of declaring a function's variables at the top, instead of at point of use.

It's at least very consistent: even when the statement right after the declaration is an assignment to that variable, the variable is still declared separately above. Example (again, from expr.c):

    static struct expr *
    mkexpr(enum exprkind k, struct type *t)
    {
      struct expr *e;
      e = xmalloc(sizeof(*e));
Really glad [1] it doesn't cast the return value of the alloc function, at least. :) I would write the body code as

    {
      struct expr * const e = xmalloc(sizeof *e);
The 'const' there makes it impossible to re-assign e (and thereby leak the memory), which IMO is a good thing.
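
To make the point concrete, here is a minimal, self-contained sketch of the same pattern. The names `xmalloc_sketch`, `struct node`, and `mknode` are hypothetical stand-ins, not cproc's actual code; only the shape of the declaration mirrors the suggestion above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an xmalloc-style helper: exit instead of returning NULL. */
static void *
xmalloc_sketch(size_t len)
{
    void *buf;

    assert(len > 0);
    buf = malloc(len);
    if (!buf) {
        perror("malloc");
        exit(1);
    }
    return buf;
}

/* Hypothetical node type, unrelated to cproc's struct expr. */
struct node {
    int value;
};

static struct node *
mknode(int value)
{
    /* Declared at the point of use; the pointer itself cannot be reassigned. */
    struct node *const n = xmalloc_sketch(sizeof *n);

    n->value = value;
    /* n = NULL;   would not compile: n is const-qualified */
    return n;
}
```

The caller still owns the result and has to free() it; the const only protects the local pointer inside mknode.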

NOTE: I realize the separate declarations are a stylistic choice and I'm not being critical, I just think it's fun how enduring some theoretically outdated stylistic choices are in C.

[1] https://stackoverflow.com/questions/605845/do-i-cast-the-res...


The convention of declaring all the variables at the top originates in Fortran, so it is indeed very old school.

When writing a program, I find it more convenient to declare the variables at the first point of use.

Nevertheless, when reading a program written by someone else, I strongly prefer to see all the variable declarations together.

It might be that this preference comes from the fact that I mainly use C when writing programs for resource-constrained embedded environments, and in such applications it is valuable to be able to estimate at first glance what a function's memory requirements are.


I don’t write in C very often anymore. But when I last did, I found myself establishing local scopes inside functions so that variables I introduced went out of scope when I was done with them.

I’m also an oddball, so there’s that.


With non-trivial functions and modern compilers, that gives you an upper bound that may differ significantly from the actual stack usage.

There is no requirement for the compiler to keep the memory of all locals distinct.

On more powerful CPUs, chances also are a large fraction of the locals never makes it onto the stack, but stays in registers.

Having said that, compilers for embedded CPUs used to be fairly bad at optimizations (I wouldn’t know if that has changed in the past decades).


It's not just an Old-Sk00l convention, but the main convention for C programming, alongside avoiding C++ comments. It's cultural and I think C programmers are now trained to worry when you don't write code like this because they associate it with the C code that newer C++ programmers write when they're forced to write some C code or try to convert, inevitably with a lot of unnecessary paranoid cruft and C++-inspired misconceptions.

So in the purely practical sense of trying to write code that doesn't 'smell' or send off the wrong signals it's worth sticking to the conventions. And there's no problem with C++ having its own conventions as well. They are two different languages with diverged syntax and semantics.

Unlike the culture of C++ that has gotten a lot of benefits from the progression of the standard, in C the standard is not that interesting, does not change as much, and unfortunately compiler support was extremely slow (especially with C99, the first revision following the initial standard). In fact, many C99 features became optional in C11, and the attitude of most C programmers has been "use what you need" with new features, perhaps to make it easier to port to the exotic and outdated compilers that are more frequently found in the major application of most modern C code: embedded software.


Hmm, I don't agree here. Declaring variables close to where they are used is objectively a good practice. "Pure" C89 hasn't really been relevant for more than two decades, and teaching new C programmers such old habits would be actively harmful (sticking to old and outdated C89 features is actually a problem that many C tutorials still have, along with introducing malloc too early and for all the wrong reasons).


I don't see how it is "actively harmful" to use this style.

I'll defend the style and say I think either is fine from an objective standpoint, the advantage of declaring near the top and leaving uninitialised is that it means all assignments/values are separate and slightly clearer in code, type annotations are clutter. But likewise it's not that much clutter and you can get used to it (and it's conventional in many other languages).

Only in very large functions does it significantly 'hide' the declarations and the type information from where they're used, and so arguably that encourages you to not write very large functions with lots of different variables! But don't pick on this point and say "the limitation becomes a 'feature'", I'm just saying with most good code this doesn't matter, not that it's 'good' that it's harder to write long functions, yes that is a limitation, I just don't think it's a significant one for good code.

So I don't see how either is objectively superior. I recommend this style because since there is not a huge difference between the two, I prefer what is conventional.


It's very much not OK: it encourages reusing a single variable for multiple purposes (think of one loop index for two successive loops). I can't count the number of bugs I've seen because of that, from someone forgetting that there was already a 'ret' or 'i' variable defined at the top and reusing it directly when it should have been reset... Minimizing the scope of variables and introducing sub-scopes with braces makes this problem almost entirely disappear.
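
A small sketch of what minimal scoping looks like in C99 and later. The function `sum_both` and its parameters are made up for illustration; the point is only that each `i` and the temporary `partial` stop existing as soon as their block ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums two arrays, one loop each. */
static long
sum_both(const int *a, size_t na, const int *b, size_t nb)
{
    long total = 0;

    assert(a != NULL || na == 0);
    assert(b != NULL || nb == 0);

    for (size_t i = 0; i < na; i++)   /* this i exists only inside the loop */
        total += a[i];

    {
        /* explicit sub-scope: partial is invisible after the closing brace */
        long partial = 0;

        for (size_t i = 0; i < nb; i++)   /* a fresh, unrelated i */
            partial += b[i];
        total += partial;
    }

    /* Referring to i or partial here would be a compile-time error. */
    return total;
}
```

With a single `i` declared at the top, both loops would compile either way; here a stale value simply cannot leak from one loop into the next.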


In my own experience writing and reviewing professional or wild code, a variable called 'i' that gets used for different loops in a function is pretty harmless, if it's reused in a consistent manner, such as only being used as an iterator for top-level loops. Is that re-use with older style declarations really such a sin?

And your rule for safer variables with minimal scoping requires the programmer to understand the lifetime and usage of different variables, and appreciate why it's bad to re-use things in a confusing way across a function. But, if they understand that then surely they can already write good code: with or without minimal scoping? What's the difference?

Yes, if they attempt to use a variable with one name and in a different block where it means something else then they might get a syntax error (or just visually notice it's out of scope). Why do you assume this error will help them correctly determine whether the same-name-variable is unrelated or just happens to start with more nested scope than it should finish? Or, rather, why do you think the scoping specifically is what will help them get this right? Maybe just drawing their attention to the need to give different names to disambiguate, or to keep better attention to the meaning of a variable throughout a function, is what's helping?


> And your rule for safer variables with minimal scoping requires the programmer to understand the lifetime and usage of different variables, and appreciate why it's bad to re-use things in a confusing way across a function.

No, there is nothing to understand: the rule to minimize a variable's scope is mechanical.

> But, if they understand that then surely they can already write good code: with or without minimal scoping? What's the difference?

No. It's impossible to assume that even the very best programmer writes good code consistently, thus every possibility for failure must be automatized or mechanized.


> No, there is nothing to understand: the rule to minimize a variable's scope is mechanical.

I tried to explain my answer to this in my third paragraph.

> It's impossible to assume that even the very best programmer writes good code consistently, thus every possibility for failure must be automatized or mechanized.

This is a strawman, if I thought the feature was useful to prevent this I would give it more consideration. I have already attempted to explain why I don't think conservative scoping aids bad/confusing variable re-use.


Hey, you write `const` on the right too! There must be, like, dozens of us!

Is there a style guide recommending this, or have I just managed to dodge every example that puts it on the right?


    int * const foo = baz;
and

    int const * bar = baz;
have different meaning.

foo is a constant pointer (you can't change the address) while bar is a pointer to a constant int (you can't change the value it's pointing at).

However,

  int const * bar;
is the same as

  const int * bar;
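
For anyone who wants to check this against a compiler, a minimal sketch follows. The function name `const_placement_demo` is made up; the commented-out lines are the ones a conforming compiler should reject.

```c
#include <assert.h>

static int
const_placement_demo(void)
{
    int a = 1, b = 2;
    int *const foo = &a;    /* constant pointer to a modifiable int */
    int const *bar = &a;    /* modifiable pointer to a constant int */
    const int *baz = &b;    /* same type as bar, const written on the left */

    *foo = 10;   /* fine: the pointee is not const */
    bar = baz;   /* fine: bar itself is not const, and the types match */
    assert(bar == &b);

    /* foo = &b;   would not compile: foo is const */
    /* *bar = 3;   would not compile: *bar is const */

    return *foo + *bar;   /* a is now 10 and bar points at b */
}
```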


In C++ this is called East Const vs West Const with, as someone explains below, East Const (const on the right) being the only one that actually does something consistent and explicable.


Thanks for the nomenclature! That definitely helps find more discussion around the two preferences, like [0].

(FWIW, I'm aware of the semantic differences; I've just never come across that much material referencing it, much less acknowledging the schism.)

[0] https://hackingcpp.com/cpp/design/east_vs_west_const.html


Jens Gustedt's book [Modern C](https://modernc.gforge.inria.fr) recommends this. I follow it as well.


look up eastside const

https://mariusbancila.ro/blog/2018/11/23/join-the-east-const...

I’ve been using this style at home for a long time now; I appreciate how it’s more at home with C and more consistent.


It may look more consistent, but it misleads the reader about the (inconsistent) way the C declaration syntax works: in

  int *const p, *q;
p is constant but q is not, while in either of

  const int i, j;
  int const i, j;
both i and j are constant, and

  int i, const j;
is a syntax error. (This is still true but less relevant in C++, which made such an unintuitive mess of the declaration syntax that now the usual advice is to avoid declaring more than one variable per declaration at all.)


I don't see an inconsistency? The whole point of East Const is that the type to its left is constant. Clearly the pointer p is const and the pointer q isn't, because there's no way that asterisk next to q is affected by a const over on the far side of a comma.


Hm. I can see where you’re coming from, but I don’t think pretending the C declaration syntax is postfix or even involves a syntactic entity worthy of being called a “type” is particularly helpful (abstract declarators anyone? now Standard ML does in fact have a mostly-postfix type sublanguage). For example, what is the “type to the left” that the const in the constant function pointer declaration

  int *(*const f)(void);
is supposed to apply to?


Maybe you have higher standards for what constitutes a "type" than were prevalent in the 1970s. Obviously C isn't trying to be SML here but those are definitely types.

The type of f is: constant pointer to a function that takes no parameters but returns a pointer to an integer.

The thing to the const keyword's left (thus, the thing we know is constant because East Const is consistent) is an asterisk, indicating a pointer type, inside parentheses, a function, so that's a function pointer, and we can use the rules for figuring out the function declaration to determine what type of function it is.

Reading types this way has to be how you do it, all the time in C, const isn't in fact a weird exception, except in the sense that obviously constant ought to be the default, but it's too late for that in C.
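
As a rough illustration of that reading, here is a sketch that uses the same declarator. The names `slot`, `next_slot`, and `call_through_const_pointer` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

static int slot = 41;

/* Takes no parameters, returns a pointer to int. */
static int *
next_slot(void)
{
    slot++;
    return &slot;
}

static int
call_through_const_pointer(void)
{
    /* const sits to the right of the *, so the pointer f itself is constant */
    int *(*const f)(void) = next_slot;
    int *p;

    /* f = NULL;   would not compile: f is const */
    p = f();       /* calling through f is fine */
    assert(p != NULL);
    return *p;     /* the returned int pointer is not const at all */
}
```

Only f is read-only here; neither the function's return value nor the int it points to is affected by the qualifier.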


I honestly don't mind going without C const (compared to C++ where it is more useful).


"const" in C is like underwear. Some people are more comfortable without it but it serves a real and practical purpose and there will be times when you really appreciate its appropriate use.


I used to use const everywhere and then one day I realised it clutters code and it's never caught a bug for me. I don't personally find it worth the extra time.

(Of course, I use const with pointers on interfaces for read-only referencing).


Agreed. I never use const if not required (unfortunately C++ forces you to use it). It is ugly and you're doing work that should be done by the compiler. Moreover, like everything in C++, it doesn't need to be enforced, so it is mostly a moot point.


The "don't cast the result of malloc" chanting always seems to me like C programmers getting envious of C++ legalese. Who cares?


I (quite clearly) care quite a lot. :) Of course that doesn't matter, but I do think avoiding pointless casts makes the code better. Fewer tokens to process and read, fewer more or less hidden assumptions on what is going on makes the code more clear and concise.

I can't understand why people are in favor of writing more code for no reason, especially casts which are sometimes mentioned as one of those things that make C annoying/dangerous/hard/bad.

If it's like "I feel it makes it more clear that the pointer is being converted from scary void pointer to cozy foo pointer", then I would argue that's exactly what a line of code like

    foo * const my_foo = malloc(sizeof *my_foo);
is saying, adding a cast on the right hand side of the assignment doesn't make that more clear. To me, it makes the code much more anxious-feeling, as if the programmer doesn't know what will happen without it, and just adds one "for good measure", which I consider a code smell.


One of the things that make C casts annoying/dangerous/hard/bad is that they are often implicit. If I didn't know malloc, then I wouldn't know that in your line of code there is a cast. And casts between pointers of different types can sometimes cause UB... There is a cast, whether you write it or not; it's just implicit, and that's bug-prone.


Okay, I see what you mean.

One obvious point here is of course that if you don't know malloc() then you don't know C. Considering C is a pretty small language, and that malloc() isn't exactly at the furthest, most dusty and least-trodden part of its standard library, I think this is basic knowledge for anyone writing C.

That said, I'm not sure I agree since implicit pointer conversions are not in general allowed, except to/from void * and then there will always be a properly typed thing close by (like the left-hand side in this example).

Can you provide an example of where an implicit pointer conversion creates danger?


> Can you provide an example of where an implicit pointer conversion creates danger?

In your malloc example, there's no real danger because you used sizeof. But when it is written with a type like this:

  bar = malloc(sizeof(BarType));
Then there's danger if the type of bar does not match BarType *, perhaps after changes.

Writing it like this:

  bar = (BarType *)malloc(sizeof(BarType));
ensures the type mismatch will cause a compile-time error instead of UB.

That's not a strong argument, obviously, as you can use sizeof(*bar) instead. But it is an example of a situation where an implicit pointer conversion creates danger.
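
For reference, a minimal sketch of the sizeof(*bar) form. `struct point` and `make_points` are hypothetical; the idea is that the element size is taken from the pointer being assigned, so changing the pointer's type cannot leave a stale type name in the allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type; change it and the allocation below still matches. */
struct point {
    int x, y;
};

static struct point *
make_points(size_t n)
{
    struct point *pts;

    assert(n > 0);
    /* The size comes from *pts, so no type name is repeated on this line. */
    pts = calloc(n, sizeof *pts);
    return pts;   /* NULL on failure; the caller must free() the result */
}
```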

(I would also argue that a malloc(sizeof(type)) is more idiomatic and familiar C, so using it with other people will raise fewer eyebrows even if it's not strictly the safer choice.)

So when would it occur without sizeof(*bar) being an option? When there's a type-erased container:

  foo = list_get(FooType, foo_list, index);
  bar = hash_get(BarType, bar_hash_table, key);
If those are generic containers whose accessors return void *, you will be in trouble with UB here if the type of *foo or *bar don't match FooType and BarType respectively, perhaps after changes.

When you add the cast, again the type mismatch will cause a compile-time error.

In this case, there's no need to write the cast at the call site. Since list_get and hash_get must be macros, it's better to put the cast inside those macros. But the cast should exist, to catch the type mismatch, rather than those returning void * and relying on implicit cast rules.

(If you want to get fancy, there's a way to avoid having to pass the container item-type to those macros as an argument, but I've never seen it used in practice.)
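
A rough sketch of the "cast inside the macro" idea, under stated assumptions: `struct ptr_list`, `ptr_list_get_raw`, and `LIST_GET` are hypothetical names, not any real library's API. The accessor returns void *, and the macro casts it to the named item type, so assigning the result to a pointer of a different type is something the compiler can diagnose (how strictly depends on the compiler and its flags).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type-erased list: it stores only void pointers. */
struct ptr_list {
    void **items;
    size_t len;
};

/* Untyped accessor; on its own it converts silently to any object pointer. */
static void *
ptr_list_get_raw(const struct ptr_list *list, size_t index)
{
    assert(index < list->len);
    return list->items[index];
}

/* Typed wrapper: the cast makes the result a Type *, not a void *. */
#define LIST_GET(Type, list, index) \
    ((Type *)ptr_list_get_raw((list), (index)))
```

Something like `int *p = LIST_GET(int, &l, 1);` is accepted, while `double *q = LIST_GET(int, &l, 1);` mixes incompatible pointer types and should draw a diagnostic instead of passing silently.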


Even if you don't know malloc, you know that it returns void * or foo *. I think most C programmers consider implicit conversion between void * and other object pointers a feature. And yes, it's called an implicit conversion, not a cast -- cast is explicit.


The opposite is more dangerous. If you include the explicit cast and for some reason malloc was not declared (you forgot an include for example) then the implicit int return type that the compiler assumed will cause UB at runtime after you convert the result of malloc to a pointer of a different size.


malloc() returns a void*, that's enough of a hint that the result can be assigned to any other pointer type without requiring a cast.

Nowadays even C compilers warn when trying to assign "incompatible" pointer types.


Note that cproc uses another small but very interesting project as the backend, QBE[1]. It currently has support for x86-64 and arm64, and a risc-v port is in the making.

[1]: http://c9x.me/compile/


https://github.com/robertmuth/Cwerg is similar in nature; it supports arm64 and arm32 and can directly generate ELF executables.


Very interesting, thanks for sharing.


Would be nice to support it in Meson. Currently, there is work-in-progress TinyCC support [1]. But the work is hindered by the lack of recent TCC releases - 0.9.27 is too old to be supported.

[1] https://github.com/mesonbuild/meson/pull/8248


the approach of meson to hardcode the properties (including the name of the main executable!) of each and every compiler in the universe seems pretty short-sighted if not outright dumb and isn't quite how the designers of UNIX, the C language and compiler toolchains envisioned a build system to work.

if meson was properly designed, a user could just run `CC=whatevercc meson ...` and it would work with basically each and every single compiler available on *NIX and every POSIX compatible environment, including gcc, clang, tcc, cproc and even within cygwin. the only compiler i know that works differently is MSVC and it would be easier to just provide a wrapper that transforms POSIX-compatible compile command lines into ones that MSVC understands.

(this reminds me of the head-desk approach CMake takes to adding library dependencies: each library has a hardcoded configuration file, which can go as far as scanning your /usr/lib for libfoo.so rather than just assuming "-lfoo works, and if it doesn't, use $PKG_CONFIG --libs foo"...)

the author of cproc actually filed an issue report in 2019 pointing out this very fallacy, which is still open: https://github.com/mesonbuild/meson/issues/5406


Cproc+QBE doesn’t seem to be trying to do the linker job itself, so presumably library order (apparently a main point in that PR) does not need to be changed?


Probably makes sense to integrate a preprocessor before doing anything else, even somebody else's to get started.

Because that is pretty off-putting for casually trying it on an existing codebase to see what happens.


There is one, see pp.c.


Sorry for the stupid question, but what is the benefit of building a C compiler nowadays, and what is interesting about this particular one?


Your C compiler is probably a very large C++ program written over many years and with millions of dollars worth of time poured into it. The C compiler I use takes up hundreds of megabytes of space in a minimal build.

So in my opinion it's interesting if someone can write something original with 80% of the useful functionality on their own, or achieve it in a different way. And their code might be more approachable/educational to people than the behemoth.


The benefit is mainly that it is interesting and easy to learn from and experiment with, because it has a very small and tidy implementation.

Just an example, building cproc from source takes a few seconds - building gcc from source takes like 15 minutes at least.


> very small and tidy implementation

Especially impressive for implementing the C11 spec and some GNU extensions.



