Interesting post! Personally I discovered all of these things by just making up a language and trying to compile it. I realized that parsing was hard because of line breaks, words, and whitespace, so I wrote a lexer without knowing that it was a lexer, and later realized I had reinvented the wheel. Then I needed to store the spot in the source where each token originated, learned that it’s usually called a span, and so on. At some point I realized it was hard to figure out how to parse an expression without documenting the grammar, and then I learned about BNF. It’s actually not too hard to figure everything out yourself and then read about how people usually solve a problem when you get stuck. Learn by doing.
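For what it’s worth, the “token plus span” shape that falls out of this tends to look something like the following Rust sketch (names like `TokenKind` are just convention, not from the post):

    // A half-open byte range into the source text: the "span".
    #[derive(Debug, Clone, Copy)]
    struct Span { start: usize, end: usize }

    #[derive(Debug)]
    enum TokenKind { Ident, Number, Plus }

    #[derive(Debug)]
    struct Token { kind: TokenKind, span: Span }

    // Hand-rolled lexer: walk the bytes, remembering where each token began.
    fn lex(src: &str) -> Vec<Token> {
        let bytes = src.as_bytes();
        let (mut i, mut tokens) = (0, Vec::new());
        while i < bytes.len() {
            let start = i;
            let kind = match bytes[i] {
                b' ' | b'\t' | b'\n' => { i += 1; continue; } // skip whitespace
                b'+' => { i += 1; TokenKind::Plus }
                b'0'..=b'9' => {
                    while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
                    TokenKind::Number
                }
                _ => {
                    i += 1;
                    while i < bytes.len() && bytes[i].is_ascii_alphanumeric() { i += 1; }
                    TokenKind::Ident
                }
            };
            tokens.push(Token { kind, span: Span { start, end: i } });
        }
        tokens
    }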
Another good way to learn is to check the compiler of a language you already know and use. Then you grasp common terms like “kind” and “ty” and “span” that you can reuse.
You then realize things like: it’s better to have expressions be statements (as long as their value can be discarded); otherwise you end up duplicating code for parsing statements and expressions (for example, for function calls).
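Concretely, the usual fix is a statement variant that just wraps an expression; a minimal Rust sketch (hypothetical AST):

    // Function calls are parsed once, as expressions, and a wrapper
    // variant lets any expression appear in statement position.
    enum Expr {
        Literal(i64),
        Call { callee: String, args: Vec<Expr> },
    }

    enum Stmt {
        Let { name: String, value: Expr },
        Return(Expr),
        // An expression evaluated for its effects; the value is discarded.
        Expr(Expr),
    }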
> because importing the Error module would be a circular dependency. This is a problem as new types of errors are added.
This is really only a problem with OCaml, which is why it’s not a great language to write any big program in, IMO (you can’t easily prototype or split your logic into many modules without hitting dependency cycles).
After reading through the Austral site, I have to say that the reasoning for its various features is very well articulated. This language ticks a lot of boxes. The only feedback I have to offer is to not skimp on the FFI, because the easier it is to call into C and C++ code, the easier the language will be to adopt (at least for the domain it seems to be targeting).
I’ve always wondered why Rust creates a unique id per atom (AST node) and keeps the type information in a hashmap keyed by that id (instead of storing the type information within the atoms themselves). I guess it is to keep things flat, or to reuse the same structure across different passes.
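If it helps, the “flat side table” pattern looks roughly like this in Rust (heavily simplified; rustc’s real machinery uses interned IDs and arenas, this only shows the shape):

    use std::collections::HashMap;

    // Every node gets a cheap, copyable id instead of owning its type.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    struct NodeId(u32);

    enum Ty { Int, Bool, Fn(Vec<Ty>, Box<Ty>) }

    // Each pass keeps its own side table; the tree itself never changes shape.
    struct TypeTable { types: HashMap<NodeId, Ty> }

    impl TypeTable {
        fn record(&mut self, node: NodeId, ty: Ty) { self.types.insert(node, ty); }
        fn type_of(&self, node: NodeId) -> Option<&Ty> { self.types.get(node) }
    }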
> Writing a parser by hand is tricky and not worth the alleged performance increases
Not my experience at all. It is easy and fun. And I can actually debug the parser instead of staring at generated noise/tables. And the performance gain is usually 2x or more, making it worth it for compiling large projects. Background: I am the maintainer of a high-performance JIT optimising compiler for a Haskell-like rules language used in production by commercial companies around the world.
And yes, tests are absolutely necessary to maintain a commercial optimising compiler. I use a test suite of more than 9000 tests. The result is zero bugs in production over the last 5+ years. The upfront cost is absolutely worth it, because I can make massive changes to the internals of the compiler and still confidently release the result into production.
As for type checking: you typically won’t find any useful stuff in compiler books. However there is a massive amount of information available on type checking and type systems online. It is a huge research area and kinda the foundation of computer science. There are whole books dedicated to type checking and type systems alone.
Very good article. The focus on end-to-end testing, especially during early iteration, rings true to me; when I wrote a toy language/interpreter, I used a lot of unit tests for the individual stages. They definitely helped find and prevent bugs, but they took a lot of time to write, and they're closely coupled to the internal data structures.
You can think of it as either "Rust: The Good Parts" or "Modern Ada". It's a minimal language with linear types, capability-based security, an Ada-inspired module system, and strong typing. And it's designed with simplicity as the overriding goal: it's small enough that it fits in your head, and avoids becoming the Vasa[0].
By way of example: the borrow checker in the Rust compiler is 24.7k lines of code[1]. Austral's equivalent is 600 lines, because it's simply doing less. The linear type system is designed so that the rules fit in a page of text, and you learn them once and apply them everywhere, rather than fighting against an ever-changing, opaque pile of heuristics. It's much more like a normal type system and much less like static analysis.
There are a few example programs in the compiler repository[2] and the website[3].
Linear types are interesting, however I don't buy "Rust: The Good Parts".
In particular type inference (Rust has this, Austral does not) is a real quality of life improvement. If there's only one possible type X could be, what value is produced by insisting that I figure out how to spell that type to talk about X? This feels unimportant when X's type is an integer, but when it's an array of functions which return other functions that take an array of functions and return an integer, figuring out how to write what I'm doing is tiresome. The compiler knows, and I know, don't waste my time.
I presume that providing inference interferes with the ambition to have a small specification, because offering inference means either different implementations could have different inference rules, or else, those rules must be set out in full in the specification. However the quality of life improvement is real.
It's about readability. You write the types for documentation, and to make it easier for people doing a code walkthrough to simulate what is happening in their heads.
IDEs can add type annotations not present in the code, but there are many contexts (source control diffs, or code in a PDF, or code in a website) where that kind of software type annotation is not available.
I find that in OCaml, which has type inference for everything (function arguments, function return types, local variables, expressions), I end up writing the types almost everywhere, for two reasons. The first is documentation: it helps me read and understand code I wrote ages ago.
But the second is that type inference often goes haywire, because the compiler doesn't know ahead of time whether the code you wrote is in error. So when you make a mistake, like forgetting to pass the right number of arguments to a curried function, the compiler will happily propagate that mistake around, even propagating the erroneously-inferred types to other functions. And you get these incomprehensible type error messages that come from places that are very separate from where the mistake actually happened.
And then I find myself taking a random walk through the code, adding type annotations to constrain the inference until I find my mistake. So why not skip the middle step, and require annotations?
Often, however, it isn't obvious what the type of an expression is going to be, and it's easier to construct the value than to write the type. For those cases you can simply force a type error to find out the type, and then write it into the source.
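In Rust, for example, the trick is to annotate with a deliberately wrong type and read the real one out of the error message (this sketch intentionally does not compile):

    fn main() {
        let words = vec!["a", "bb", "ccc"];
        // Deliberately wrong annotation: the compile error spells out the
        // actual type of the right-hand side, ready to be copied in.
        let lens: () = words.iter().map(|w| (w.len(), *w));
        // error[E0308]: expected `()`, found `Map<std::slice::Iter<'_, &str>, ...>`
    }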
The simplicity reason is not that inference rules vary by compiler (the rules can be put in the spec), it's that type inference (as opposed to one-directional type propagation) can be complex to implement, and for simple extensions to an H-M type system it can become undecidable.
Coming from mostly C and Java (where I didn't have inference) and Python (dynamic types, no real annotations when I was writing it), I found Rust's degree of inference (constants, function signatures, and structures must spell their types, but other types can be inferred if unambiguous) very satisfying, and I don't fancy going back.
Never written any OCaml in anger, and it has been decades since I wrote more ML of any sort, so perhaps in OCaml I'd find there's too much inference for my taste.
I think the widespread consensus view is that function signatures should have type annotations. Local variables are more contentious, but even then, you don't strictly need type inference to omit the type annotation in local variables (except for disambiguation).
I could easily modify the compiler to allow omitting types from `let` statements. Maybe for development you'd use annotation-free `let`s, and when publishing or otherwise finalizing code you have to write the type annotation.
I could be convinced to add this as a feature, but I tend to favor the strict and one-way-to-do-it approach.
I will say that I find the Java-style `Widget widget = new Widget(junk);` infuriating; this barely counts as inference, it's more like lookahead. `my widget = Widget(junk)` is easier to look at and shorter to type, while conveying the same information.
Languages which use only inference have been written, but they're just not ergonomic; no one wants to use that. So the Widget is always available as an annotation (although I can't read your intent out of one line of Java clearly, and depending on what that is, a cleaner type system than Java's might take care of covariance well enough that we still wouldn't care), or as a cast.
Java itself introduced syntax for eliding the type of simple variables, so this is an old complaint, but yeah: a little bit of inference goes a long way.
> offering inference means either different implementations could have different inference rules, or else, those rules must be set out in full in the specification
As a tangent, there is a third option (or a twist on the second), called "principal types": carefully design the type system so that there is only one possible type for each expression, or at least one "best" or "most general" type. The spec for this can be much smaller than a full type-inference algorithm.
1. I think it's good enough for concurrency. A big selling point of linear types is that, since there can only be a single reference to a linear value at any time, concurrency is trivial. You just send linear values to other threads over a channel, and they're consumed (from the point of view of the sending thread); see the sketch after this list.
2. The set of types is divided into two universes: the Free (i.e., unrestricted) universe, containing types like bool, int, and records and unions of other free types; and the Linear universe, containing linear types. So non-linear types are allowed, and they are the default for anything that's not a resource with a particular lifecycle (i.e., anything other than memory, file handles, socket handles, that kind of thing).
3. References are possible and they work like borrowing in Rust.
4. No refcounting or GC, it's done at compile time like in Rust.
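For point 1, here is roughly what that looks like using Rust's move semantics as a stand-in (a sketch, not Austral; linear types are stricter than Rust's affine ones):

    use std::sync::mpsc;
    use std::thread;

    struct Resource(String); // stand-in for a linear value

    fn main() {
        let (tx, rx) = mpsc::channel();
        let handle = thread::spawn(move || {
            let r: Resource = rx.recv().unwrap();
            println!("consumed {}", r.0);
        });
        let r = Resource("file handle".into());
        tx.send(r).unwrap();
        // `r` is moved: using it here would be a compile error, which is
        // the "consumed from the sender's point of view" property.
        handle.join().unwrap();
    }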
You have to use a different set of operators to get modular arithmetic semantics. So, the choice is not a flag you flip at build time, but a choice of operators.
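Rust draws the same line, just with methods rather than separate operators, which might be a useful comparison (a sketch):

    fn main() {
        let x: u8 = 255;
        // `x + 1` panics in debug builds (overflow check);
        // modular arithmetic is a different, explicit operation:
        let wrapped = x.wrapping_add(1);      // 0
        let checked = x.checked_add(1);       // None
        let saturated = x.saturating_add(1);  // 255
        println!("{wrapped} {checked:?} {saturated}");
    }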
>You have to use a different set of operators to get modular arithmetic semantics. So, the choice is not a flag you flip at build time, but a choice of operators.
Like Zig I think.
As for 'panic at overflow', I think Erlang/Elixir behaves the same, so there are probably a lot of 'design patterns'/libraries to take inspiration from there.
One note: your language allows 'named parameters', as in "return File(handle => ptr);", so the name of the parameter is part of the API and must be chosen carefully. `ptr` isn't a very good name; `handle` would be better, but then of course you get "return File(handle => handle);", which is a classic problem with named parameters.
Destructors, exception handling, and the type system are all tied together, so I'm not sure where to begin explaining. These two sections in the rationale provide a complete explanation:
But basically: there are no destructors because there is no exception handling, and therefore no stack unwinding. The reason there's no exception handling is that the semantics (and implementation) of exception handling are incredibly complicated, and they make the code harder to reason about.
A design goal in Austral is "no hidden control flow". If it's not in the source code, it isn't happening.
Linear types are also not compatible with exception handling (which is why Rust uses affine types, sort of).
In place of destructors you have functions that consume linear values. These are ordinary functions and have to be called explicitly; calls to them are not inserted by the compiler.
For example, imagine that File is a linear type that holds a file handle. You might have something like:
    let f: File := openFile("foo.txt");
    -- code etc etc.
    closeFile(f);
Here, closeFile is a "destructor" in the sense that it takes a value of a linear type and returns nothing. It's not a destructor in the C++ sense because the call is not inserted for you at the end of a block or during stack unwinding.
But, what happens if I have a Bunch of things and now I'm done with it?
In a language like C++ or Rust, if those things happen to be Files, then when they're destroyed (because I was done with the whole Bunch) the files get closed. [This also means, although it would usually only happen intentionally in Rust, that you can leak the open files; that's what Rust's Box::leak does, for example, but if it didn't exist you could do it yourself.]
I haven't written any Austral, but, if I have a Bunch of Files, am I now responsible for taking all the Files back out of the Bunch to closeFile them or else my program won't compile somehow? That seems like it'd suck for performance.
Or maybe I just can't put Files in a Bunch because they're Linear? That seems like an annoying limitation.
Or maybe I need to tell the Bunch to call closeFile on the Files when I get rid of the Bunch?
You'd take out each File from the Bunch and close it, and then deallocate the Bunch. This can be abstracted into a function.
>That seems like it'd suck for performance.
In either C++ or Rust, the exact same thing is happening; you just don't see it in the source code, because the compiler injects the destructor calls for you at the end of a block.
But there's nothing magical about destructors. The exact same code is being executed as in Austral, only Austral has 1) no surprise control flow and 2) no hidden function calls, so you have to see the code that is going to be executed.
It might look something like this:
    -- bunch has type Bunch[File]
    while not isEmpty(&bunch) do
        let f: File := pop(&!bunch);
        closeFile(f);
    end while;
    disposeBunch(bunch);
Did you try this? Because while you're correct in terms of the semantics, the performance (which is why I chose that word) is quite another matter.
Specifically it's not at all uncommon for some collection (such as our arbitrary Bunch) to offer a far cheaper way to throw away the whole Bunch than to remove every Thing in the bunch one at a time.
With a destruction mechanic this is fine: the collection destroys all the Files at once. But if the user must hand-roll a loop like yours, taking each File out of the Bunch and then calling closeFile on it, they can't take advantage of the improved performance.
It's not about removing each element individually vs. deallocating the whole collection. It's that you have to close each file handle individually.
In C++ you might iterate over the bunch, close each file, and then deallocate the whole thing. In Austral, because of how linear types work, you have to take a linear value out of the collection, destroy it, then, when the collection is empty, you can safely deallocate it.
There doesn't have to be a performance difference between the two because popping an element from a bunch does not imply resizing the bunch or doing any allocations.
In either language, the effectful actions are the same: close n file handles, deallocate the memory for a bunch.
> It's that you have to close each file handle individually.
My emphasis. This is where the problem creeps in. Let's see that again:
> In C++ you might iterate over the bunch, close each file, and then deallocate the whole thing.
In Rust or C++ actually the Bunch will destroy and thus close all the files. Not me, I just destroyed the whole Bunch, not my problem how that works.
> There doesn't have to be a performance difference between the two
If your description had been more precise it might be apparent where a performance difference comes from. The Bunch knows the intimate details of its internal workings, so it may be able to do significantly less "housekeeping" while it is cleaning up. Without that knowledge however, when I am cleaning up the Bunch has no idea what my ultimate intent is and so it needs to do proper housekeeping.
I think you could try to mitigate that, at the cost of further language complexity, by introducing another linear type, a ToBeDestroyedBunch, which knows it is destined to be destroyed and exists only while the caller destroys the Things that were in the Bunch.
So, when I want rid of the whole Bunch, I call a function with a Bunch and get back a ToBeDestroyedBunch. This needn't do all the housekeeping on each removal, because we're obliged to dispose of it properly once we're finished, and we can no longer do any other operations on a Bunch, because we don't have a Bunch, only a ToBeDestroyedBunch.
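In Rust terms, that two-state idea might look like the following (entirely hypothetical API, just to show the shape):

    struct Bunch<T> {
        items: Vec<T>,
        // imagine indexes, capacity bookkeeping, etc.
    }

    // A one-way state: you can only drain it, never use it as a Bunch again.
    struct ToBeDestroyedBunch<T> {
        items: Vec<T>,
    }

    impl<T> Bunch<T> {
        // Consumes the Bunch, so no other Bunch operations remain possible.
        fn into_draining(self) -> ToBeDestroyedBunch<T> {
            ToBeDestroyedBunch { items: self.items }
        }
    }

    impl<T> ToBeDestroyedBunch<T> {
        // No housekeeping: just hand elements out until empty.
        fn pop(&mut self) -> Option<T> {
            self.items.pop()
        }
    }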
The Bunch API could easily include a function that takes a bunch and a function `t -> ()` that deallocates individual members, so that the logic is encapsulated in a function. You have to pass the destructor function explicitly because the language doesn't know which function is the "canonical" destructor (and a type may have multiple destructors, i.e. multiple ways to end its lifecycle).
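Sketched in Rust with hypothetical names, that encapsulated version might look like:

    struct Bunch<T> { items: Vec<T> }

    impl<T> Bunch<T> {
        // Consume the bunch, running the caller-supplied "destructor" on
        // every element. The caller picks which end-of-lifecycle function
        // applies, since a type may have several.
        fn dispose_with(self, mut destroy: impl FnMut(T)) {
            for item in self.items {
                destroy(item);
            }
            // The bunch's own memory is freed here (implicitly in Rust;
            // in Austral this would be an explicit deallocation call).
        }
    }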
Yes, it seems to me that the only difference is that destructors are called implicitly in C++/Rust and explicitly in Austral.
I wonder if IDE support could help get the best of both worlds: the IDE could automatically add the destructor calls where needed, with an "automatically generated" annotation that lets it update and remove the calls as necessary. I actually kind of like the idea.
Similar things could be done for inference/type deduction.
Possibly exceptions as well.
The IDE could also allow hiding the automatically generated blocks when you want to focus on the code instead of the incidental bits.
Probably these days new languages need to come with LSP support out of the box, so that seems like the best place to add the magic.
Sure, the idea of potentially having different ends to the lifecycle makes sense to me, same way constructors aren't a thing in Rust.
And yes, I agree that this solves the problem, which was why I suggested it as one of three ways this could work if you look back up the thread. Not sure about how Austral can assure itself the Linear contractual obligations were discharged here, but I might be overthinking it.
I dislike Zig for its lack of destructors and "no hidden control flow". I'll probably dislike your language too, but is there one big difference between Zig and Austral?
>It is the Zig programmer's responsibility to ensure that a pointer is not accessed when the memory pointed to is no longer available. Note that a slice is a form of pointer, in that it references other memory.
Zig is very much a manually-managed language. Austral has manual memory (and resource) management, but has type-system tools to help you ensure correctness.
Yes, which is why it's a bad book and you shouldn't tell people to read it. Actual compilers tend to handwrite their parsers rather than using theory; parser generators aren't good at recovery and error messages.
I think writing a parser by hand should be reserved for when
1. You have users.
2. You can demonstrate a marked improvement in performance and/or error reporting with the hand-written parser.
Otherwise, deriving a parser from a grammar should absolutely be the default. Writing parsers by hand is error-prone, causes security vulnerabilities[0], and is very tedious work.
Having written production-level compilers, my advice: use ANTLR for as long as humanly possible. It can remain adequate far, far longer than you think, and will save you months/years of stress.
When you outgrow it, you'll know - and by then, hopefully you'll have funding/users/contributors/etc to support the dev effort of rolling your own parser.
Our compiler stack is in Haskell, so we initially started with Parsec (a parser combinator library) and eventually switched to Happy (the parser generator GHC itself uses), since without a formal grammar defined we were writing a lot of code for every little addition to the language (it's a simple declarative DSL, https://github.com/wasp-lang/wasp).
The problem we're now facing with Happy is that it doesn't really have support for error recovery. We're trying to implement LSP so recovery is important in order to have proper syntax highlighting and error reporting in an IDE.
I'm wondering how well ANTLR covers this; I was quite surprised to find that Happy has so little support for it.
IIRC Happy (and Alex, the accompanying lexer generator) are pretty much straight ports of lex/yacc to Haskell; they predate some of the conveniences of newer parser generators like ANTLR.
I would give the opposite advice. Derived parsers have shit errors, which is arguably the most important feature of a parser besides being able to parse. And writing parsers by hand is trivial anyway; in a lot of cases, if you squint a little, you can see the BNF.
Just write recursive descent, except for infix operator precedence; for that, use a tiny stack that you either push to or pop from (shift/reduce, à la LR) based on the relative precedence.
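One common rendering of that idea in Rust, evaluating as it parses to keep the sketch short (this is precedence climbing, the recursive cousin of the explicit operator stack):

    #[derive(Clone, Copy)]
    enum Tok { Num(i64), Plus, Star, Eof }

    // Binding power of an infix operator, if the token is one.
    fn prec(t: Tok) -> Option<u8> {
        match t {
            Tok::Plus => Some(1),
            Tok::Star => Some(2),
            _ => None,
        }
    }

    struct Parser { toks: Vec<Tok>, pos: usize }

    impl Parser {
        fn peek(&self) -> Tok { self.toks[self.pos] }
        fn bump(&mut self) -> Tok { let t = self.peek(); self.pos += 1; t }

        // Plain recursive descent for atoms...
        fn atom(&mut self) -> i64 {
            match self.bump() {
                Tok::Num(n) => n,
                _ => panic!("expected a number"),
            }
        }

        // ...and a small precedence loop for infix operators: keep going
        // while the next operator binds at least as tightly as `min_prec`.
        fn expr(&mut self, min_prec: u8) -> i64 {
            let mut lhs = self.atom();
            while let Some(p) = prec(self.peek()) {
                if p < min_prec { break; }
                let op = self.bump();
                let rhs = self.expr(p + 1); // +1 makes operators left-associative
                lhs = match op {
                    Tok::Plus => lhs + rhs,
                    Tok::Star => lhs * rhs,
                    _ => unreachable!(),
                };
            }
            lhs
        }
    }

    fn main() {
        // 1 + 2 * 3 evaluates to 7
        let toks = vec![Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3), Tok::Eof];
        let mut p = Parser { toks, pos: 0 };
        println!("{}", p.expr(0));
    }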
Pretty much all compilers that have fast parsers that also give good error messages are hand-written recursive descent.
It's a fine book and you should not tell people to read it.
It belongs in a first compilers course where you learn about how scanners and parsers work, not just "configure this then call `yylex`". So yeah, it nerds out about finite state machines for a little too long for most audiences. And you know your audience. Your audience wants to write a compiler today, so you are correct not to recommend this book.
> Actual compilers tend to handwrite their parsers rather than using theory; parser generators aren't good at recovery and error messages.
The linked article recommends using a parser generator. In my experience, however, I agree with you: parser generators aren't great in practice, which makes me hesitant to read the rest of the post.
The time investment of a good hand-written parser is only worth it once you have exhausted what a grammar-derived parser can provide, and you have users whose workflow would actually improve from the performance/error-reporting gains.
For Clang and GCC it is absolutely a no-brainer to use a hand-written parser.
If you're trying to get to an MVP compiler, or you're building a bootstrapping compiler, using a grammar-derived parser to get to the finish line faster is far more sensible.
What could be easier than writing a grammar in EBNF-ish notation? This gives you the added benefit that the grammar is explicitly defined, and not implicit in the parser code.
It's also far easier to write buggy handwritten parsers. I don't have much experience, but in the project I've seen that used a handwritten parser, the issue tracker was full of parser bugs.
I tend to write lots of tests at each level -- lexer, parser, AST, etc. -- to cover the different BNF grammar statements. I've done this with an XQuery and XPath parser I wrote for IntelliJ.
You can cover the different lexical variations in the lexical tests, and ensure that the lexer is reporting the correct things in the lexer tests. For these, I generally ensure I cover each token in a given BNF grammar statement, even if that means repeating checks for a given keyword, as it keeps the tests self-contained and easier to verify.
You can cover the different parser/grammar variations (different branches/paths in the BNF, missing tokens for error reporting and recovery, etc.) in the parser tests. These can make use of serializing the tree to a file, so your input is an example program and the output is the parsed AST tree.
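A hedged Rust sketch of that serialize-and-compare style (the `parse` function and AST are hypothetical; snapshot libraries like insta automate the file handling):

    #[derive(Debug)]
    enum Expr {
        Num(i64),
        Add(Box<Expr>, Box<Expr>),
    }

    // Hypothetical parser under test; the real implementation is elided.
    fn parse(_src: &str) -> Expr {
        Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))
    }

    #[test]
    fn parses_addition() {
        // Golden-style test: the expected tree lives next to the input,
        // so a failing diff shows exactly how the parse changed.
        let ast = parse("1 + 2");
        assert_eq!(format!("{:?}", ast), "Add(Num(1), Num(2))");
    }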
The AST tests can then test the actual data model your AST provides, such as access to the value of a literal, or information about a type.
Higher-level tests such as constant folding can then be done using the AST API, or things like serializing the AST to a textual representation.
An unambiguous parser is easier to write tests for. It leads to fewer subtle "gotchas" where something syntax checks and compiles but is interpreted by the compiler as something very different than what the programmer intended. It's easier for third parties to write their own parsers and tooling for the language if they can target a formal grammar as opposed to the behavior of your Turing complete code.
But sometimes you don't care about any of those things. Especially if you're talking about a language with a small surface area. Or if you have to deal with legacy stuff (for example, parsing something like C++ that is heavily context sensitive, in which case trying to shoehorn things into a formal CFG so you can feed it to a generator might actually make things much harder than they have to be).
Not always, and that’s a great question! But it can help you catch things that might not be obvious when you first create the grammar, even simple things like ambiguous nested if statements or errors in a string-formatting DSL's syntax.
The article says exactly what I'd say. It spends far too much time explaining the frontend (parsing theory) which isn't the difficult or important part of the compiler, and it somehow makes it sound scary and mathematical.
The rest isn't covered enough and doesn't have enough practical real-world advice.
As someone who has had to clean up (with backwards compatibility) after people who YOLO'd custom grammars rather than actually taking the time to respect the mathematics of parsing and design ambiguity out up front...
I mean, I think you should read a book on parsing theory, just not that one. I like "Parsing Techniques: A Practical Guide" but there might be a newer better one.
But clang and gcc both have handwritten recursive-descent parsers for C/C++. (C++ parsing of course is undecidable, or at least used to be.)
C++ parsing being undecidable, and C++'s Most Vexing Parse, are the classic arguments for developing a grammar using a parser generator, and only when the grammar is mature writing a hand-rolled implementation with the grammar as a reference.
I happen to think generating useful software from declarative grammar specifications is, hmm, not limited to what such frameworks have been capable of to date. Although if you've only worked with yacc/bison, and never ANTLR, you may be underestimating the quality of a generated parser with respect to error recovery and messages. The state of the art has advanced; I don't think anyone would claim parity with artisanal handcrafted software, but we can get a lot more than "barf: shift-reduce bunchanumbers" these days.
I can echo this sentiment. Last year I took a course for my master's where we implemented a compiler end to end. The dragon book helped for most of it, but when it came time to implement the assembler we were unable to get it completely correct, and the book was very scant on information about how best to manage going from virtual to physical registers, handle storing information on the stack, etc. Granted, some of that is platform-specific, but the difference in material from frontend to backend was stark.
The dragon book has a few places which are really good, and then a few places which feel like an afterthought shoved in later, like SSA (which is really important).
We are due a new batch of compiler books. As the utility of investment in compilers has decreased, we are stuck with books that mostly expected x86 to die and be replaced by potentially statically schedulable machines, so a lot of time is spent on algorithms that are really, really fuzzy on a modern core.
Muchnick is still the Bible for optimizations, but it's written in a bizarre pseudo-language, which makes many of the algorithms rather impenetrable.
JITs are not very well covered in the literature. There are a lot of subtle variations in how they design their IRs these days.
I’d update this to Just Use Rust, because now you get the benefit of a decent package ecosystem and most of the benefits of OCaml. There are some messy lifetimes, but something like Rowan and a couple of arenas solves that.
As for parsing, I think Tree-sitter is pretty solid: you get the benefits of a parser generator, plus a fast incremental CST parser.
The extra work required to deal with the lifetime of a graph is pretty unfriendly for a beginner. It's one of the few things for which, though I would be willing to use Rust myself, I might not recommend it to someone else.