Interesting post! Personally I discovered all of these things by just making up a language and trying to compile it. I realized that parsing was hard because of line breaks, words, and whitespace, so I wrote a lexer without knowing that it was a lexer, and later realized I had reinvented the wheel. Then I needed to store the spot in the source where each token originated, learned that it’s usually called a span, and so on. At some point I realized it was hard to figure out how to parse an expression without documenting the grammar, and then I learned about BNF. It’s actually not too hard to figure everything out yourself and then read about how people usually solve a problem when you get stuck. Learn by doing.
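For what it’s worth, the “token plus span” shape that falls out of this tends to look something like the following Rust sketch (names like `TokenKind` are just convention, not from the post):

    // A half-open byte range into the source text: the "span".
    #[derive(Debug, Clone, Copy)]
    struct Span { start: usize, end: usize }

    #[derive(Debug)]
    enum TokenKind { Ident, Number, Plus }

    #[derive(Debug)]
    struct Token { kind: TokenKind, span: Span }

    // Hand-rolled lexer: walk the bytes, remembering where each token began.
    fn lex(src: &str) -> Vec<Token> {
        let bytes = src.as_bytes();
        let (mut i, mut tokens) = (0, Vec::new());
        while i < bytes.len() {
            let start = i;
            let kind = match bytes[i] {
                b' ' | b'\t' | b'\n' => { i += 1; continue; } // skip whitespace
                b'+' => { i += 1; TokenKind::Plus }
                b'0'..=b'9' => {
                    while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
                    TokenKind::Number
                }
                _ => {
                    i += 1;
                    while i < bytes.len() && bytes[i].is_ascii_alphanumeric() { i += 1; }
                    TokenKind::Ident
                }
            };
            tokens.push(Token { kind, span: Span { start, end: i } });
        }
        tokens
    }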
Another good way to learn is to check the compiler of a language you already know and use. Then you grasp common terms like “kind” and “ty” and “span” that you can reuse.
You then realize things like: it’s better to have expressions be statements (as long as their value can be discarded); otherwise you end up duplicating code for parsing statements and expressions (for example, for function calls).
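Concretely, the usual fix is a statement variant that just wraps an expression; a minimal Rust sketch (hypothetical AST):

    // Function calls are parsed once, as expressions, and a wrapper
    // variant lets any expression appear in statement position.
    enum Expr {
        Literal(i64),
        Call { callee: String, args: Vec<Expr> },
    }

    enum Stmt {
        Let { name: String, value: Expr },
        Return(Expr),
        // An expression evaluated for its effects; the value is discarded.
        Expr(Expr),
    }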
> because importing the Error module would be a circular dependency. This is a problem as new types of errors are added.
This is really only a problem with OCaml, which is why it’s not a great language to write any big program in, IMO (you can’t easily prototype or split your logic into many modules without hitting dependency cycles).
After reading through the Austral site, I have to say that the reasoning for its various features is very well articulated. This language ticks a lot of boxes. The only feedback I have to offer is to not skimp on the FFI, because the easier it is to call into C and C++ code, the easier the language will be to adopt (at least for the domain it seems to be targeting).
I’ve always wondered why Rust creates a unique id per atom (AST node) and keeps the type information in a hashmap keyed by that id (instead of storing the type information within the atoms themselves). I guess it is to keep things flat, or to reuse the same structure across different passes.
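If it helps, the “flat side table” pattern looks roughly like this in Rust (heavily simplified; rustc’s real machinery uses interned IDs and arenas, this only shows the shape):

    use std::collections::HashMap;

    // Every node gets a cheap, copyable id instead of owning its type.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    struct NodeId(u32);

    enum Ty { Int, Bool, Fn(Vec<Ty>, Box<Ty>) }

    // Each pass keeps its own side table; the tree itself never changes shape.
    struct TypeTable { types: HashMap<NodeId, Ty> }

    impl TypeTable {
        fn record(&mut self, node: NodeId, ty: Ty) { self.types.insert(node, ty); }
        fn type_of(&self, node: NodeId) -> Option<&Ty> { self.types.get(node) }
    }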
> Writing a parser by hand is tricky and not worth the alleged performance increases
Not my experience at all. It is easy and fun. And I can actually debug the parser instead of staring at generated noise/tables. And the performance gain is usually 2x or more, making it worth it for compiling large projects. Background: I am the maintainer of a high-performance JIT optimising compiler for a Haskell-like rules language used in production by commercial companies around the world.
And yes, tests are absolutely necessary to maintain a commercial optimising compiler. I use a test suite of more than 9000 tests. The result is zero bugs in production over the last 5+ years. The upfront cost is absolutely worth it, because I can make massive changes to the internals of the compiler and still confidently release the result into production.
As for type checking: you typically won’t find any useful stuff in compiler books. However there is a massive amount of information available on type checking and type systems online. It is a huge research area and kinda the foundation of computer science. There are whole books dedicated to type checking and type systems alone.
Very good article. The focus on end-to-end testing, especially during early iteration, rings true to me; when I wrote a toy language/interpreter, I used a lot of unit tests for the individual stages. They definitely helped find and prevent bugs, but they took a lot of time to write, and they're closely coupled to the internal data structures.
You can think of it as either "Rust: The Good Parts" or "Modern Ada". It's a minimal language with linear types, capability-based security, an Ada-inspired module system, and strong typing. And it's designed with simplicity as the overriding goal: it's small enough that it fits in your head, and avoids becoming the Vasa[0].
By way of example: the borrow checker in the Rust compiler is 24.7k lines of code[1]. Austral's equivalent is 600 lines, because it's simply doing less. The linear type system is designed so that the rules fit in a page of text, and you learn them once and apply them everywhere, rather than fighting against an ever-changing, opaque pile of heuristics. It's much more like a normal type system and much less like static analysis.
There are a few example programs in the compiler repository[2] and the website[3].
Linear types are interesting, however I don't buy "Rust: The Good Parts".
In particular type inference (Rust has this, Austral does not) is a real quality of life improvement. If there's only one possible type X could be, what value is produced by insisting that I figure out how to spell that type to talk about X? This feels unimportant when X's type is an integer, but when it's an array of functions which return other functions that take an array of functions and return an integer, figuring out how to write what I'm doing is tiresome. The compiler knows, and I know, don't waste my time.
I presume that providing inference interferes with the ambition to have a small specification, because offering inference means either different implementations could have different inference rules, or else, those rules must be set out in full in the specification. However the quality of life improvement is real.
It's about readability. You write the types for documentation, and to make it easier for people doing a code walkthrough to simulate what is happening in their heads.
IDEs can add type annotations not present in the code, but there are many contexts (source control diffs, or code in a PDF, or code in a website) where that kind of software type annotation is not available.
I find that in OCaml, which has type inference for everything (function arguments, function return types, local variables, expressions), I end up writing the types almost everywhere, for two reasons. The first is documentation: it helps me read and understand code I wrote ages ago.
But the second is that type inference often goes haywire, because the compiler doesn't know ahead of time whether the code you wrote is in error. So when you make a mistake, like forgetting to pass the right number of arguments to a curried function, the compiler will happily propagate that mistake around, even propagating the erroneously-inferred types to other functions. And you get these incomprehensible type error messages that come from places that are very separate from where the mistake actually happened.
And then I find myself taking a random walk through the code, adding type annotations to constrain the inference until I find my mistake. So why not skip the middle step, and require annotations?
Often, however, it isn't obvious what the type of an expression is going to be, and it's easier to construct the value than to write the type. For those cases you can simply force a type error to find out the type, and then write it into the source.
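In Rust, for example, the trick is to annotate with a deliberately wrong type and read the real one out of the error message (this sketch intentionally does not compile):

    fn main() {
        let words = vec!["a", "bb", "ccc"];
        // Deliberately wrong annotation: the compile error spells out the
        // actual type of the right-hand side, ready to be copied in.
        let lens: () = words.iter().map(|w| (w.len(), *w));
        // error[E0308]: expected `()`, found `Map<std::slice::Iter<'_, &str>, ...>`
    }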
The simplicity reason is not that inference rules vary by compiler (the rules can be put in the spec), it's that type inference (as opposed to one-directional type propagation) can be complex to implement, and for simple extensions to an H-M type system it can become undecidable.
Coming from mostly C and Java (where I didn't have inference) and Python (dynamic types, no real annotations when I was writing it), I found Rust's degree of inference (constants, function signatures, and structures must spell their types, but other types can be inferred if unambiguous) very satisfying, and I don't fancy going back.
Never written any OCaml in anger, and it has been decades since I wrote more ML of any sort, so perhaps in OCaml I'd find there's too much inference for my taste.
I think the widespread consensus view is that function signatures should have type annotations. Local variables are more contentious, but even then, you don't strictly need type inference to omit the type annotation in local variables (except for disambiguation).
I could easily modify the compiler to allow omitting types from `let` statements. Maybe for development you'd use annotation-free `let`s, and when publishing or otherwise finalizing code you have to write the type annotation.
I could be convinced to add this as a feature, but I tend to favor the strict and one-way-to-do-it approach.
I will say that I find the Java-style `Widget widget = new Widget(junk);` infuriating; this barely counts as inference, it's more like lookahead. `my widget = Widget(junk)` is easier to look at and shorter to type, while conveying the same information.
Languages which use only inference have been written, but they're just not ergonomic; no one wants to use that. So the Widget is always available as an annotation (although I can't read your intent out of one line of Java clearly, and depending on what that is, a cleaner type system than Java's might take care of covariance well enough that we still wouldn't care), or as a cast.
Java itself introduced syntax for eliding the type of simple variables, so this is an old complaint, but yeah: a little bit of inference goes a long way.
> offering inference means either different implementations could have different inference rules, or else, those rules must be set out in full in the specification
As a tangent, there is a third option (or a twist on the second), called "principal types": carefully design the type system so that there is only one possible type for each expression, or at least one "best" or "most general" type. The spec for this can be much smaller than a full type-inference algorithm.
1. I think it's good enough for concurrency. A big selling point of linear types is that, since there can only be a single reference to a linear value at any time, concurrency is trivial. You just send linear values to other threads over a channel, and they're consumed (from the point of view of the sending thread); see the sketch after this list.
2. The set of types is divided into two universes: the Free (i.e., unrestricted) universe, containing types like bool, int, and records and unions of other free types; and the Linear universe, containing linear types. So non-linear types are allowed, and they are the default for anything that's not a resource with a particular lifecycle (i.e., anything other than memory, file handles, socket handles, that kind of thing).
3. References are possible and they work like borrowing in Rust.
4. No refcounting or GC, it's done at compile time like in Rust.
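For point 1, here is roughly what that looks like using Rust's move semantics as a stand-in (a sketch, not Austral; linear types are stricter than Rust's affine ones):

    use std::sync::mpsc;
    use std::thread;

    struct Resource(String); // stand-in for a linear value

    fn main() {
        let (tx, rx) = mpsc::channel();
        let handle = thread::spawn(move || {
            let r: Resource = rx.recv().unwrap();
            println!("consumed {}", r.0);
        });
        let r = Resource("file handle".into());
        tx.send(r).unwrap();
        // `r` is moved: using it here would be a compile error, which is
        // the "consumed from the sender's point of view" property.
        handle.join().unwrap();
    }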
You have to use a different set of operators to get modular arithmetic semantics. So, the choice is not a flag you flip at build time, but a choice of operators.
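Rust draws the same line, just with methods rather than separate operators, which might be a useful comparison (a sketch):

    fn main() {
        let x: u8 = 255;
        // `x + 1` panics in debug builds (overflow check);
        // modular arithmetic is a different, explicit operation:
        let wrapped = x.wrapping_add(1);      // 0
        let checked = x.checked_add(1);       // None
        let saturated = x.saturating_add(1);  // 255
        println!("{wrapped} {checked:?} {saturated}");
    }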
>You have to use a different set of operators to get modular arithmetic semantics. So, the choice is not a flag you flip at build time, but a choice of operators.
Like Zig I think.
As for 'panic at overflow', I think Erlang/Elixir behaves the same, so there are probably a lot of 'design patterns'/libraries to take inspiration from there.
One note: your language allows 'named parameters', as in "return File(handle => ptr);", so the name of the parameter is part of the API and must be chosen carefully. `ptr` isn't a very good name; `handle` would be better, but then of course you get "return File(handle => handle);", which is a classic problem with named parameters.
Destructors, exception handling, and the type system are all tied together, so I'm not sure where to begin explaining. These two sections in the rationale provide a complete explanation:
But basically: there are no destructors because there is no exception handling, and therefore no stack unwinding. The reason there's no exception handling is that the semantics (and implementation) of exception handling are incredibly complicated, and they make the code harder to reason about.
A design goal in Austral is "no hidden control flow". If it's not in the source code, it isn't happening.
Linear types are also not compatible with exception handling (which is why Rust uses affine types, sort of).
In place of destructors you have functions that consume linear values. These are ordinary functions and have to be called explicitly; calls to them are not inserted by the compiler.
For example, imagine that File is a linear type that holds a file handle. You might have something like:
    let f: File := openFile("foo.txt");
    -- code etc etc.
    closeFile(f);
Here, closeFile is a "destructor" in the sense that it takes a value of a linear type and returns nothing. It's not a destructor in the C++ sense because the call is not inserted for you at the end of a block or during stack unwinding.
But, what happens if I have a Bunch of things and now I'm done with it?
In a language like C++ or Rust, if those things happen to be Files, then when they're destroyed (because I was done with the whole Bunch) the files get closed. [This also means, although it would usually only happen intentionally in Rust, that you can leak the open files; that's what Rust's Box::leak does, for example, but if it didn't exist you could do it yourself.]
I haven't written any Austral, but, if I have a Bunch of Files, am I now responsible for taking all the Files back out of the Bunch to closeFile them or else my program won't compile somehow? That seems like it'd suck for performance.
Or maybe I just can't put Files in a Bunch because they're Linear? That seems like an annoying limitation.
Or maybe I need to tell the Bunch to call closeFile on the Files when I get rid of the Bunch?
You'd take out each File from the Bunch and close it, and then deallocate the Bunch. This can be abstracted into a function.
>That seems like it'd suck for performance.
In either C++ or Rust, the exact same thing is happening; you just don't see it in the source code, because the compiler injects the destructor calls for you at the end of a block.
But there's nothing magical about destructors. The exact same code is being executed as in Austral, only Austral has 1) no surprise control flow and 2) no hidden function calls, so you have to see the code that is going to be executed.
It might look something like this:
    -- bunch has type Bunch[File]
    while not isEmpty(&bunch) do
        let f: File := pop(&!bunch);
        closeFile(f);
    end while;
    disposeBunch(bunch);
Did you try this? Because while you're correct in terms of the semantics, the performance (which is why I chose that word) is quite another matter.
Specifically it's not at all uncommon for some collection (such as our arbitrary Bunch) to offer a far cheaper way to throw away the whole Bunch than to remove every Thing in the bunch one at a time.
With a destruction mechanic this is fine: the collection destroys all the Files at once. But if the user must hand-roll a loop like yours, taking each File out of the Bunch and then calling closeFile on it, they can't take advantage of the improved performance.
It's not about removing each element individually vs. deallocating the whole collection. It's that you have to close each file handle individually.
In C++ you might iterate over the bunch, close each file, and then deallocate the whole thing. In Austral, because of how linear types work, you have to take a linear value out of the collection, destroy it, then, when the collection is empty, you can safely deallocate it.
There doesn't have to be a performance difference between the two because popping an element from a bunch does not imply resizing the bunch or doing any allocations.
In either language, the effectful actions are the same: close n file handles, deallocate the memory for a bunch.
> It's that you have to close each file handle individually.
My emphasis. This is where the problem creeps in. Let's see that again:
> In C++ you might iterate over the bunch, close each file, and then deallocate the whole thing.
In Rust or C++ actually the Bunch will destroy and thus close all the files. Not me, I just destroyed the whole Bunch, not my problem how that works.
> There doesn't have to be a performance difference between the two
If your description had been more precise it might be apparent where a performance difference comes from. The Bunch knows the intimate details of its internal workings, so it may be able to do significantly less "housekeeping" while it is cleaning up. Without that knowledge however, when I am cleaning up the Bunch has no idea what my ultimate intent is and so it needs to do proper housekeeping.
I think you could try to mitigate that, at the cost of further language complexity, by introducing another linear type, a ToBeDestroyedBunch, which knows it is destined to be destroyed and exists only while the caller destroys the Things that were in the Bunch.
So, when I want rid of the whole Bunch, I call a function with a Bunch and get back a ToBeDestroyedBunch. This needn't do all the housekeeping on each removal, because we're obliged to dispose of it properly once we're finished, and we can no longer do any other operations on a Bunch, because we don't have a Bunch, only a ToBeDestroyedBunch.
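In Rust terms, that two-state idea might look like the following (entirely hypothetical API, just to show the shape):

    struct Bunch<T> {
        items: Vec<T>,
        // imagine indexes, capacity bookkeeping, etc.
    }

    // A one-way state: you can only drain it, never use it as a Bunch again.
    struct ToBeDestroyedBunch<T> {
        items: Vec<T>,
    }

    impl<T> Bunch<T> {
        // Consumes the Bunch, so no other Bunch operations remain possible.
        fn into_draining(self) -> ToBeDestroyedBunch<T> {
            ToBeDestroyedBunch { items: self.items }
        }
    }

    impl<T> ToBeDestroyedBunch<T> {
        // No housekeeping: just hand elements out until empty.
        fn pop(&mut self) -> Option<T> {
            self.items.pop()
        }
    }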
The Bunch API could easily include a function that takes a bunch and a function `t -> ()` that deallocates individual members, so that the logic is encapsulated in a function. You have to pass the destructor function explicitly because the language doesn't know which function is the "canonical" destructor (and a type may have multiple destructors, i.e. multiple ways to end its lifecycle).
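Sketched in Rust with hypothetical names, that encapsulated version might look like:

    struct Bunch<T> { items: Vec<T> }

    impl<T> Bunch<T> {
        // Consume the bunch, running the caller-supplied "destructor" on
        // every element. The caller picks which end-of-lifecycle function
        // applies, since a type may have several.
        fn dispose_with(self, mut destroy: impl FnMut(T)) {
            for item in self.items {
                destroy(item);
            }
            // The bunch's own memory is freed here (implicitly in Rust;
            // in Austral this would be an explicit deallocation call).
        }
    }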
Yes, it seems to me that the only difference is that destructors are called implicitly in C++/Rust and explicitly in Austral.
I wonder if IDE support could help get the best of both worlds: the IDE could automatically add the destructor calls where needed, with an "automatically generated" annotation that lets it update and remove the calls as necessary. I actually kind of like the idea.
Similar things could be done for inference/type deduction.
Possibly exceptions as well.
The IDE could also allow hiding the automatically generated blocks when you want to focus on the code instead of the incidental bits.
Probably these days new languages need to come with LSP support out of the box, so that seems like the best place to add the magic.
Sure, the idea of potentially having different ends to the lifecycle makes sense to me, same way constructors aren't a thing in Rust.
And yes, I agree that this solves the problem, which was why I suggested it as one of three ways this could work if you look back up the thread. Not sure about how Austral can assure itself the Linear contractual obligations were discharged here, but I might be overthinking it.
I dislike Zig for its lack of destructors and "no hidden control flow". I'll probably dislike your language too, but is there one big difference between Zig and Austral?
>It is the Zig programmer's responsibility to ensure that a pointer is not accessed when the memory pointed to is no longer available. Note that a slice is a form of pointer, in that it references other memory.
Zig is very much a manually-managed language. Austral has manual memory (and resource) management, but has type-system tools to help you ensure correctness.
Yes, which is why it's a bad book and you shouldn't tell people to read it. Actual compilers tend to handwrite their parsers rather than using theory; parser generators aren't good at recovery and error messages.
I think writing a parser by hand should be reserved for when
1. You have users.
2. You can demonstrate a marked improvement in performance and/or error reporting with the hand-written parser.
Otherwise, deriving a parser from a grammar should absolutely be the default. Writing parsers by hand is error-prone, causes security vulnerabilities[0], and is very tedious work.
Having written production-level compilers, my advice: use ANTLR for as long as humanly possible. It can remain adequate far, far longer than you think, and will save you months/years of stress.
When you outgrow it, you'll know - and by then, hopefully you'll have funding/users/contributors/etc to support the dev effort of rolling your own parser.
Our compiler stack is in Haskell, so we initially started with Parsec (a parser combinator library) and eventually switched to Happy (the parser generator GHC itself uses), since without a formal grammar defined we were writing a lot of code for every little addition to the language (it's a simple declarative DSL, https://github.com/wasp-lang/wasp).
The problem we're now facing with Happy is that it doesn't really have support for error recovery. We're trying to implement LSP so recovery is important in order to have proper syntax highlighting and error reporting in an IDE.
I'm wondering how well ANTLR covers this; I was quite surprised to find that Happy has so little support for it.
IIRC Happy (and Alex, the accompanying lexer generator) are pretty much straight ports of lex/yacc to Haskell; they predate some of the conveniences of newer parser generators like ANTLR.
I would give the opposite advice. Derived parsers have shit errors, which is arguably the most important feature of a parser besides being able to parse. And writing parsers by hand is trivial anyway; in a lot of cases, if you squint a little, you can see the BNF.
Just write recursive descent, except for infix operator precedence; for that, use a tiny stack that you either push to or pop from (shift/reduce, à la LR) based on the relative precedence.
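One common rendering of that idea in Rust, evaluating as it parses to keep the sketch short (this is precedence climbing, the recursive cousin of the explicit operator stack):

    #[derive(Clone, Copy)]
    enum Tok { Num(i64), Plus, Star, Eof }

    // Binding power of an infix operator, if the token is one.
    fn prec(t: Tok) -> Option<u8> {
        match t {
            Tok::Plus => Some(1),
            Tok::Star => Some(2),
            _ => None,
        }
    }

    struct Parser { toks: Vec<Tok>, pos: usize }

    impl Parser {
        fn peek(&self) -> Tok { self.toks[self.pos] }
        fn bump(&mut self) -> Tok { let t = self.peek(); self.pos += 1; t }

        // Plain recursive descent for atoms...
        fn atom(&mut self) -> i64 {
            match self.bump() {
                Tok::Num(n) => n,
                _ => panic!("expected a number"),
            }
        }

        // ...and a small precedence loop for infix operators: keep going
        // while the next operator binds at least as tightly as `min_prec`.
        fn expr(&mut self, min_prec: u8) -> i64 {
            let mut lhs = self.atom();
            while let Some(p) = prec(self.peek()) {
                if p < min_prec { break; }
                let op = self.bump();
                let rhs = self.expr(p + 1); // +1 makes operators left-associative
                lhs = match op {
                    Tok::Plus => lhs + rhs,
                    Tok::Star => lhs * rhs,
                    _ => unreachable!(),
                };
            }
            lhs
        }
    }

    fn main() {
        // 1 + 2 * 3 evaluates to 7
        let toks = vec![Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3), Tok::Eof];
        let mut p = Parser { toks, pos: 0 };
        println!("{}", p.expr(0));
    }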
Pretty much all compilers that have fast parsers that also give good error messages are hand-written recursive descent.
It's a fine book and you should not tell people to read it.
It belongs in a first compilers course where you learn about how scanners and parsers work, not just "configure this then call `yylex`". So yeah, it nerds out about finite state machines for a little too long for most audiences. And you know your audience. Your audience wants to write a compiler today, so you are correct not to recommend this book.
> Actual compilers tend to handwrite their parsers rather than using theory; parser generators aren't good at recovery and error messages.
The linked article recommends using a parser generator. In my experience, however, I agree with you: parser generators aren't great in practice, which makes me hesitant to read the rest of the post.
The time investment of a good hand-written parser is only worth it once you have exhausted what a grammar-derived parser can provide, and you have users whose workflow would actually improve from the performance/error-reporting gains.
For Clang and GCC it is absolutely a no-brainer to use a hand-written parser.
If you're trying to get to an MVP compiler, or you're building a bootstrapping compiler, using a grammar-derived parser to get to the finish line faster is far more sensible.
What could be easier than writing a grammar in EBNF-ish notation? This gives you the added benefit that the grammar is explicitly defined, and not implicit in the parser code.
It's also far easier to write buggy handwritten parsers. I don't have much experience, but in the project I've seen that used a handwritten parser, the issue tracker was full of parser bugs.
I tend to write lots of tests at each level -- lexer, parser, AST, etc. -- to cover the different BNF grammar statements. I've done this with an XQuery and XPath parser I wrote for IntelliJ.
You can cover the different lexical variations in the lexical tests, and ensure that the lexer is reporting the correct things in the lexer tests. For these, I generally ensure I cover each token in a given BNF grammar statement, even if that means repeating checks for a given keyword, as it keeps the tests self-contained and easier to verify.
You can cover the different parser/grammar variations (different branches/paths in the BNF, missing tokens for error reporting and recovery, etc.) in the parser tests. These can make use of serializing the tree to a file, so your input is an example program and the output is the parsed AST tree.
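A hedged Rust sketch of that serialize-and-compare style (the `parse` function and AST are hypothetical; snapshot libraries like insta automate the file handling):

    #[derive(Debug)]
    enum Expr {
        Num(i64),
        Add(Box<Expr>, Box<Expr>),
    }

    // Hypothetical parser under test; the real implementation is elided.
    fn parse(_src: &str) -> Expr {
        Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))
    }

    #[test]
    fn parses_addition() {
        // Golden-style test: the expected tree lives next to the input,
        // so a failing diff shows exactly how the parse changed.
        let ast = parse("1 + 2");
        assert_eq!(format!("{:?}", ast), "Add(Num(1), Num(2))");
    }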
The AST tests can then test the actual data model your AST provides, such as access to the value of a literal, or information about a type.
Higher-level tests such as constant folding can then be done using the AST API, or things like serializing the AST to a textual representation.
An unambiguous parser is easier to write tests for. It leads to fewer subtle "gotchas" where something syntax checks and compiles but is interpreted by the compiler as something very different than what the programmer intended. It's easier for third parties to write their own parsers and tooling for the language if they can target a formal grammar as opposed to the behavior of your Turing complete code.
But sometimes you don't care about any of those things. Especially if you're talking about a language with a small surface area. Or if you have to deal with legacy stuff (for example, parsing something like C++ that is heavily context sensitive, in which case trying to shoehorn things into a formal CFG so you can feed it to a generator might actually make things much harder than they have to be).
Not always, and that’s a great question! But it can help you catch things that might not be obvious when you first create the grammar, even simple things like ambiguous nested if statements or errors in a string-formatting DSL's syntax.
The article says exactly what I'd say. It spends far too much time explaining the frontend (parsing theory) which isn't the difficult or important part of the compiler, and it somehow makes it sound scary and mathematical.
The rest isn't covered enough and doesn't have enough practical real-world advice.
As someone who has had to clean up (with backwards compatibility) after people who YOLO'd custom grammars rather than actually taking the time to respect the mathematics of parsing and design ambiguity out up front...
I mean, I think you should read a book on parsing theory, just not that one. I like "Parsing Techniques: A Practical Guide" but there might be a newer better one.
But clang and gcc both have handwritten recursive-descent parsers for C/C++. (C++ parsing of course is undecidable, or at least used to be.)
C++ parsing being undecidable, and C++'s Most Vexing Parse, are the classic arguments for developing a grammar using a parser generator, and only when the grammar is mature writing a hand-rolled implementation with the grammar as a reference.
I happen to think generating useful software from declarative grammar specifications is, hmm, not limited to what such frameworks have been capable of to date. Although if you've only worked with yacc/bison, and never ANTLR, you may be underestimating the quality of a generated parser with respect to error recovery and messages. The state of the art has advanced; I don't think anyone would claim parity with artisanal handcrafted software, but we can get a lot more than "barf: shift-reduce bunchanumbers" these days.
I can echo this sentiment. Last year I took a course for my master's where we implemented a compiler end to end. The dragon book helped for most of it, but when it came time to implement the assembler we were unable to get it completely correct, and the book was very scant on information about how best to manage going from virtual to physical registers, handle storing information on the stack, etc. Granted, some of that is platform-specific, but the difference in material from frontend to backend was stark.
The dragon book has a few places which are really good, and then a few places which feel like an afterthought shoved in later, like SSA (which is really important).
We are due a new batch of compiler books. As the utility of investment in compilers has decreased, we are stuck with books that mostly expected x86 to die and be replaced by potentially statically schedulable machines, so a lot of time is spent on algorithms that are really, really fuzzy on a modern core.
Muchnick is still the Bible for optimizations, but it's written in a bizarre pseudo-language, which makes many of the algorithms rather impenetrable.
JITs are not very well covered in the literature. There are a lot of subtle variations in how they design their IRs these days.
I’d update this to Just Use Rust, because now you get the benefit of a decent package ecosystem and most of the benefits of OCaml. There are some messy lifetimes, but something like Rowan and a couple of arenas solves that.
As for parsing, I think Tree-sitter is pretty solid: you get the benefits of a parser generator, plus a fast incremental CST parser.
The extra work required to deal with the lifetime of a graph is pretty unfriendly for a beginner. It's one of the few things for which, though I would be willing to use Rust myself, I might not recommend it to someone else.