For someone like me who always goes, "Comments? Meh. The code should be the comments!" it's an eye-opener.
The code has a lot of comments. But not this kind:
i++; // Increment i
It's about 200 lines of actual code plus another 500 lines of comments explaining what is going on. And 75 lines of old-school ASCII header art like you used to see on line printers back in the '60s, just to appeal to old-timers like me.
It takes you through tokenizing raw code, parsing the tokens into an AST, transforming the AST into a form the code generator can use, and then generating code from it.
A great example of how to teach a concept with a small amount of code and informative comments.
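To give a flavour of what those four stages hand to each other, here's a rough sketch for one tiny input (my own sketch; the field names are from memory and may differ slightly from the repo):
// A rough sketch of the hand-offs between the four stages for one tiny
// input. Field names are approximate, not copied from the repo.
var source = '(add 2 2)';
// 1. tokenizer(source): characters -> a flat array of tokens
var tokens = [
  { type: 'paren',  value: '(' },
  { type: 'name',   value: 'add' },
  { type: 'number', value: '2' },
  { type: 'number', value: '2' },
  { type: 'paren',  value: ')' }
];
// 2. parser(tokens): tokens -> a Lisp-flavoured AST of nested
//    CallExpression / NumberLiteral nodes
// 3. transformer(ast): rewrites that AST into a C/JS-flavoured one
//    (the call gets an explicit callee and an ExpressionStatement wrapper)
// 4. codeGenerator(newAst): walks the new AST and returns 'add(2, 2);'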
I think a Lisp compiler isn't quite as interesting for beginners as something that handles the more usual syntax with precedence etc., since it's so trivial that it leaves the "how does a 'real' compiler parse" question unanswered.
In that respect, I would suggest beginning with a simple recursive-descent parser that can handle the usual maths expressions - then extending it to e.g. all the operators in a C-like language is not a huge leap (table-driven precedence-climbing can be derived from recursive descent and makes this much easier), and it also leads to the opportunity for a compiler that can parse and compile itself - unless you are also writing the compiler in Lisp.
The AST stuff can come later; with expressions, the parser can perform evaluation as it parses, which gives the opportunity to write a calculator. Generating an AST is just replacement of evaluation with creation of tree nodes.
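For example, a bare-bones recursive-descent calculator for + - * / and parentheses might look something like this (my own sketch, not code from the article):
// Minimal recursive-descent expression evaluator.
// Grammar:  expr   -> term (('+'|'-') term)*
//           term   -> factor (('*'|'/') factor)*
//           factor -> NUMBER | '(' expr ')'
// 'tokens' is an array of strings like ['(', '1', '+', '2', ')', '*', '3'].
function evaluate(tokens) {
  var pos = 0;
  function peek() { return tokens[pos]; }
  function next() { return tokens[pos++]; }
  function factor() {
    if (peek() === '(') { next(); var v = expr(); next(); /* skip ')' */ return v; }
    return parseFloat(next());
  }
  function term() {
    var v = factor();
    while (peek() === '*' || peek() === '/') {
      v = next() === '*' ? v * factor() : v / factor();
    }
    return v;
  }
  function expr() {
    var v = term();
    while (peek() === '+' || peek() === '-') {
      v = next() === '+' ? v + term() : v - term();
    }
    return v;
  }
  return expr();
}
// evaluate(['(', '1', '+', '2', ')', '*', '3'])  ->  9
Swapping the arithmetic in factor/term/expr for construction of tree nodes is exactly the "replace evaluation with creation of tree nodes" step described above.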
IMHO "it can compile itself" (as well as other non-trivial programs) is a good goal, and really makes people pay attention because it dispels the notion that compilers are somehow "magic".
> In that respect, I would suggest beginning with a simple recursive-descent parser that can handle the usual maths expressions - then extending it to e.g. all the operators in a C-like language is not a huge leap (table-driven precedence-climbing can be derived from recursive descent and makes this much easier), and it also leads to the opportunity for a compiler that can parse and compile itself - unless you are also writing the compiler in Lisp.
I'd suggest starting with a Top Down Operator Precedence ("Pratt") parser. They're beautiful magic for the working man: highly declarative, and slightly mind-bending in that they seem so simple they shouldn't work. They make expressions with infix priority a joy to handle.
TDOP, Pratt, precedence-climbing - I think they're all referring to the same, or at least extremely similar, method of using a loop in the main recursive function that compares precedences and decides whether to recurse or loop:
They're definitely very simple, and can be derived by refactoring a recursive descent parser and observing that there's no need to make a chain of nested recursive calls to the right level when the function can be called directly to "jump" to the desired level. The simplicity is probably why it's been rediscovered several times and has different names.
TDOP and Pratt are exactly the same thing: Vaughan Pratt is the inventor, and Top Down Operator Precedence is the name he gave to the algorithm. And TDOP never "recurses or loops" (the precedence tells it whether it should stop looping). Part of the interest is how object-oriented it is: the main parsing loop ("expression" in the Crockford article) just tells a token "parse yourself as prefix" or "parse yourself as infix preceded by (AST)".
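For anyone curious, here's a stripped-down sketch of the idea for binary operators (my own code, not Crockford's):
// Tiny Pratt-style expression parser: every binary operator has a
// binding power, and parseExpression(rbp) keeps looping for as long as
// the next operator binds tighter than rbp.
var BINDING_POWER = { '+': 10, '-': 10, '*': 20, '/': 20 };
function parse(tokens) {
  var pos = 0;
  function peek() { return tokens[pos]; }
  function next() { return tokens[pos++]; }
  // "null denotation": how a token parses itself with nothing to its left
  function nud(token) {
    if (token === '(') { var e = parseExpression(0); next(); /* skip ')' */ return e; }
    return { type: 'NumberLiteral', value: token };
  }
  // "left denotation": how an operator parses itself with an AST to its left
  function led(op, left) {
    return { type: 'BinaryExpression', operator: op,
             left: left, right: parseExpression(BINDING_POWER[op]) };
  }
  function parseExpression(rbp) {
    var left = nud(next());
    while (pos < tokens.length && BINDING_POWER[peek()] > rbp) {
      left = led(next(), left);
    }
    return left;
  }
  return parseExpression(0);
}
// parse(['1', '+', '2', '*', '3']) groups as 1 + (2 * 3)
The "should I stop looping?" decision is just the comparison of the next operator's binding power against rbp.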
> since it's so trivial that it leaves the "how does a 'real' compiler parse" question unanswered.
I'm sure it is tempting to bake everything into this project. After implementing a "real" parser, you might want to implement a "real" transform, then perhaps an optimizing step, and you could keep going until it's totally unrecognizable (https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...)... the "give a mouse a cookie" effect applies here.
However, trying to teach everything at once isn't an effective way to teach people. I think this is largely why compilers are so unapproachable to most people.
I don't expect someone who has taken even just a simple Compilers 101 class to get anything out of this project. I'm hoping that I can take someone previously unaware of how compilers work and leave them feeling somewhat comfortable with them. For a project that you can read through in 20-30 minutes, I think that's already a pretty big goal.
Self-hosted compilers are cool, but that would contradict the goal of this project.
Shameless plug: I wrote a very compact and generic lexer, parser and interpreter for a small C-like language. I believe it can be of great educational value for anyone who tries to understand the basic mechanics of parsing an imperative curly-brace language:
Maybe at the edges, when the code doesn't mean anything straightforward - like a 'car++' in a metaclass - where a better name won't help or would be too long. But yeah.
// And a `tokens` array for pushing our tokens to.
var tokens = [];
// We're going to create a `value` string that we are going to push
// characters to.
var value = '';
Sure, OP tried to be super descriptive, but there are a few comments that you simply don't need at all if you have even a little understanding.
In a case like this, I actually like the dialogue/conversation nature of the comments. I think the opinion of "not needing them" seems to miss what they're accomplishing.
Yes, an experienced dev can infer intent from context (name, project, 'culture'), and even a smart beginner can read ahead to infer intent. But the comment resolves ambiguity in the advanced dev's mind, and saves the beginner's mind from the difficulty of read-ahead inferences (of which he may have several going at once).
I am someone who doesn't know much about compilers, and I found them helpful. Setting up each variable's purpose builds up context in my mind, lessening the struggle with the 'real' code.
That was my first thought too. Comments can sometimes be seen as an afterthought, something that you feel obligated to leave for future programmers or managers craving documentation. In cases like this, comments become an active part of the program, documenting and teaching parts of the code to anyone who wants or needs to understand.
It inspires me to leave more thorough, detailed comments, not for my own benefit, but for that of others. This almost reads like a really well-done technical paper in parts. Very neat.
I only glanced at it because I'm at work. Yet I found myself glancing longer than I should've because of the very things you mentioned. It was an amazing first impression.
I'll be looking again despite not knowing JavaScript. I bet I'll understand most of it anyway, given the attention to readability.
If you're interested, this is called "literate programming" [1], a form of programming where instead of having a few comments in a large program, you have a few lines of program in a large comment. It's particularly well suited to tutorials like this one.
Nitpick: this is not literate programming; it's good commenting.
Literate programming means the flow of the code follows the flow of the thought process: you write the story, and insert code where relevant. It includes other bits of code as macros.
From that Wikipedia page:
The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.[2]
...
Order of human logic, not that of the compiler
...
Instead of comments provided as side notes to source code a literate program contains the explanation of concepts on each level, with lower level concepts deferred to their appropriate place, which allows for better communication of thought.
Example (same page):
The purpose of wc is to count lines, words, and/or characters in a list of files. The
number of lines in a file is ......../more explanations/
Here, then, is an overview of the file wc.c that is defined by the noweb program wc.nw:
<<*>>=
<<Header files to include>>
<<Definitions>>
<<Global variables>>
<<Functions>>
<<The main program>>
@
We must include the standard I/O definitions, since we want to send formatted output
to stdout and stderr.
<<Header files to include>>=
#include <stdio.h>
@
Not to say these comments aren't good; they're great stuff, they make it very accessible and understandable! But it's not Literate Programming. Just Good Programming :)
If you don't need something quite so super-tiny, but don't want to go all-out Dragon book, Niklaus Wirth has released an updated copy of his classic "Compiler Construction" for free, now targeting a subset of his Oberon language. All in 130 pages. No fancy functional methods aimed at a language high in the Chomsky hierarchy, but all the basics are there.
(I consider it one of IT's biggest losses that we don't follow Wirth more, Go notwithstanding)
As an experienced programmer who only recently got into compilers, I have a few, open ended, somewhat advanced, questions about the transformer phase (for anyone to comment, thanks in advance).
It appears to me that the transformer is the guts of the compiler; everything else is more like a support system. And more often than not, the purpose of transformation ends up being 'optimization', though of course you could do transformation for other purposes. However, this phase can end up as very hairy logic if the programmer is not careful.
Anyway, my questions are:
- is there any consensus on creating a DSL for providing transformation rules instead of hand coding them in the programming language?
- There is the nanopass compiler framework in scheme [0]. Is that one of the superior ways of doing things? or are there any serious criticisms about that technique?
- Is there a relationship between compiler transformations and 'term-rewriting systems'?
- OMeta/OMeta2 pattern matching on context-free-languages?
Whether it's the nanopass framework, OMeta2, a term-rewriting system, or some other DSL, I guess what I'm trying to get at is how that phase can be made more 'formal' - so I guess at this point we get into the area of 'formal systems', 'formal semantics', etc.
I would appreciate if there is recommended reading (books, survey papers, etc) on this topic.
There are a wide variety of techniques. Some compilers use a single IR (intermediate representation) as the input and output for each optimization pass. Others use multiple IRs depending on the phase.
If you have a bunch of optimization passes and each pass visits the entire program, the compiler can become slow pretty quickly. Also, there's nothing preventing passes from running in the wrong order, and they may have implicit dependencies that aren't well documented.
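As a toy illustration of what one such pass over a tree-shaped IR can look like (my own sketch, not tied to any particular compiler):
// Toy constant-folding pass: one full walk of the tree, rewriting
// BinaryExpression nodes whose operands are both numeric literals.
function foldConstants(node) {
  if (node.type === 'BinaryExpression') {
    var left = foldConstants(node.left);
    var right = foldConstants(node.right);
    if (left.type === 'NumberLiteral' && right.type === 'NumberLiteral') {
      var result = node.operator === '+' ? left.value + right.value
                 : node.operator === '-' ? left.value - right.value
                 : node.operator === '*' ? left.value * right.value
                 :                         left.value / right.value;
      return { type: 'NumberLiteral', value: result };
    }
    return { type: 'BinaryExpression', operator: node.operator,
             left: left, right: right };
  }
  return node; // literals, identifiers, etc. pass through untouched
}
// foldConstants({ type: 'BinaryExpression', operator: '+',
//                 left:  { type: 'NumberLiteral', value: 2 },
//                 right: { type: 'NumberLiteral', value: 3 } })
// -> { type: 'NumberLiteral', value: 5 }
A real pipeline chains many passes like this over a much bigger IR, which is exactly where the ordering and compile-time concerns above come from.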
I wrote a shift-reduce parser for a minimalistic C-like language where the grammar is entirely specified as a static table. The precedence rules are resolved through a "should_shift" function where the parser asks whether it should perform the reduction for a matched rule, or shift one token more:
Jon Harrop had an article in the F# Journal a while back where he wrote an x86 JIT compiler for a small functional language in about 200 lines of code. It could express Fibonacci programs, and the compiler's performance beat the equivalent F# program on that toy benchmark.
As an aside, I've noticed that some modern compilers (Go and Clang are the ones I've studied recently) bundle the tokenization phase and the lexing phase into a single higher-level lexer. Which is to say that instead of, say, emitting a token pair (PUNCTUATION, "&&"), they produce (LOGICAL_AND, "&&"). It makes sense, but surprised me, since if you look at classical compiler books, they generally promote the more formal two-level pipeline of tokenization before lexical analysis.
It's well written. It's clever. The comment style does a great job of explaining the concepts.
My only gripe is, although translating one high level language (lisp) to another (c) is technically a compiler, in practice most people think of something producing output at a lower level of abstraction than the input.
This would be even neater if it could emit assembly.
I get your point and agree about assembly, but I also think most people would agree that C is a lower level of abstraction than LISP.
One reason to "stop" at C might be that assembly itself tends to be a bit verbose, so not very suitable when aiming for shortness of code as a major feature. Generating assembly would perhaps obscure the things being taught with a bunch of details that are of course interesting, but sort of beside the point. Just guessing, I'm not the original author of course.
I guess I'm coming at this as a novice with a recent interest in compilers. I was working through a tutorial series for writing a compiler in Go, but it got frustrating when I realized the author's example language wasn't going to output anything lower than C.
I've since found a series that does show how to output to assembly (68000 no less) and I'm finding it much more rewarding.
I see the educational value in building a compiler to C if you want to introduce people to parsing, AST transformations, visitor pattern etc.
But a lot of interesting parts about a compiler are lost when you compile to a high-level language instead of machine code: code selection (to a large extent), register allocation, operation scheduling.
Doing a compiler backend right is much more challenging than doing a compiler frontend right, which is why tutorials usually skip it, and also why people are so grateful for LLVM, which gives a common compiler backend for about every platform imaginable.
"High level" is in the eye of the beholder, but I don't know that C is the best example to pick on: Before LLVM it was pretty common to target C as your portable assembly language.
Just to give an example, I think GHC targets C-- as one of the steps in its transformations, and still outputs C when you're porting it to a new platform or building a cross-compiler. I don't know that it's useful to say that GHC ceases to be a compiler when you use it to target weird platforms.
And the Chicken Scheme compiler outputs C code[1] that doesn't look like a human wrote it: The C code uses a trampoline to organize function calls, and it periodically garbage-collects your stack cleaning up old frames as you call functions.
> Just to give an example, I think GHC targets C-- as one of the steps in its transformations, and still outputs C when you're porting it to a new platform or building a cross-compiler. I don't know that it's useful to say that GHC ceases to be a compiler when you use it to target weird platforms.
Cmm is GHC's final intermediate language, which is then passed to the machine code generator backend: asm, llvm, or C (which is now only available in unregisterised mode, which as you note is mostly for getting started on new platforms)
Fun fact: going from LISP to C (and back) is the whole of chapter 5 in SICP [1]. The last few exercises in the book ask you to write a C compiler in Lisp, and then write a C compiler in the Lisp you just created [2].
That brings back fond memories. Back in the day on BIX¹ (is anyone else here old enough to remember BIX?) we were talking about different languages and compilers, and the topic of C++ came up. I said, "C++? That's a preprocessor, isn't it?"
Much to my embarrassment, Bjarne Stroustrup was hanging out in the same BIX conference, and he chewed me out for this - but very politely. Bjarne explained that Cfront was definitely not a preprocessor (which would imply something like the C preprocessor), but an actual honest-to-goodness compiler that just happened to have C as its compilation target.
Now if we could only get the younger generation to abandon that horrible "transpiler" word and call things by their right name: a compiler!
This should not be a surprise. Consider what happens on a modern CPU after the preprocessors do their job, after the compilers do their job, when you're right down to machine code: the bare metal!
But really, how bare is that metal? Not at all.
Your so-called "machine language" just turns out to be another high level language that the CPU itself compiles down to the actual internal instructions that it executes, instructions that look nothing like what you thought was the machine language.
If you want a binary executable, yes. But that's sort of a tautology: If a program produces some output (like C source code or a binary), you usually run it because you intend to do something with its output later (like compile it or interpret it, or execute it on the machine if it's a binary).
Yes, either another compiler to translate it to machine code, or a C interpreter. Years ago, for example, there was a C interpreter that went by the name "Saber C," if memory serves.
>translating one high level language (lisp) to another (c) is technically a compiler
Is it not a transpiler? And isn't a compiler something that goes from a programming language to a computer-readable language like ASM?
Just asking - we can say compiler for everything, I guess (I hear you Stratoscope). In fact, "transpiler" was new to me until recently, from an article about TypeScript.
The answer to your question is "Not really that big": 3.5kloc with comments/whitespace and 1.6kloc without.
Your point is valid that much of this is the library's doing, but I think more of the difference than you're letting on is just from how well-suited functional languages are to this sort of work compared to dynamic scripting languages. Parsec is practically a Haskell standard library, anyway; I wouldn't discount it any more than I would discount using Node's buffer or stream module.
In my opinion, the core function, "compiler", is absolutely beautifully written - the crowning gem of this well-written compiler.
It's abstract without being too abstract (exactly the right level of abstraction), and it's a model of simplicity, comprehensibility (especially for people new to compilers), and elegance:
function compiler(input) {
var tokens = tokenizer(input);
var ast = parser(tokens);
var newAst = transformer(ast);
var output = codeGenerator(newAst);
// and simply return the output!
return output;
}
In short, a great starting point (especially for beginners) to conceptualize and subsequently delve into the depths of what goes on in a compiler.
An A+ and Two Thumbs Up for your efforts. The teaching potential of this compiler (which is always what I'm looking for in compilers) is huge!
Nice, but it needs to say up front what it does (i.e., compile a Lisp-ish PL to a C-ish PL).
EDIT. After a bit of analysis:
The lexer is basically a state machine. Each kind of token consists entirely of characters in a single class, so the usual state machine is modified to give some states a while loop that goes through characters until the token ends.
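The pattern, paraphrased (my rewording, not the repo's exact code):
// Look at one character to pick the token kind, then loop while the
// following characters stay in the same class.
function tokenize(input) {
  var tokens = [];
  var i = 0;
  while (i < input.length) {
    var ch = input[i];
    if (ch === '(' || ch === ')') {
      tokens.push({ type: 'paren', value: ch });
      i++;
    } else if (/[0-9]/.test(ch)) {
      var value = '';
      while (i < input.length && /[0-9]/.test(input[i])) value += input[i++];
      tokens.push({ type: 'number', value: value });
    } else if (/[a-z]/i.test(ch)) {
      var name = '';
      while (i < input.length && /[a-z]/i.test(input[i])) name += input[i++];
      tokens.push({ type: 'name', value: name });
    } else {
      i++; // skip whitespace and anything unrecognised
    }
  }
  return tokens;
}
// tokenize('(add 2 22)') ->
// [ paren '(', name 'add', number '2', number '22', paren ')' ]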
The parser is basically recursive-descent. The grammar is awfully simple, so only one parsing function (walk) is needed.
The transformer traverses the AST, identifies function calls, and produces a new AST with function calls marked. I'm not sure why the transformer is written as a separate piece of code. It seems to me that the functionality it adds could easily be integrated into either the parser or the code generator. And then the compiler would be even tinier.
And the code generator traverses the transformed AST and spits out C-like code.
Thanks! I was just telling people yesterday that I have decided to learn more about either compilers or operating systems. You made my choice easier; I am going with compilers by studying the Super Tiny Compiler :)
For a more advanced but still very approachable JS example check out the lexer and parser in the GraphQL code base (https://github.com/graphql/graphql-js)
I have been working slowly through Sestoft's Programming Language Concepts, and through that I wrote a tiny interpreter in F# of a mini Forth-like language: http://ideone.com/24I5y0
Inspired by this though, I decided I'd try re-writing it into a compiler for it instead, targeting the MicroMini VM[0], and here it is: http://ideone.com/czXhoM
This is a great read and motivates me to create DSLs for everything. That being said, can anyone recommend one or more of their favorites references for creating DSLs? If it is relevant, my target audience are researchers in healthcare.
Thank you. This was a very informative read indeed. I'm currently in the process of rewriting a hobby compiler[1] that I hacked together over a year ago. This time I wanted to do it properly, and hence have been scouting the web for a short, concise, to-the-point tutorial that briefly explains most of the important concepts I need to know before I begin. This was exactly that tutorial.
Having used Bison as well as my own hand-written compilers, I personally would suggest that Bison is not worth it. With recursive descent, you end up writing the same BNF structure anyway, but you don't have to fight the Bison file structure just to decorate the tree, and you can much more easily work around syntax errors.
Whether one uses a parser generator's output as part of a compiler is one thing, but one should at least pass the grammar of the language through a parser generator to let it check the grammar for ambiguity.
As someone just starting out on their first pet language, this is an amazing resource. I've already read a book on the subject and I think I learned more reading this than the book.
It makes me second guess my choice to use a parser generator (PEG.js) which I don't fully understand yet, this makes it seem simple enough that I could write my own.
Interestingly enough, today I was thinking that if we have an AST of a program, we could have tools that apply semantic analyses like ownership in Rust and achieve the same benefits in any language.
Does anyone know of an existing tool like this?
LLVM. Easiest way to write your own compiler pass.
However, it's going to be tricky to handle ownership between threads, as threads are not intrinsic to most languages and therefore not native to the compiler; instead they are just a call to a linked-in function (fork/clone/pthread_create/std::thread). Compilers are unaware that the function actually invokes a system call which performs behaviour that is 'odd' for a function (spawning a new thread). So in order to have some notion of 'ownership' by a thread, you'd need to recognize all the different ways of spawning a new thread for the different operating systems and platforms your compiler should support.
I wrote a tokenizer the other day without really knowing what I was doing and it's reassuring to see that it doesn't work that differently from this one.
No stupid questions. I keep getting similar feedback. I'll do a better job of explaining it.
To answer your question, it's taking a Lisp-style syntax and turning it into a C-style syntax (specifically JavaScript - the output AST is actually a valid Babel AST)
(add 2 (subtract 4 2))
Into
add(2, subtract(4, 2))
If your goal is to demystify a haute-programming concept for people who think of themselves as regular programmers, why not use the most popular language?
It's not like they're saying javascript should be used for all compilers. Even so, there are several javascript-based compilers and transpilers in widespread use today, so even javascript hackers should be able to apply the concepts of compilers in their work.
It's worth pointing out that Lisp is homoiconic, meaning that the code you see has the same structure as its AST, so writing a parser for a Lisp is pretty trivial and is part of the reason why this compiler looks super short. Other languages like C, however...
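For example, (add 2 (subtract 4 2)) already reads like the tree the parser has to build; the resulting AST is roughly this (field names approximate):
// The s-expression mirrors its own parse tree; the parser only has to
// spell the nesting out as objects.
var ast = {
  type: 'CallExpression',
  name: 'add',
  params: [
    { type: 'NumberLiteral', value: '2' },
    {
      type: 'CallExpression',
      name: 'subtract',
      params: [
        { type: 'NumberLiteral', value: '4' },
        { type: 'NumberLiteral', value: '2' }
      ]
    }
  ]
};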
The "code generator" part of this compiler doesn't do much, since it only translates from one high level language to another, not machine code which would understandably be far more complicated, as other commenters have pointed out.
This is a compiler from an altitude of 50km in the stratosphere, and won't actually teach you much about real world compilers that do so much more under the hood. Sorry to be Mr. Cranky HN commenter (ok, I'm not sorry).
https://github.com/thejameskyle/the-super-tiny-compiler/blob...