Hacker News new | past | comments | ask | show | jobs | submit login

This is really quite beautifully done. If you're impatient, go straight to the JavaScript code (but the README is worth reading too):

https://github.com/thejameskyle/the-super-tiny-compiler/blob...

For someone like me who always goes, "Comments? Meh. The code should be the comments!" it's an eye-opener.

The code has a lot of comments. But not this kind:

  i++;  // Increment i
It's about 200 lines of actual code plus another 500 lines of comments explaining what is going on. And 75 lines of old-school ASCII header art like you used to see on line printers back in the '60s, just to appeal to old-timers like me.

It takes you through tokenizing raw code, parsing the tokens into an AST, transforming the AST into a form the code generator can use, and then generating code from it.

A great example of how to teach a concept with a small amount of code and informative comments.




And once you understand the basics, you can move onto something more interesting like this self-compiling compiler:

http://homepage.ntlworld.com/edmund.grimley-evans/cc500/

This one is fun too, since it contains both a bytecode compiler and its interpreter VM:

https://github.com/rswier/c4

(Previously discussed on HN at https://news.ycombinator.com/item?id=8576068 and https://news.ycombinator.com/item?id=8558822 )

I think a Lisp compiler isn't quite as interesting for beginners as something that handles the more usual syntax with precedence etc., since it's so trivial that it leaves the "how does a 'real' compiler parse" question unanswered.

In that respect, I would suggest beginning with a simple recursive-descent parser that can handle the usual maths expressions - then extending it to e.g. all the operators in a C-like language is not a huge leap (table-driven precedence-climbing can be derived from recursive descent and makes this much easier), and it also leads to the opportunity for a compiler that can parse and compile itself - unless you are also writing the compiler in Lisp.

The AST stuff can come later; with expressions, the parser can perform evaluation as it parses, which gives the opportunity to write a calculator. Generating an AST is just replacement of evaluation with creation of tree nodes.

IMHO "it can compile itself" (as well as other non-trivial programs) is a good goal, and really makes people pay attention because it dispels the notion that compilers are somehow "magic".


> In that respect, I would suggest beginning with a simple recursive-descent parser that can handle the usual maths expressions - then extending it to e.g. all the operators in a C-like language is not a huge leap (table-driven precedence-climbing can be derived from recursive descent and makes this much easier), and it also leads to the opportunity for a compiler that can parse and compile itself - unless you are also writing the compiler in Lisp.

I'd suggest starting with a Top Down Operator Precedence ("Pratt") parser, they're beautiful magic for the working man, highly declarative and slightly mind-bending in that they seem so simple they shouldn't work, they make expressions with infix priority a joy to handle.


TDOP, Pratt, precedence-climbing, I think they're all referring to the same, if not extremely similar, method of using a loop in the main recursive function that compares precedences and decides whether to recurse or loop:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing

http://javascript.crockford.com/tdop/tdop.html

They're definitely very simple, and can be derived by refactoring a recursive descent parser and observing that there's no need to make a chain of nested recursive calls to the right level when the function can be called directly to "jump" to the desired level. The simplicity is probably why it's been rediscovered several times and has different names.


TDOP and Pratt are exactly the same thing, Vaughan Pratt is the inventor, Top Down Operator Precedence is the name he gave to the algorithm. And TDOP never "recurses or loops" (the precedence tells it whether it should stop looping). Part of the interest is how object-oriented it is, the main parsing loop ("expression" in the Crockford article) just tells a token "parse yourself as prefix" or "parse yourself as infix preceded by (AST)"


They're the same algorithm expressed in different styles: structured for precedence climbing, and as you say OO or 'data driven' for Pratt.

https://github.com/darius/sketchbook/blob/master/parsing/pre...

https://github.com/darius/sketchbook/blob/master/parsing/pra...


Hi, author here.

> since it's so trivial that it leaves the "how does a 'real' compiler parse" question unanswered.

I'm sure it is tempting to bake everything into this project. After implementing a "real" parser, you might want to implement a "real" transform, an optimizing step perhaps, you could keep going until it's totally unrecognizable (https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...)... the "give a mouse a cookie" effect applies here.

However, trying to teach everything at once isn't an effective way to teach people. I think this is largely why compilers are so unapproachable to most people.

I don't expect someone who has taken even just a simple Compilers 101 class to get anything out of this project. I'm hoping that I can make someone previously unaware of how compilers work to feeling somewhat comfortable with them. For a project that you can read through in 20-30 minutes, I think that's already a pretty big goal.

Self-hoisted compilers are cool, but that would contradict the goal of this project.


Shameless plug: I wrote a very compact and generic lexer, parser and interpreter for a small C-like language. I believe it can be of a great educational value for anyone who tries to understand the basic mechanics of parsing an imperative curly-brace language:

https://github.com/bbu/simple-interpreter


>The code has a lot of comments. But not this kind:

    i++;  // Increment i
It does however have this:

    // Then we increment `current`
        current++;
Buy yes - the comments in the project is written as a tutorial, and are very good.


Aha! Good catch, especially given my comment about the comments.

Yeah, I wouldn't write that comment in production code either. But as you said, it's a tutorial, so why not go for it? :-)


I sometimes comment code like this, including the odd "increment". Because omitting it would disrupt the flow of the code+comments.


I've not read the example and I take your point but I'd much rather see "Then we increment the current so that..."


Maybe at the edge when it doesn't mean anything straightforward like a 'car++' in a meta class where a better name won't help or would be too long. But ya.


I wish every project I had to work on that wasn't "new" would have been written as a tutorial...


   // And a `tokens` array for pushing our tokens to.
   var tokens = [];

   // We're going to create a `value` string that we are going to push
   // characters to.
   var value = '';
Sure OP tried to be super descriptive but there a few comments that you simply don't need at all with at least nice a little understanding.


Author here. I actually walked through all of this code line by line in a conference talk using another thing I built: https://github.com/thejameskyle/spectacle-code-slide

I ripped these code comments from my presenter notes where I was explaining everything in case people couldn't read it.

I'll have to go back in and clean it up


Please DO NOT. You are right in your other comment that it helps with the flow and for example in the case of

   // We're going to create a `value` string that we are going to push
   // characters to.
   var value = '';
your comment declares you intention in advance, even if it might be trivial to deduce.


some people get really weird about comments.

in a case like this, i actually like the dialogue/conversation nature of the comments. i think the opinion of "not needing them" seems miss what they're accomplishing.


Comments about intention are good.

Yes, an experienced dev can infer intent from context (name, project, 'culture'), and even a smart beginner can read ahead to infer intent. But the comment resolves ambiguity in the advanced dev's mind, and saves the beginner's mind from the difficulty of read-ahead inferences (of which he may have several going at once).

I say, keep it.


I am someone who doesnt know much about compilers and I found them helpful. Setting up each variables purpose builds up context in my mind lessening the struggle with the 'real' code.


That was my first thought too. Comments can sometimes be seen as an afterthought, something that you feel obligated to leave for future programmers or managers craving documentation. In cases like this, comments become an active part of the program, documenting and teaching parts of the code to anyone who wants or needs to understand.

It inspires me to leave more thorough, detailed comments, not for my own benefit, but for that of others. This almost reads like a really well-done technical paper in parts. Very neat.


I only glanced at it cuz Im at work. Yet, I found myself glancing longer than I shouldve because of the very things you mentioned. It was an amazing firsr impression.

I'll be looking ahain despite not knowing JavaScript. I bet I jnderstand most of it anyway given attention to readability.


If you’re interested this is called “litterate programming” [1], a form of programming where instead of having a few comments in a large program, you have a few lines of program in a large comment. It’s particularly adapted to tutorials like this one.

[1]: https://en.wikipedia.org/wiki/Literate_programming


Nitpick: this is not literate programming; it's good commenting.

Literate programming means the flow of the code follows the flow of the thought process: you write the story, and insert code where relevant. It includes other bits of code as macros.

From that Wikipedia page:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.[2]

...

Order of human logic, not that of the compiler

...

Instead of comments provided as side notes to source code a literate program contains the explanation of concepts on each level, with lower level concepts deferred to their appropriate place, which allows for better communication of thought.

Example (same page):

    The purpose of wc is to count lines, words, and/or characters in a list of files. The
    number of lines in a file is ......../more explanations/
    
    Here, then, is an overview of the file wc.c that is defined by the noweb program wc.nw:
        <<*>>=
        <<Header files to include>>
        <<Definitions>>
        <<Global variables>>
        <<Functions>>
        <<The main program>>
        @
    
    We must include the standard I/O definitions, since we want to send formatted output
    to stdout and stderr.
        <<Header files to include>>=
        #include <stdio.h>
        @
Not to say these comments aren't good; they're great stuff, they make it very accessible and understandable! But it's not Literate Programming. Just Good Programming :)


I like to call it literate programming because it’s very close to what Knuth was thinking about; there’s no other way to do it in JS.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: