Let's make a Teeny Tiny compiler (utk.edu)
256 points by whack on June 6, 2020 | 44 comments


If you want to dive deeper into this topic, I also recommend the wonderful book Crafting Interpreters by Bob Nystrom. It can even be read for free online: http://craftinginterpreters.com/

I'm not affiliated with it, but I devoured it recently and thought it was amazing.


Author of Teeny Tiny here! I really like Crafting Interpreters too (I link to it at the bottom). Also highly recommend the book Writing an Interpreter in Go.


Link to Writing an Interpreter in Go website:

https://interpreterbook.com/


I really do appreciate this book. It's absolutely amazing, especially since it's free, but I kind of dislike how in some parts the code is just shown without explaining the theory behind it. I doubt you could follow the book very well without having almost all of the source code laid out in front of you.

I'd much rather have a language-agnostic explanation that does not rely on certain classes or patterns that only work well in the world of OOP languages.


'tis a good book.

Appel's Modern Compiler Implementation in X (if you choose anything other than ML you don't deserve to live) is a good follow-on.


I’ve got a bunch of links about compilers in a GitHub repo:

https://github.com/melling/ComputerLanguages/blob/master/com...


Nice! I have been collecting a list as well :) https://gist.github.com/cellularmitosis/1f55f9679f064bcff029...


What I always admired about the ZX-Spectrum's Basic was that it skipped the entire idea of a lexer, because, well, the keyboard let you input tokens directly.


I think it was closer to the other way around. The keyboard let you input tokens because it didn't have a lexer.

I suspect this is due more to the memory cost of storing a literal line of code than to the ROM or CPU cost of a lexer. Going back to the ZX-80, storing even the current line being typed as characters was probably a burden.


That can’t be it. The Basic interpreter could run the lexer on each line the user typed and store the tokens. You would lose the ability to print lines exactly the way the user typed them, but that isn’t a big blocker for the technology of the time.

Alternatively, the Basic interpreter could store each line as a combination of tokens and characters. Commodore Basic did that (https://www.masswerk.at/nowgobang/2020/commodore-basic-renum...).

I also think none of the microcomputer Basics stored keywords as ASCII strings. It would both slow down the interpreter and use more memory.
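To make the memory argument concrete, here's a rough Python sketch of keyword tokenization. The token table is made up for illustration (real machines like Commodore BASIC used one byte per keyword, with codes at 0x80 and above), and the word splitting is deliberately naive:

```python
# Rough sketch: tokenizing a BASIC line saves memory versus raw ASCII.
# The token values below are made up; they are NOT any real machine's table.

KEYWORDS = {"PRINT": 0x99, "FOR": 0x81, "NEXT": 0x82, "GOTO": 0x89}

def tokenize(line: str) -> bytes:
    """Replace each keyword with a one-byte token; pass other text through."""
    out = bytearray()
    for word in line.split(" "):  # naive: ignores keywords inside strings
        if word in KEYWORDS:
            out.append(KEYWORDS[word])
        else:
            out.extend(word.encode("ascii"))
        out.append(ord(" "))
    return bytes(out[:-1])  # drop the trailing space

line = 'PRINT "HELLO"'
print(len(line), len(tokenize(line)))  # tokenized form is shorter
```

A five-character keyword shrinks to a single byte, and the interpreter can dispatch on that byte directly instead of re-scanning keyword text on every loop iteration.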


I can confirm that AppleSoft ][ BASIC stored its programs in memory in tokenized form.

On another note, I went ahead and wrote a code emitter, because I couldn't wait for part three (though I noticed it's in the GitHub repo). Once you get this compiler up and running the sane thing to do is write a Makefile so you can treat Teeny as a first-class programming language. Just add a Makefile with the following:

    %.c : %.tiny
    <TAB>python3 ./teenytiny.py $< > $@
And you'll be able to type `make hello` and have hello.tiny compiled to hello.c compiled to hello automagically. I can never pass up an opportunity to remind people that Make is wonderful.


AFAIK the keyboard was also just very cheap and a pain to type on.


The keyboard was a disaster. My dad had an old teletype keyboard and adapted that to it when the original membrane died. I was pretty impressed.


I don't remember complaining about the keyboard. The only issue was losing memory because the memory expansion (16 KB) had bad contacts.


Ha, my dad tells me we had a strict rule of no running in the house while he was working on the Spectrum for this exact reason.


I wrote a simple BASIC interpreter inspired by the Sinclair, over here

https://github.com/skx/gobasic


Which tied into how they could make the keyboard cheaper. Remember the rubber keys? :-)


I wonder if he left out the

    LET nums = nums - 1
on purpose.


Unfortunately this is not yet a compiler; it's a lexer and a parser. Traditionally these are the first things you present in a compilers course, but from my point of view, that's kind of dumb, for two reasons.

First, a parser isn't a compiler, so this means you get through the whole first part of the course without having a compiler. By contrast, a code generator is a compiler, at least if there's some way to invoke it.

Second, although the theory of formal languages and parsers is complex, well-developed, and fascinating, it's not clear that it's important to building a compiler. People have written large systems in FORTH. MUMPS famously ran a number of hospital systems without being able to break a statement across lines. And a complete scannerless BNF grammar for S-expressions (without readmacros) is something like this:

    <sexp> ::= <idchar>+ " "* | "(" " "* <sexp>* ")" " "*
And many, many huge systems have been written in Lisps whose grammars amount to little more than that.

My favorite grammars are PEGs, which accommodate scannerless parsing rather better than LL(1) and LALR(1) parsers, because PEGs have infinite lookahead with worst-case linear-time parsing (at the expense, it must be admitted, of being a huge memory hog). PEGs are also composable in a way that LALR and LL grammars aren't. I wrote a one-page PEG parser generator targeting JS at https://github.com/kragen/peg-bootstrap/blob/master/peg.md a few years back.
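The two PEG properties mentioned here, ordered choice and memoized (packrat) parsing, can be shown in miniature. This toy rule is hypothetical and has nothing to do with the linked generator; `lru_cache` plays the role of the packrat table, caching each (rule, position) result so no position is re-parsed:

```python
# PEG ordered choice:  AS <- "a" AS / "a"
# The first alternative that matches wins, so this greedily consumes a run
# of "a"s. lru_cache memoizes (rule, position), i.e. packrat parsing.
from functools import lru_cache
from typing import Optional

TEXT = "aab"

@lru_cache(maxsize=None)
def match_as(i: int) -> Optional[int]:
    """Return the index after a match starting at i, or None on failure."""
    if i < len(TEXT) and TEXT[i] == "a":
        rest = match_as(i + 1)          # try the longer alternative first
        return rest if rest is not None else i + 1
    return None

print(match_as(0))  # index just past the run of "a"s
```

The memo table is where the "huge memory hog" cost comes from: it holds an entry per rule per input position, in exchange for the worst-case linear time.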

But I think an excessive focus on the syntax of a programming language really detracts from what's really revolutionary about programming languages (and compilers), which is their semantics. We have much better theories of semantics now than we had in the 1970s when the traditional compilers course was being laid out. Books like Essentials of Programming Languages can be wonderful introductions to the fuzzier ones, and there's a lot of work in things like Coq and Idris to come up with tractable formalizations. But you don't need much of a theory of semantics to get a simple compiler up and running!

But I haven't ever actually taught a compilers course, so maybe my ideas are miscalibrated about what students would enjoy and find motivating (having a working compiler for a simple language after doing the first problem set) and what students will find difficult out of proportion to any rewards it might bring (getting recursive-descent parsers for complex grammars working, debugging precedence rules, refactoring grammars to eliminate left recursion, encountering unexpected exponential-time parse failures, etc.).


CMU's Compilers course (15-411)[1] does a lot of what you want. Students write code generation code in the very first assignment. A particular delight of this approach is that the first assignment is for compiling straight line code while one of the later ones is for optimization. If you implement SSA form for the optimization assignment, all of the test cases from the first assignment compile down to effectively one precomputed result. Very clear marker for how far you've progressed!

[1] https://www.cs.cmu.edu/~janh/courses/411/16/
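The "straight-line code folds down to one precomputed result" effect can be illustrated with a toy constant folder. Expressions here are nested tuples like `("+", lhs, rhs)`; this is my own sketch, not 15-411's actual IR or optimizer:

```python
# Toy constant folding: straight-line arithmetic over literals collapses
# to a single precomputed value at "compile time".

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def fold(expr):
    if not isinstance(expr, tuple):
        return expr                      # literal int or variable name
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return OPS[op](lhs, rhs)         # precompute the whole subtree
    return (op, lhs, rhs)                # keep non-constant parts intact

# (1 + 2) * (10 - 4) folds all the way down to a constant
print(fold(("*", ("+", 1, 2), ("-", 10, 4))))
```

Run over a test case with no inputs, the entire program becomes one constant, which is exactly the "very clear marker" the comment describes.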


Thanks! This looks really cool! There are a number of courses coming out with this kind of philosophy. I really like what I've seen of the nanopass framework, for example.


I'm really glad you said this, because it articulates something I've thought for a long time.

The focus on syntax in compilers literature is basically bike-shedding. People focus on syntax because they understand it, but code generation is much harder.

Matt Might's blog is a rare exception.

Do you have any resources on code generators?


I don't know enough about code generation to offer confident recommendations! I mean I really liked Sorav Bansal's dissertation, Peephole Superoptimization, and in recent years there's been a bunch of noise about using SMT solvers. But what's the mainstream? What are the standard techniques, and what are their strengths and weaknesses? I have very little idea.

Still, I don't think it's fair to say without qualification, "Code generation is much harder." Very simple code generation can be done by pasting together canned code fragments (the original meaning of "compiler") and occasionally computing and encoding a jump offset. A very simple code generator like the one in StoneKnifeForth https://github.com/kragen/stoneknifeforth is simpler than the parser needs to be for many popular languages. However, at least with my limited knowledge, it appears to me that parsing is a relatively closed-ended problem — sure, you can work hard to improve your error detection and recovery, and to give more useful error messages, but you're apparently going to get very little return for even enormous efforts at that. Optimization, on the other hand, which is part of code generation, is potentially arbitrarily complex, and you can keep getting good returns on your efforts for quite a long time.

So I would say that the easiest code generation is usually easier than the easiest parsing, unless you have the liberty to choose your language to make it easier to parse. But the hardest code generation is much, much harder than the hardest parsing. (Again, unless you're parsing a language deliberately designed to be difficult!)
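The "pasting together canned code fragments and occasionally computing a jump offset" style can be sketched in a few lines. The one-byte opcodes and the two-pass backpatching scheme below are invented for illustration, not any real ISA or StoneKnifeForth's actual encoding:

```python
# Simplest-possible code generation: each op is a canned byte template;
# the only computed part is the jump offset, filled in by backpatching.

OPCODES = {"PUSH": 0x01, "ADD": 0x02, "JZ": 0x03, "HALT": 0xFF}

def assemble(program):
    """program: list of (op, arg) tuples; ("LABEL", name) defines a target."""
    code = bytearray()
    labels = {}    # label name -> code offset
    fixups = []    # (position of placeholder byte, target label)
    for op, arg in program:
        if op == "LABEL":
            labels[arg] = len(code)
        elif op == "JZ":
            code.append(OPCODES["JZ"])
            fixups.append((len(code), arg))
            code.append(0)                      # placeholder offset
        elif op == "PUSH":
            code += bytes([OPCODES["PUSH"], arg])
        else:
            code.append(OPCODES[op])
    for pos, label in fixups:                   # backpatch jump targets
        code[pos] = labels[label]
    return bytes(code)

prog = [("PUSH", 0), ("JZ", "end"), ("PUSH", 1), ("ADD", None),
        ("LABEL", "end"), ("HALT", None)]
print(assemble(prog).hex())
```

Everything here is table lookup and concatenation; the fixup loop is the only place any arithmetic happens, which is why the easy end of code generation can undercut the easy end of parsing.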


> We have much better theories of semantics now than we had in the 1970s when the traditional compilers course was being laid out.

Can you recommend any books for catching up?


I don't know enough about the subject to offer confident recommendations. I really liked EOPL but it's not trying to be logically rigorous in the way CompCert is.


Thankful this is a dialect of BASIC! The other great references mentioned here parse C-like languages. Although I could try figuring it out myself, I'm interested to see a good tutorial that does parse BASIC syntax. I'm also interested to see how it's done in Python (whereas many use C/C++ or Java, which have many features that make things a little easier). Looking forward to seeing this series finished.


I really like this.

Python is the first language I learned. So it's like this is being written in my native tongue.

I've read crafting interpreters, numerous times both before and after learning some Java basics, but seeing it in Python made it click.

Also love the way you used classes. That helped clarify some confusion I had with classes in Python.


I'm working through Crafting Interpreters right now, converting the Java to Python as I go. Most things are fairly straightforward to convert, but there are definitely some topics that I've had to learn about in Java (not a language I know very well) so I could figure out how to do them in Python. Really great book and great learning project.


REPEAT seems superfluous, as only ENDWHILE is needed to end the loop. And, oh yes, who forgot to decrement nums?

Edit: s/N/nums


Python!? Anyway, I'll wait for parts 2 and 3.


I released Part 2 yesterday! (I had forgotten to update the link, which is now fixed.)

http://web.eecs.utk.edu/~azh/blog/teenytinycompiler2.html


Not a fan of people sneaking referral links here. You could have easily recommended the book without the referral link and if you were going to link it, be open about it being a referral link.


Didn’t mean to, sorry. I just copied and pasted it from the blog post (which is labeled). I’ve removed it.


You don't need to apologize. Their comment is uncalled for and just seems mean-spirited. I thoroughly enjoyed your article and am looking forward to future posts. I hope you post them here when they're up! Cheers.


It wasn't mean spirited and it was justified. If people are using link shorteners to hide the fact that their link is a referral, then it invites everyone to sneak referral links in their comments here. Do you really want that?

He removed that link from his comment and he apologized because it was the right thing to do. Also, virtue signaling to encourage bad behavior isn't right, no matter how good or superior you think it makes you feel. Especially considering you weren't smart enough to realize that I wasn't criticizing the referrals in his article ( valid ), I was criticizing the shortened link in his comment which was a referral.


Yeah, no. In civil discourse you give people the benefit of the doubt that they weren't doing something for nefarious purposes but that they maybe made an honest mistake. In this case it was simply a copy paste error which they pointed out. They apologized because they were being civil and polite not out of some admission of guilt. Somehow you were the one that missed all of this. I think you can see what others thought of your tone as well. Both of your comments here come across as very angry.


> In this case it was simply a copy paste error which they pointed out.

Right and life went on before you decided to play "hero".

> They apologized because they were being civil and polite not out of some admission of guilt.

But you told him not to apologize. So that means you told him not to be civil? Also, it's not polite to speak for others as it isn't part of civil discourse.

> I think you can see what others thought of your tone as well.

Others? Just you. And it's rather childish to "appeal to downvotes".

> Both of your comments here come across as very angry.

"Yeah, no."

Listen, the only one causing drama here is you. I made one innocuous and helpful comment. The guy responded and everything was settled. The conversation had nothing to do with you and yet here you are.


There seems to have been an explosion of such articles within the past few years, despite things like Crenshaw's tutorial existing for several decades; I wonder why a large number of people have suddenly started thinking "I understand this, I should write an article about it".

My favourite demonstration of a "teeny tiny compiler" is still C4: https://news.ycombinator.com/item?id=8558822 Perhaps it inspired all these compiler articles?


Always enjoyed Fabrice Bellard's work, like https://bellard.org/otcc/ , which is 2048 bytes and compiles itself, and https://bellard.org/tcc/ . His TinyEMU and JavaScript PC emulator, which boots several OSes including Linux, are cool as well.


<I deleted this because it wasn't nice>


Actually, my students were having trouble following other tutorials so I wrote this for them. (I also just love talking about compilers!)

I’d summarize my tutorial as a shorter version of Crenshaw’s.

I love seeing more and more compiler tutorials. They all have different goals and perspectives.


Sorry, my comment wasn't in the HN spirit.


I've been on many internet forums for the past 20 years. This is the first time I've ever seen such a candid yet terse retraction. This is what I like about HN as a community (either that, or the moderation is out of this world).


It's disturbing to see such praise for censorship, unless that comment was really crossing the line. (I didn't see it.) Proof that HN is as much of an echo chamber as all the others, despite some discussion suggesting that it isn't.

(...and meanwhile I get downvoted for making an observation.)



