The LANGSEC folks (http://langsec.org/) have pointed out that the difficulty of parsing, together with our overconfidence that we can just code up parsers for random languages and input formats whenever we need them, is a pretty pervasive source of security bugs.
A lot of those bugs can occur when you have two different parsers that have a different notion of what language they're supposed to recognize, so it's possible to construct an input whose meaning the two parsers disagree on. That can have pretty serious ramifications if, for example, the first parser is deciding whether a requested action is authorized and the second parser is carrying out the action!
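To make that two-parser failure mode concrete, here's a toy Python sketch (the query-string format and the first-wins/last-wins behaviours are invented for illustration, not taken from any real system):

    # Two hypothetical components parsing the same query string differently:
    def parse_first(qs):
        out = {}
        for pair in qs.split("&"):
            key, _, value = pair.partition("=")
            out.setdefault(key, value)        # keeps the FIRST occurrence
        return out

    def parse_last(qs):
        out = {}
        for pair in qs.split("&"):
            key, _, value = pair.partition("=")
            out[key] = value                  # keeps the LAST occurrence
        return out

    request = "action=read&user=guest&user=admin"
    print(parse_first(request)["user"])   # 'guest' -- what the authorizer checks
    print(parse_last(request)["user"])    # 'admin' -- what the executor acts as

Each parser is perfectly reasonable on its own; the bug only exists in the combination.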
I'm kind of sad about this because I love whipping up regular expressions to extract data even from things that regular expressions technically can't handle correctly. But there's a good argument to be made that this habit is playing with fire much of the time, at least in systems that will end up handling untrusted input. And the Shellshock bug is a recent example of the way that your intuitions about whether your software will "handle untrusted input" in some use case can go out of date.
See also this paper: http://www.ieee-security.org/TC/SPW2014/papers/5103a198.PDF. The authors implement a PDF parser and find bugs in pretty much all of the existing implementations. Basically, PDF is a pretty shitty file format with several ill-defined corner cases. This is one of the reasons PDFs tend to be vectors for security breaches.
The lead author (Andreas Bogk) presented an earlier version of this work at the Chaos Communication Camp in 2011 ("Certified Programming With Dependent Types").
I went to that talk, and found it to be probably the most advanced and difficult math lecture I'd ever attended! (I was very grateful to see such mathematical sophistication brought to bear to protect PDF users, which is almost all of us.)
This is probably a good place to ask: I've wanted to build a language myself -- what's the best place to begin learning about parsers and the like? About a decade ago I asked this question and was told to read the "Dragon book", but I was far too young and lacked experience. Now I really want to get stuck into something outside of my day-to-day web stuff.
Start here: http://nathansuniversity.com/. All the stuff in the dragon book is designed for environments with limited memory and limited computing capability. There is no reason to worry about LL(k) parse table size or predict and follow sets when you don't have to. For most practical purposes you can get away with basic recursive descent or PEG parsers. The link I pointed you to starts with peg.js, which is a PEG parser generator in JavaScript.
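To give a feel for how little machinery basic recursive descent needs, here's a minimal Python sketch for arithmetic expressions (the grammar and all the names are made up for illustration; PL101 itself works in JavaScript with peg.js):

    import re

    # Grammar (made up for illustration):
    #   expr   -> term (('+'|'-') term)*
    #   term   -> factor (('*'|'/') factor)*
    #   factor -> NUMBER | '(' expr ')'
    TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

    def tokenize(src):
        for num, other in TOKEN.findall(src):
            if num:
                yield ("num", int(num))
            elif other.strip():
                yield ("op", other)

    class Parser:
        def __init__(self, src):
            self.toks = list(tokenize(src))
            self.pos = 0

        def peek(self):
            return self.toks[self.pos] if self.pos < len(self.toks) else (None, None)

        def eat(self, kind, value=None):
            k, v = self.peek()
            if k != kind or (value is not None and v != value):
                raise SyntaxError("expected %s, got %r" % (value or kind, v))
            self.pos += 1
            return v

        def expr(self):
            node = self.term()
            while self.peek() in (("op", "+"), ("op", "-")):
                node = (self.eat("op"), node, self.term())
            return node

        def term(self):
            node = self.factor()
            while self.peek() in (("op", "*"), ("op", "/")):
                node = (self.eat("op"), node, self.factor())
            return node

        def factor(self):
            if self.peek()[0] == "num":
                return self.eat("num")
            self.eat("op", "(")
            node = self.expr()
            self.eat("op", ")")
            return node

    print(Parser("1 + 2 * (3 - 4)").expr())   # ('+', 1, ('*', 2, ('-', 3, 4)))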
If you get through PL101 then picking up the stuff in the dragon book or any other book on parsing and compiling technology will be much easier.
Another resource I like is "Compiler Design: Virtual Machines" (http://smile.amazon.com/Compiler-Design-Machines-Reinhard-Wi...). Still going through that one but it is very readable and if you go through PL101 then you'll have all the tools to implement the virtual machines described in that book. It is much easier to write a compiler to target machine code or some other language like C when you've built a few targets yourself and understand the trade-offs involved.
There's also http://www.greatcodeclub.com/. I think one of the projects is a simple virtual machine and another one is a compiler. Well worth the admission price if you're a beginner and want some help getting started.
I learned a lot from PTAPG (Parsing Techniques: A Practical Guide); its introduction specifically says that it avoids heavy theory. It presents all the required algorithms in a very understandable way, and it explains what each algorithm does to exploit this or that feature of the grammar's structure.
Just start writing the parser/compiler/interpreter for your language, today. You will quickly learn what you need: a tokenizer/lexer, some way to represent the AST, how to represent types (if you have any), etc. Later, start looking at the literature to see how "it's really done". I find this approach takes more time, perhaps, but I have a much better appreciation for what I am reading afterwards ("Oh, that's why they did X!").
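For instance, a first cut at "some way to represent the AST" plus a tree-walking evaluator can be as small as this (a Python sketch with invented node names):

    from dataclasses import dataclass

    # Invented node names, just to show one possible AST shape.
    @dataclass
    class Num:
        value: int

    @dataclass
    class BinOp:
        op: str
        left: object
        right: object

    def evaluate(node):
        if isinstance(node, Num):
            return node.value
        ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        return ops[node.op](evaluate(node.left), evaluate(node.right))

    # 1 + 2 * 3, built by hand while the parser doesn't exist yet
    print(evaluate(BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))))   # 7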
The dragon book (at least the first edition) has 330 pages on parsing. 330 pages. IMO that is excessive for a beginner, given how relatively easy it is to implement a parser for a fairly standard language.
Jeffrey Kegler's work on Marpa is pretty exciting, but hasn't got much traction (maybe due to the implementation languages: Perl and Knuth's Literate Programming).
Yeah, Marpa is quite amazing; it inspired me to build nearley. It's probably more powerful than any other parser generator out there right now (and the fastest, because it has Leo optimizations and Aycock-Horspool nullables). In fact, Kegler responded to this same article with http://jeffreykegler.github.io/Ocean-of-Awareness-blog/indiv...
How exactly are PEGs "rather inexpressive"? It's still the same BNF, with some nice bells and whistles added. As for left recursion, it's not a big deal: you mostly need left recursion for binary expressions, and those are much better served by Pratt parsing (which is trivial to mix with Packrat anyway).
I moved to PEG+Pratt exclusively and have never needed anything beyond that, even for the craziest grammars imaginable.
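For anyone curious what that looks like, here's a stripped-down sketch of the Pratt loop for binary operators (Python, with made-up binding powers); the while loop is what replaces left recursion:

    # Illustrative binding powers; the operand handling is simplified to
    # "any token that isn't an operator".
    BP = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expr(tokens, pos=0, min_bp=0):
        node, pos = tokens[pos], pos + 1                 # left operand
        while pos < len(tokens) and tokens[pos] in BP and BP[tokens[pos]] >= min_bp:
            op = tokens[pos]
            # Parse the right-hand side with a slightly higher minimum binding
            # power, so operators of equal precedence associate to the left.
            right, pos = parse_expr(tokens, pos + 1, BP[op] + 1)
            node = (op, node, right)
        return node, pos

    tree, _ = parse_expr("a + b * c - d".split())
    print(tree)   # ('-', ('+', 'a', ('*', 'b', 'c')), 'd')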
One issue with PEGs (and other parser formalisms) is that they don't address unbounded count fields (or bounded count fields in an elegant manner) or offsets. This means a pure PEG can't express e.g. PDF or ZIP files.
To address this, we built a PEG-based parser generator with a few new features, Nail (paper at OSDI '14, github.com/jbangert/nail).
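To make the limitation concrete: a length-prefixed payload makes the parse depend on a previously read value, which a pure PEG cannot state but a few lines of imperative code (or a data-dependent tool like Nail) can. A toy Python sketch with a made-up record layout:

    import struct

    # Hypothetical layout: a 2-byte big-endian count, then that many 4-byte
    # big-endian integers. The count decides how much input we consume, which
    # is the data dependency a pure PEG has no way to state.
    def parse_record(buf):
        (count,) = struct.unpack_from(">H", buf, 0)
        values = struct.unpack_from(">%dI" % count, buf, 2)
        return list(values), 2 + 4 * count   # parsed values, bytes consumed

    blob = struct.pack(">H3I", 3, 10, 20, 30)
    print(parse_record(blob))   # ([10, 20, 30], 14)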
Another issue is that it uses at least as much memory as the input. Not that I'm a PEG expert, but it also basically feels like a formalised recursive descent parser. Nothing wrong with that, but changing the grammar afterwards can have a rippling effect and require much more work than with a traditional LALR parser.
How is it so? If you're referring to the Packrat memoisation (which is not the only possible PEG implementation), you can do a lot of memory optimisation, like discarding memoised entries based on some rules (e.g., once a top-level entry, like a function or a class definition is parsed, all the alternative interpretations can be thrown away). You can memoise complex entries but re-parse simple tokens. And many, many more.
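For anyone unfamiliar with what the memo table actually is, here's a minimal sketch of the shape of the idea (Python, not any particular library's implementation); the dict keyed by (rule, position) is exactly the thing those pruning rules would operate on:

    # Minimal sketch of the memo table: (rule name, position) -> result.
    # A real implementation would also let you drop entries, e.g. everything
    # before a given position once a top-level definition has been committed.
    def packrat(rule):
        memo = {}
        def wrapped(text, pos):
            key = (rule.__name__, pos)
            if key not in memo:
                memo[key] = rule(text, pos)
            return memo[key]
        wrapped.memo = memo            # exposed so a driver can prune it
        return wrapped

    @packrat
    def digits(text, pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (text[pos:end], end) if end > pos else (None, pos)

    print(digits("123ab", 0))   # ('123', 3), computed
    print(digits("123ab", 0))   # ('123', 3), served from the memo this time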
I've been working on a derivative parsing algorithm for PEGs; it only uses memory proportional to the amount of backtracking and grammar nesting (i.e. about the same as recursive descent), but still gives a polynomial worst-case bound on time. An early draft of my paper on it is at [1]; this turned out to be about 6x slower than packrat when I actually built and tested it, but I've come up with a substantial simplification to the algorithm that I'm quite optimistic will have better performance results once I finish debugging my code.
I like this idea! Mind if I steal it for my PEG generator?
This is probably the best property of PEGs: they're extremely extensible and flexible, and you can add features not possible in any other parsing technology, including higher-order parsers, dynamic extensibility, etc.
XL features 8 simple node types: 4 leaf nodes (integer, real, text, name/symbol) and 4 inner nodes (infix, prefix, postfix and block). With that, you can use a rather standard-looking syntax, yet have an inner parse tree structure that is practically as simple as Lisp. The scanner and parser together represent 1800 lines of C++ code, comments included. In other words, with such a short parser, you have a language that reads like Python but has an abstract syntax tree that is barely more complicated than Lisp.
It's also the basis for the Tao 3D document description language used at Taodyne, so the approach has demonstrated its ability to work in an industrial project.
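Roughly, in Python stand-ins (just to make the eight shapes concrete; the real thing is C++), the node types look like:

    from collections import namedtuple

    # Python stand-ins for the eight shapes, not the real C++ classes:
    Integer = namedtuple("Integer", "value")            # 42
    Real    = namedtuple("Real", "value")               # 3.25
    Text    = namedtuple("Text", "value")               # "hello"
    Name    = namedtuple("Name", "value")               # names and symbols
    Infix   = namedtuple("Infix", "op left right")      # A+B, A and B
    Prefix  = namedtuple("Prefix", "op operand")        # sin X
    Postfix = namedtuple("Postfix", "operand op")       # 3!
    Block   = namedtuple("Block", "open child close")   # (A), [A], {A}, indent

    # "A + sin X" comes out shaped roughly like:
    tree = Infix("+", Name("A"), Prefix(Name("sin"), Name("X")))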
I designed the Earl Grey language (http://breuleux.github.io/earl-grey/repl/) on pretty much exactly the same principle; it's pretty interesting to see someone else had the same idea. Operator precedence grammars are surprisingly powerful and pretty simple to write (you can write a working parser in ~100 lines of Python or JavaScript, including support for multifix).
I did simplify the AST further than XL and added a few invariants. I have 3 leaf types (literal, symbol, void) and 3 inner types (send, multi, data). The inner nodes roughly work like this:

    f x            (send f x)
    [x]            x
    [x, y, ...]    (multi x y ...)
    {x, y, ...}    (data x y ...)
    a + b          (send + (data a b))
    + a            (send + (data (void) a))
    a +            (send + (data a (void)))
[] serve as grouping brackets and are therefore purposefully erased from the parse tree if they only contain one expression. Prefix/postfix operators are interpreted as infix operators that have a blank operand on either side. Infix operators are themselves reduced to send/data, so "a + b" is equivalent to "[+]{a, b}". I find that a bit more robust and easier to manipulate generically.
Judging from the article, if you parsed "if 3", you'd get back out
(prefix if 3)
instead of an error, since `if` is just a regular prefix operator. So, presumably there's some sort of well-formedness checking that goes on after parsing? Does anyone know more? The website says very little.
True, 'if 3' will stay there if there is no rewrite for it. You can also treat the lack of a possible rewrite as an error; this is what the Tao 3D document description language does. You'd get an error like "No form matches 'if 3'".
Underneath, there is a rewrite operator, ->, that means to transform the shape on the left into the shape on the right.
The precise semantics are a tiny bit more complicated than that, but it iterates until there is no more transformation that applies. Some specific rewrites will be transformed into "opcodes" of the underlying machine model. So A+B with integers on each side will execute an addition (the inner representation of the transform is something like A+B -> opcode Add).
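As a deliberately naive illustration of "iterate until no more transformation applies" (nothing like XL's real engine), in Python:

    # Each rule is a function that returns a replacement tree or None.
    def rewrite(tree, rules):
        # Rewrite children first, then try the rules at this node; if one
        # fires, the replacement itself gets rewritten until nothing applies.
        if isinstance(tree, tuple):
            tree = tuple(rewrite(child, rules) for child in tree)
        for rule in rules:
            new = rule(tree)
            if new is not None and new != tree:
                return rewrite(new, rules)
        return tree

    # One toy rule, playing the role of "A+B -> opcode Add" for constants:
    def fold_add(tree):
        if (isinstance(tree, tuple) and tree[0] == "+"
                and isinstance(tree[1], int) and isinstance(tree[2], int)):
            return tree[1] + tree[2]
        return None

    print(rewrite(("+", ("+", 1, 2), 3), [fold_add]))   # 6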
Yeah that puzzles me a bit too. Multifix operators are not more complicated than prefix/postfix/infix (well, only marginally). For instance, you can give each operator a left priority and a right priority and merge them when they meet with the same priority. Then `if x then y else z` would become (if/then/else x y z) or something like that. I think that's saner and easier to handle than what he does.
In XL, a multifix is a combination of infix, prefix or postfix. Note that the various nodes exist only for the programmer's convenience, i.e. they represent a user-visible syntax. Internally, I could do with only a node with two children.
Something like [A,B,C,D] writes as:
    Block with [] containing
        Infix , containing
            Symbol A
            Infix , containing
                Symbol B
                Infix , containing
                    Symbol C
                    Symbol D
So this allows me to represent multifix along with their expected syntax. I could also represent for example:
    [A, B; C, D]
    if A then B else C unless D
    for A in B..C loop D

The last one is actually one of the for loop forms in XL.
Multifix is more general than a combination of infix, prefix and postfix, though. For instance, a [] block is a multifix operator, but you can't implement it as a combination.
The advantage of supporting multifix directly is that you get a more convenient and less confusing representation. For instance, given "if A then B else C" why should I expect (else (then (if A) B) C) and not (if (then A (else B C))? It's not obvious, and both are unwieldy compared to a ternary operator like (if/then/else A B C).
Multifix is not more general than combinations of the 8 node types in XL, only more powerful than a combination of some subset.
It's not simpler. XL started with a multifix representation, see http://mozart-dev.sourceforge.net/notes.html. Switching to the current representation was a MAJOR simplification.
The current representation captures the way humans parse the code. Infix captures "A+B" or "A and B". Prefix captures "+3" or "sin x". Postfix captures "3!" or "3km". Block captures "[A]", "(A)", "{A}" or indentation. Since humans perceive a difference, you need to record that difference somewhere. XL records that structure in the parse tree itself, not on side data structures such as grammar tables.
This approach also enables multiple tree shapes that overlap. Consider the following XL program:

    A*B+C -> multiply_and_add A, B, C
    A+B*C -> multiply_and_add B, C, A
    A+B -> add A, B
    A*B -> mul A, B

In that case, I can match a multifix operator like multiply_and_add without needing a representation that would exclude matching A+B or A*B in other scenarios. This is especially important if you have type constraints on the arguments, e.g.:

    A:matrix*B:matrix+C:matrix -> ...
    A:matrix*B:real+C:real -> ...

Those would be checked against X*Y+Z in the code, but if X, Y and Z are real numbers, they would not match.
I have gone the other way with the languages I designed, from the system you describe to multifix, and have had the opposite experience. A proper implementation of general operator precedence grammars is much simpler than implementing prefix, postfix, infix and blocks specifically.
Now, I'm not sure how you implemented multifix, but a good general implementation for it is not immediately obvious, so I suspect you simply used an algorithm that's more convoluted than necessary and made you overestimate the scheme's actual complexity. Implemented properly, though, it's ridiculously simple. To demonstrate, I have hacked together a JavaScript version here (sorry, I didn't get around to commenting the code yet):
The core of the parser is the oparse function, which clocks in at 32 lines of code. Along with a basic tokenizer, order function, evaluator and some fancy display code, the whole thing is barely over 200 LOC. I dug around for XL's source and from what I can tell, what you have is not simpler than this.
I also never suggested making multifix operators like multiply_and_add. Multifix is for operators which are inherently ternary, quaternary and so forth. For instance, when I see "a ? b : c" I don't parse it as prefix, postfix or binary infix, I parse it as ternary.
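For what it's worth, here is roughly how a ternary folds into a Pratt-style loop as a single node (a Python sketch with made-up binding powers, not the JavaScript oparse mentioned above):

    # Made-up binding powers; '?' is handled specially so "a ? b : c" becomes
    # one ('?:', cond, then, else) node instead of nested binary nodes.
    BP = {"+": 10, "*": 20, "?": 5}

    def parse_expr(tokens, pos=0, min_bp=0):
        node, pos = tokens[pos], pos + 1
        while pos < len(tokens) and tokens[pos] in BP and BP[tokens[pos]] >= min_bp:
            op = tokens[pos]
            if op == "?":
                then, pos = parse_expr(tokens, pos + 1, 0)
                assert tokens[pos] == ":", "expected ':'"
                other, pos = parse_expr(tokens, pos + 1, BP["?"])   # right-assoc
                node = ("?:", node, then, other)
            else:
                right, pos = parse_expr(tokens, pos + 1, BP[op] + 1)
                node = (op, node, right)
        return node, pos

    tree, _ = parse_expr("a ? b + c : d".split())
    print(tree)   # ('?:', 'a', ('+', 'b', 'c'), 'd')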
I just found this "old" article while looking for an easy-to-use parser in C++ (in Visual Studio, GCC, and Clang). I think the current state of parsers in C++ is... how to say it... terrible! flex and bison don't look like C++, Boost Spirit is too much ado, ANTLR4 doesn't support C++ yet, and setting up ANTLR3 in Visual Studio could be explained in a few steps if anyone explained it in a straightforward way.
When you look at the simplicity of OMeta[1] you feel the difference. But I am not aware of any (production ready) OMeta implementation for C++.
Both are based on Bryan Ford's packrat parsing (PEG is Ford's too). OMeta is more like PEGTL in that it's a set of facilities/a library at a highish level, more than a particular grammar system (they're all packrat parsers).
pegtl is not actually a packrat parser. Not all PEGs and absolutely not all recursive descent parsers are packrats.
And actually, my experience with packrat parsers in other languages (mostly Ruby) has been that they actually slow things down on moderately complex or larger grammars by massively exploding memory use and thus allocation pressure. Turning memoisation off can make things faster, especially on complex grammars. It's a pretty good case study in how optimizing an O(n^2) worst case to O(n) does not always improve things.
That said, I'm not against the principle, but the shotgun approach to it can be brutally bad. You really only want to memoize the paths that are actually likely to backtrack. Or, for simple grammars where O(n^2) memory use is not going to balloon your memory use too much, it's a clear win.
> Not all PEGs and absolutely not all recursive descent parsers are packrats.
Of course not, see the link I posted. PEGs are TDPL and recursive descent (not the other way around), and an alternative to CFGs. PEGs were coined by Bryan Ford, who then coined packrat parsing based on them.
You want to do smarter things than just memoising everything, yeah. Especially for complex grammars.
I've been working on a new PEG parsing algorithm, and though my first crack at it was way too slow and I still haven't finished debugging the second version, I was able to show that packrat parsing had about a 40% runtime hit over recursive descent for well-behaved inputs on some common grammars (and linear memory usage vs. constant).
Can't edit anymore, but I want to clarify that I mean that the grammar takes a long time to compile. The performance of the resulting parser is pretty good.
I should add ANTLR 4's Adaptive LL(*), ALL(*), to the mix here. It handles any grammar you give it and generates a correct parser, with one small caveat: no indirect left-recursion. It's the culmination of 25 years of focused effort to take LL-based parsing to the limit of power while maintaining simplicity and efficiency. See the tool shootout in the OOPSLA '14 paper I just presented: http://www.antlr.org/papers/allstar-techreport.pdf Until we change the requirements of a parser generator, I'm done. :)
I noticed with the languages I was working on, the problems could be resolved by being smarter with the lookahead: this parser allows for context-free lookahead matching to resolve (or detect and defer) ambiguities.
That makes it possible to do neat things like parse C snippets without full type information or deal with keywords that aren't always keywords (eg, await in C#).
It is nice to see someone summarizing this kind of information. However, this is really a continuation of the academic attitude toward parsers, which makes them MUCH harder than they have to be.
If you want to study grammars in an abstract sense, then think of them this way, and that's fine. If you want to build a parser for a programming language, don't use any of this stuff. Just write code to parse the language in a straightforward way. You'll get a lot more done and the resulting system will be much nicer for you and your users.
> Just write code to parse the language in a straightforward way.
This approach is why many consider parsing to be a solved problem, so it's certainly a valid approach. However, it's not the only valid approach.
For example, "straightforward" parsers often give terrible error messages: when the intended branch (e.g. if/then/else) fails, the parser will backtrack and try a more general alternative (e.g. a function call). Not only does this give an incorrect error (e.g. "no such function 'esle'"), but it might actually succeed! In which case, the parser will be in the wrong state to parse the following text, and gives a nonsensical message (e.g. "unexpected '('" several lines later).
This is an important problem, since these messages can only be deciphered by those who know enough about the syntax to avoid hitting them very often! Inexperienced users will see such error messages over and over again, and have no idea that they're being asked to fix non-existent errors in incorrect positions.
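A toy demonstration of that failure mode, with an invented grammar and names, just to make the mechanics visible (Python):

    # Invented toy grammar: statement = if_stmt / call, with naive fallback.
    def parse_if(words):
        # if NAME then NAME else NAME
        if (len(words) == 6 and words[0] == "if"
                and words[2] == "then" and words[4] == "else"):
            return ("if", words[1], words[3], words[5])
        return None   # the real reason ("esle" is not "else") is thrown away

    def parse_call(words):
        # NAME NAME* -- e.g. "print x y"
        if words and all(w.isalnum() for w in words):
            return ("call", words[0], words[1:])
        return None

    def parse_stmt(words):
        return parse_if(words) or parse_call(words) or ("error", "unexpected input")

    print(parse_stmt("if x then y else z".split()))
    # ('if', 'x', 'y', 'z')
    print(parse_stmt("if x then y esle z".split()))
    # ('call', 'if', ['x', 'then', 'y', 'esle', 'z']) -- "succeeds" as nonsense,
    # so the user gets no error here and a baffling one somewhere later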
That's horrible advice. Some problems are really complex and can't be solved just by working from A to B.
Your parser will probably not end up being able to handle the problems outlined in the article unless you take that theory into account before starting to program.
Which tools in particular do you feel are acceptable? "Just writing the code to parse the language" is almost definitional of shotgun parsers, so it sounds like you are advocating an ad-hoc parser (probably something recursive descent-ish, or perhaps with even less structure than that), but I wouldn't want to put words in your mouth.
ADDENDUM: In particular, the article mentions PEGs. Not in an unambiguously positive light, but nonetheless: do you count PEGs as being a useless overly academic tool?
The LANGSEC project has done (in my opinion) a pretty good job arguing that this behavior causes many security issues in software. Perhaps only to some extent for programming languages in general, but in particular for data format languages and network protocols. The issue is slightly mitigated for programming languages because they typically don't have many wildly varying parsers; but (programming) languages without clearly specified grammars certainly have had problems with this in the past.
Are you familiar with their work? Do you just disagree with their conclusions?
This approach might be why several ubiquitous languages have needlessly ambiguous grammars. Decades of writing tools to be bug-compatible with the original implementation is much harder than learning a tiny bit of theory.
I've wondered why most existing LALR(1) parser generators haven't been replaced with LR(k) parser generators that use more than one token of look-ahead. Given that today's computational power no longer justifies those restrictions, this improvement would be a Pareto improvement, even while it is still being decided which completely different strategies (PEGs, Boolean grammars, parser combinators, etc.) will take over.
Right. It's kinda impressive what it can do. On one end of the spectrum it has all the power of PEGs and on the other it has the predicative capabilities of LL(k) so there is basically no restriction on the kind of grammar it can generate parsers for.
The more generic term for a lot of the stuff he talks about in that post falls under projectional editors and programming language workbenches. JetBrains MPS is one of the tools among many that help with building such editors. Here's a good talk by Markus Völter on using tools like JetBrains MPS.
"The more generic term" is really a term that was re-invented by Fowler, who is no stranger to inventing new terms for old concepts. Syntax Directed Editor is a much older term.
Instaparse, a parsing library for Clojure, tries to make context-free grammars "as easy to use as regular expressions." It uses an EBNF input format that should be familiar to many programmers. https://github.com/Engelberg/instaparse