
I have used lex/yacc a number of times, and written several parsers by hand, and I don't see anything really wrong with using compiler compilers. Sure, it's good to have the experience of writing parsers by hand, so that you know how it works. But a compiler compiler is probably more likely to be correct and efficient, especially while the language is still being developed and things are changing rapidly.

The best of both worlds is a library like Parsec in Haskell, which lets you write in the native language, but feel like you're writing in a DSL. Parsec is a breeze and easily my favorite thing with which to write parsers...
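For anyone who hasn't tried it, here's a minimal sketch of the feel (toy example, my own names, not from any real project): a parser for pairs like "(1, 2)" that reads almost like the grammar itself, yet is ordinary Haskell all the way down.

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- number ::= digit+
    number :: Parser Integer
    number = read <$> many1 digit

    -- pair ::= '(' number ',' number ')'
    pair :: Parser (Integer, Integer)
    pair = do
      _ <- char '('
      spaces
      a <- number
      _ <- char ','
      spaces
      b <- number
      spaces
      _ <- char ')'
      return (a, b)

    main :: IO ()
    main = print (parse pair "" "(1, 2)")   -- Right (1,2)

Because the combinators are plain functions, you can factor, test, and reuse pieces of the grammar like any other Haskell code.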




I definitely don't think there's anything wrong with using them, but my experience has generally been more pleasant without one. I'm also not sure you can really understand the benefits of using a CC, or the reasons they impose the limitations they do, if you've never even attempted to go without one. And I think that when you're first learning about parsing, you'll learn more, faster, by doing it yourself than by trying to find your way around the rather frustrating tooling built up around CCs.

To put it more clearly: I'm glad to see this perspective expressed, since it's a relatively rare one to see stated authoritatively. There are a few topics in parsing that I don't think get enough discussion.

Like, my pet peeve is people assuming that a naive implementation of a packrat parser is guaranteed to perform better than a plain recursive descent parser. In reality you're often just trading explosive memory growth for a reduction in algorithmic steps, and in a lot of cases the memoization overhead actually makes it slower.
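To make that trade-off concrete, here's a rough sketch of where the cost hides (all type and function names are mine, not from any particular library): a naive packrat parser wraps every rule invocation in a memo-table lookup keyed on (rule, input position), so the table can grow to roughly #rules * input-length entries, and every hit or miss pays for a Map operation that a plain recursive descent parser never performs.

    import qualified Data.Map.Strict as M
    import Control.Monad.State (State, get, modify)

    type Pos  = Int
    -- rule name + starting position -> end position, if the rule matched there
    type Memo = M.Map (String, Pos) (Maybe Pos)

    memoized :: String                                     -- rule name
             -> (String -> Pos -> State Memo (Maybe Pos))  -- the underlying rule
             -> String -> Pos -> State Memo (Maybe Pos)
    memoized name rule input pos = do
      table <- get
      case M.lookup (name, pos) table of
        Just cached -> return cached            -- already parsed here: reuse the answer
        Nothing     -> do
          result <- rule input pos              -- parse, then record the result
          modify (M.insert (name, pos) result)
          return result

A plain recursive descent parser just calls the rule directly; whether the re-parsing you save pays for all those lookups and inserts depends entirely on how much backtracking the grammar actually does.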


Parser generators are a different issue from lexer generators, imo. With a lexer it usually ends up being a lot easier to do it by hand, because otherwise you end up fighting the system or writing awful regular expressions. Parser generators at least let you use nice formats like BNF.

Edit: also your lexical spec is much less likely to change than your grammar is.
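As a concrete illustration (toy token set, names are mine): a hand-rolled lexer is often just a handful of pattern-matching equations, with no regex file to fight.

    import Data.Char (isAlpha, isDigit, isSpace)

    data Token = TNum Integer | TIdent String | TPlus | TLParen | TRParen
      deriving (Show, Eq)

    lexer :: String -> [Token]
    lexer []       = []
    lexer ('+':cs) = TPlus   : lexer cs
    lexer ('(':cs) = TLParen : lexer cs
    lexer (')':cs) = TRParen : lexer cs
    lexer (c:cs)
      | isSpace c = lexer cs                      -- skip whitespace
      | isDigit c = let (ds, rest) = span isDigit (c:cs)
                    in TNum (read ds) : lexer rest
      | isAlpha c = let (w, rest) = span isAlpha (c:cs)
                    in TIdent w : lexer rest
      | otherwise = error ("unexpected character: " ++ [c])

Extending it (keywords, source positions, string escapes) is just editing ordinary code, rather than fighting the generator's regex language.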


so i really just want to say something like, "yeah, but Parsec is still awesome," but then i saw someone got downvoted for a similar comment with Rob Pike in place of Parsec, so what can i add to the discussion?

perhaps just that it's important to keep in mind that computer systems are things we've worked out for ourselves. these notions of "lex" and "parse" are not things given to us by nature...

...although, they do kind of fall out of the circumstances of (1) needing to emit x86 instructions and (2) the preference for writing programs in text editors. the second point gets a lot of discussion, what with ideas about editors understanding parse trees and all, but i wonder what happens to the whole lexer-parser dichotomy if we keep (2) and rethink our notions of hardware. does the need for tokenization appear as a result of something fundamental to the von neumann architecture, or is it just a result of the instruction sets currently in vogue?

ah well, back to making things happen with the tools at hand.


The argument for lexers has nothing to do with machine instructions: it has to do with algorithmic performance. Grammars tend to have time complexity that is greater than linear, with a moderate constant (and a parser combinator library, which is really just a lightweight programming abstraction that makes it easier to develop by-hand recursive descent parsers, normally doesn't try to help with this problem), whereas a good lexer tends to have linear time complexity with a nearly zero constant.

If you can separate your compilation phase into "first run a lexer, then run a grammar", you can parse much longer files (not just constantly longer, but asymptotically longer) in the same amount of time. There is no fundamental reason to separate these phases, however, and it has minimal effect on the resulting compiler: numerous compilers, even numerous compiler generators, have these phases combined into a single grammar step, with the tokens now being effectively characters. (There are also engines that try to slide between the two levels, using an algorithm more similar to a lexer at the base of your grammar, but still letting you define these pseudo-tokens in the unified grammar as rules.)
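A toy illustration of the shape of that split (names are mine): phase one is a single linear pass from characters to tokens; phase two, whatever its complexity, is then measured in tokens rather than characters.

    data Token = TNum Integer | TPlus deriving (Show, Eq)

    -- Phase 1: one left-to-right pass, linear with a tiny constant.
    -- (Here it's just whitespace splitting; a real lexer is still O(n).)
    tokenize :: String -> [Token]
    tokenize = map toTok . words
      where toTok "+" = TPlus
            toTok w   = TNum (read w)

    -- Phase 2: the "grammar", working over [Token] instead of [Char].
    --   sum ::= number ('+' number)*
    parseSum :: [Token] -> Integer
    parseSum (TNum n : TPlus : rest) = n + parseSum rest
    parseSum [TNum n]                = n
    parseSum ts = error ("parse error near " ++ show ts)

    main :: IO ()
    main = print (parseSum (tokenize "1 + 20 + 300"))   -- 321

Collapsing the two phases just means phase two's cost is paid per character instead of per token, which is exactly the asymptotic difference described above.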


I've personally found that even when working with parser generators or handwritten parsers that don't need the separation, it still helps to treat them as separate. Having a stable, solid set of fundamental tokens makes the resulting language easier to understand. Whenever I see a language (usually made with a PEG generator) that blurs these lines, everything feels very shifty. Yes, I know these are very touchy-feely attributes I'm describing, and you can obviously avoid those problems without that separation too, but these things still matter.

It's an interesting accident, to me at least, that this separation turned out to be both optimal and useful.



