I love parser combinators. It's a much more natural 'format' than parsing with regular expressions, which is what I have to do at $DAY_JOB.
I like using Elm's JSON library as an example because it's very small. Since there are only a handful of functions, you can understand the entire thing pretty easily. Plus, encoding JSON is a great jumping-off point into the decoding half of the library, which is basically a simple parser combinator system.
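To give a feel for how small that surface is, here's a rough Haskell sketch of the same shape of API (illustrative names only, not Elm's actual module, and with proper JSON escaping glossed over):

  import Data.List (intercalate)

  -- A whole encoder API is just a handful of functions producing an opaque Value.
  data Value = VString String | VInt Int | VList [Value] | VObject [(String, Value)]

  string :: String -> Value
  string = VString

  int :: Int -> Value
  int = VInt

  list :: (a -> Value) -> [a] -> Value
  list f = VList . map f

  object :: [(String, Value)] -> Value
  object = VObject

  -- encode is the only "exit": turn a Value into JSON text
  encode :: Value -> String
  encode (VString s)  = show s
  encode (VInt n)     = show n
  encode (VList vs)   = "[" ++ intercalate "," (map encode vs) ++ "]"
  encode (VObject kv) = "{" ++ intercalate "," [show k ++ ":" ++ encode v | (k, v) <- kv] ++ "}"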
I'm normally a recursive-descent man, but I implemented parsec-style parsing in Java once, just because I was curious to see what the world looked like when it was on fire. I don't have the code, as this was just me wasting company time during a hackathon they insisted we all attend. I did get it working, despite the type system I was working in.
What are the benefits of these monadic builder interfaces over simply using pure functions? One can imagine a simple Html type with HTML tag functions, e.g.
`div :: [Attributes] -> [Html] -> Html`
Instead of using do notation to build up multiple children, one would simply supply a list of children. An added benefit is being able to use the simpler map/filter/filterMap/mapMaybe instead of their monadic counterparts.
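A minimal sketch of what that list-based interface could look like (made-up types and names, not an actual library):

  data Attribute = Attr String String
  data Html = Node String [Attribute] [Html] | Text String

  div_ :: [Attribute] -> [Html] -> Html
  div_ = Node "div"

  userRow :: String -> Html
  userRow name = Node "li" [] [Text name]

  -- children are a plain list, so ordinary map/filter apply directly
  page :: [String] -> Html
  page users = div_ [Attr "class" "users"] (map userRow users)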
The builder interface has important performance benefits over working with lists: similar to e.g. concatenating many small Strings vs using StringBuilder in Java.
The builder interface indeed does not need to be monadic; you can simply append things using <> and get the same benefits.
The monadic interface is purely a syntactic enhancement on top of that, to reduce the number of operators you need.
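For illustration, here's roughly what that looks like with Data.Text.Lazy.Builder from the text package (a sketch, not blaze-html itself):

  import qualified Data.Text.Lazy as TL
  import Data.Text.Lazy.Builder (Builder, fromString, toLazyText)

  -- Each fragment is a Builder; <> is cheap, and the whole document is
  -- materialised once at the end, much like Java's StringBuilder.
  row :: String -> Builder
  row name = fromString "<li>" <> fromString name <> fromString "</li>"

  document :: [String] -> TL.Text
  document names =
    toLazyText (fromString "<ul>" <> foldMap row names <> fromString "</ul>")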
An example of a concrete usage of parser combinators and their performance: OCaml's http/af library [1] is a web server that uses parser combinators to implement HTTP 1.1. As for the performance, it's in the ballpark of Go's net/http and Rust's hyper [2]. There's also h2 [3] that implements HTTP 2. They are based on Angstrom [4], which started as a direct port of attoparsec [5], cited in the article. It's great to see all the cross-pollination happening in that space!
Rust has a parser combinator library called Nom [^1] that is really a joy to use. I know a little Rust—enough to know what a lifetime is but not so much that I don’t have to keep referring to the documentation when building something. It took me just a few hours with the docs and examples to get a decent parser together. So nice to work with combinators!
Why are parser combinators not given any attention in traditional academic treatments of parsing? Is there some drawback compared to other popular methods? (And why are they relatively more popular on HN and in functional programming communities?)
For instance, in Grune and Jacobs's book on parsing techniques, they are presented as essentially a curiosity (with 2.5 pages of description in a 600+ page book).
From an academic point of view, most parser combinator libraries are basically LL parsers with backtracking. This is eminently practical (especially if the backtracking part is done well), but academically that means they're not very interesting: really just a nicer way of implementing recursive descent parsers.
There is of course plenty of academic work specifically on how to implement parser combinator libraries efficiently, how to make them invertible (so they can also produce prettyprinters), provide good error messages, etc.
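For concreteness, the correspondence with hand-written recursive descent is pretty direct. A sketch with parsec (the grammar is made up):

  import Text.Parsec
  import Text.Parsec.String (Parser)

  -- expr ::= term ('+' term)* ; term ::= digit+
  term :: Parser Int
  term = read <$> many1 digit

  expr :: Parser Int
  expr = do
    t  <- term
    ts <- many (char '+' *> term)
    pure (sum (t : ts))

  -- parse expr "" "1+2+3"  ==>  Right 6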
Most replies to this are making a category error. Parser combinators aren't a parsing technology, they're a library interface. You can use any parsing technology as a backend to an interface that is built in terms of combinators. That most of them are essentially recursive-descent is a result of common implementation decisions, not a requirement of the interface.
And that's why parsing research doesn't pay much attention to it - it's an interface, not a different approach to handling data while parsing. It's all about ergonomics of using a parser, not about the capability and performance characteristics of parsing.
Should more research pay attention to that? Probably. Bison/Yacc/ANTLR are all hell to use. Combinator-based libraries are easier, but current designs lack things like extending parsers without modifying them. There's a lot of research that could be done that would care about this aspect of parsers, but it doesn't seem to be happening for now.
They have exponential complexity in the general case; I guess that's the reason. There is a way to make them polynomial in the worst case, but it requires quite an ingenious memoization technique coupled with lazy evaluation.
Parser combinators are just a nicer syntax for implementing recursive descent parsing†. That's not enough academic content for the subset of academia focused on Algorithms and Data Structures. But the subset interested in Programming Languages and Types has plenty of people who care not about the algorithm (which is easy) but about the way it's written in a real programming language, leveraging type systems to make the code nice; that's where you'll find most of the academic discussion of it.
Also, higher-kinded types are pretty much required to make parser combinators nice, so that restricts the appeal of parser combinators to just Haskell and a few other niche languages.
Yet another reason is that, like all recursive descent parsing with backtracking, it's easy to accidentally make your parser run in exponential time. It certainly requires discipline and deliberation to avoid that (there's a small sketch of the failure mode after the footnote below).
†: I only know of one library in Haskell that has a parser-combinator-like interface without recursive descent parsing, and that's https://github.com/ollef/Earley but I'll quote from its README the difference between that and typical parser combinators:
> The grammar language is similar to that of many parser combinators (Parsec, Attoparsec, parallel parsing processes, etc.), providing an applicative interface, but the parser gracefully handles all finite CFGs, including those with left-recursion. On the other hand, its productions are not monadic meaning that it does not support context-sensitive or infinite grammars, which are supported by many parser combinator libraries.
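To illustrate the failure mode mentioned above, a small parsec sketch (toy grammar, made-up names): the backtracking version re-parses the shared prefix whenever the first alternative fails, and in a recursive grammar that duplicated work compounds at every level of nesting, which is where the exponential behaviour comes from.

  import Text.Parsec
  import Text.Parsec.String (Parser)

  -- backtracking version: on plain "let", the prefix is parsed, thrown away
  -- by 'try', then parsed all over again by the second alternative
  slowLet :: Parser String
  slowLet = try (string "let" *> spaces *> string "rec") <|> string "let"

  -- left-factored version: the shared "let" is consumed exactly once
  fastLet :: Parser String
  fastLet = string "let" *> option "let" (try (spaces *> string "rec"))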
There is the PEG paper [1], which is basically a formalisation of parser combinators.
But really, otherwise there is not much to say about parser combinators, as they are just code. The cool thing about parsing is that usually you can give much nicer descriptions of the grammar than via parser combinators, and you can analyse these descriptions (uniqueness, for example) and generate optimised code.
As people have said, they are simpler, less strict (which means you can get bad performance if you use them wrong), and just an abstraction over the very general concept of recursive context-dependent parsing.
There isn't much to cover about them; they are just very ergonomic. Even guaranteeing good performance is covered by the theory of other kinds of parsers, and it's trivial to replicate those guarantees with combinators, but that theory isn't really about the combinators themselves.
Since it is not mentioned in the article: Python users may also want to check out pyparsing [0]. It is slightly different from Parsec/FParsec (for instance, it ignores all whitespace by default), but I think it is a really good project.
Parser combinators in F# were where I really grasped the utility of a Monad. F# doesn't have HKTs, and computation expressions are a little different from Haskell's 'do'.
It was `(>>=) : Parser<'a> -> ('a -> Parser<'b>) -> Parser<'b>` that flicked the light switch in my head.
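In Haskell notation, a minimal hand-rolled parser makes that signature concrete (a sketch, not any particular library):

  import Control.Monad (ap)

  newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

  instance Functor Parser where
    fmap f (Parser p) = Parser (fmap (\(a, rest) -> (f a, rest)) . p)

  instance Applicative Parser where
    pure a = Parser (\s -> Just (a, s))
    (<*>)  = ap

  instance Monad Parser where
    -- run the first parser, hand its result to the continuation (which
    -- chooses the next parser), and run that on the leftover input
    Parser p >>= f = Parser $ \s ->
      case p s of
        Nothing        -> Nothing
        Just (a, rest) -> runParser (f a) rest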
In .NET I still find it weird that people sometimes implement the visitor pattern without the output type. You'd have to truly think imperatively to consider:
void Structure(IVisitor visitor);
a good function signature, instead of:
A Structure(IParser<A> parser);
I mean fair enough that you can't be bothered to implement the functor and monad methods, but it's downright silly to define a function that is incapable of returning output.
In .NET specifically, one reason I can think of for doing that is that you'd need to implement your visitor twice, as the type argument cannot be void (if I recall correctly). That said, defining
public sealed class Unit {
    // Singleton stand-in for 'void' so it can be used as a generic type argument.
    private static readonly Unit m_value = new Unit();
    public static Unit Value => m_value;
    public Unit() { }
    public override string ToString() => "()";
    public override bool Equals(object obj) => obj is Unit;
    // object.GetHashCode takes no arguments; every Unit hashes the same.
    public override int GetHashCode() => 0;
}
and then using it ("return Unit.Value") in place of void is really not that big a problem.
True, but I just never quite managed to understand why you'd want your visitor to return nothing. And if you did, why it was too much hassle to just return '0' or something.
Is there such a thing as a stochastic parser combinator? E.g. to do Viterbi/HMM decoding, etc.
Off-topic: I'd really like to get more and more Haskell features in Python. I've been playing with the adt library for algebraic data types, and I'd also like to find something to reproduce monads/applicatives/functors/arrows, etc.
Well, I think the Haskell abstractions chosen here are a bit unfortunate. I mean, it's nice to use Applicative and Alternative, but when you've got that
p <*> (q <|> r) = (p <*> q) <|> (p <*> r)
or at least that these are equivalent, then you might want to step back and wonder how on earth distributivity got into it. And after thinking for a few seconds and realizing that the parser which parses the empty string and the parser which parses nothing correspond to the neutral elements for <*> and <|> respectively, you might start to suspect some things.
Anyway, personally I reckon the proper space for dealing with these questions is to treat the space of parsers as a semiring. Admittedly, for parser combinators you break the commutativity of addition, but I can't find a better name than semiring right now.
Basically you can show that if you've got something that sends the space of strings to a semiring then you can turn it into a linear map on the semimodule formed by simply multiplying strings and values of the semiring in the obvious way. You then get a semiring of parsers by using composition and addition of linear maps.
Parser combinators as described in this article are essentially based on the semiring formed by functions, with composition as multiplication and f + g = f as addition (note: not commutative). This is wrapped in an 'Either' which also seems to have some semiring structure that I can't be bothered to analyse right now, though I noticed that <*> and <|> translated pretty neatly.
Anyway, all you have to do to get your Viterbi/HMM equivalent is to use the semiring formed by the real numbers, with 'max' as the addition operator and ordinary multiplication as the multiplication operator.
See also 'Kleene algebra' for more information in this direction.
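A minimal sketch of that idea in Haskell (the class and names are made up for illustration): choice maps to the semiring's addition and sequencing to its multiplication, and the Viterbi instance is just max and (*).

  class Semiring r where
    zero, one    :: r
    (.+.), (.*.) :: r -> r -> r   -- choice / sequencing

  newtype Viterbi = Viterbi Double deriving (Show, Eq, Ord)

  instance Semiring Viterbi where
    zero = Viterbi 0
    one  = Viterbi 1
    Viterbi a .+. Viterbi b = Viterbi (max a b)  -- pick the best alternative
    Viterbi a .*. Viterbi b = Viterbi (a * b)    -- combine probabilities along a path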
One aspect that I see discussed less often is the idea of 'builder combinators', which are kind of the inverse of parser combinators. There are multiple libraries in Haskell which make it very easy to write HTML/SVG/... this way:
https://hackage.haskell.org/package/blaze-html
https://hackage.haskell.org/package/blaze-svg
I haven't needed to do the same with SQL, but I imagine something like this already exists as well.