Hacker News
Regex Parser in C Using Continuation Passing (theiling.de)
79 points by tinkersleep on Dec 20, 2017 | 25 comments


> One drawback of the CPS technique in C is its stack usage, especially because it is quite hard to understand at all what is going on, which makes it very hard to predict how much stack is used. Usually, I get SIGSEGV from pointer misuses (typically NULL), but in debugging this program, instead I got an infinite recursion causing SIGSEGV a couple of times.

I think the key missing drawback is that even if you fix all of your logical bugs, then unless every function call is tail-call optimized by the compiler, your parser isn't quite safe for untrusted regular expressions. E.g., a user could write `(((((...a...)))))`, and if you don't guard against it with a hard limit on nesting depth, you're vulnerable to stack overflows. So let's say you implement nesting limits. What limit should be set? /shrug

This is why I write my regex parsers without recursion and instead use an explicit stack on the heap. That makes it easier to enforce nesting limits robustly.
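As a sketch of that approach (the function name and limit are my own, not taken from any particular engine), a pre-pass can walk the pattern with an explicitly heap-allocated stack and reject anything too deep before real parsing starts:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pre-pass: verify group nesting using an explicit
 * heap-allocated stack, rejecting patterns deeper than `limit`.
 * Escapes and character classes are handled so that \( and [)]
 * don't confuse the count; a real parser would record per-group
 * state in the stack slots instead of just positions. */
static int check_nesting(const char *pattern, size_t limit) {
    size_t *stack = malloc(limit * sizeof *stack);  /* explicit stack */
    size_t depth = 0;
    int in_class = 0;
    int ok = (stack != NULL);
    for (const char *p = pattern; *p && ok; p++) {
        char c = *p;
        if (c == '\\' && p[1]) { p++; continue; }   /* skip escaped char */
        if (in_class) {
            if (c == ']') in_class = 0;             /* ')' in [...] is literal */
        } else if (c == '[') {
            in_class = 1;
        } else if (c == '(') {
            if (depth == limit) ok = 0;             /* nesting limit hit */
            else stack[depth++] = (size_t)(p - pattern);
        } else if (c == ')') {
            if (depth == 0) ok = 0;                 /* unbalanced ')' */
            else depth--;
        }
    }
    free(stack);
    return ok && depth == 0;
}
```

A pattern that exceeds the limit is rejected cleanly instead of overflowing the C stack; the limit itself is still an arbitrary policy choice, as noted above.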


You are absolutely right!

But I'd like to stress that this is a problem of this implementation, not of the CPS technique as such: CPS is simply not taken to its extreme here, where every call would be a tail call eligible for optimisation, which would result in O(1) stack usage.

The problem with this implementation is that gcc cannot tail-call optimise every function call, because continuation results are evaluated in the middle of the computation (e.g. for all 'or' decisions). This means the compiler cannot discard the old stack frame.
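Concretely, the offending shape looks something like this (a hypothetical combinator for illustration, not the post's actual code): the first alternative's result steers control flow, so the caller's frame must stay live across that call.

```c
#include <assert.h>

typedef int (*cont_fn)(void *env);

/* Hypothetical 'or' of two alternatives: try a; only if it fails, try b.
 * The call a(env) is NOT in tail position (its result is inspected
 * afterwards), so the compiler must keep this frame alive across it.
 * The call b(env) IS a genuine tail call and can be optimised. */
static int match_or(cont_fn a, cont_fn b, void *env) {
    if (a(env))
        return 1;
    return b(env);
}

/* toy continuations for demonstration */
static int alt_fail(void *env) { (void)env; return 0; }
static int alt_ok(void *env)   { (void)env; return 1; }
```

Every 'or' node in the regex therefore pins one frame on the stack for the duration of the first alternative, which is exactly where the unbounded stack growth comes from.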

I should try harder and see what gcc can then do for me! :-)


I am quite skeptical that it is possible. Because of that, I would love for you to get in touch with me if you achieve it. :)

Also, I should have led with this: I very much enjoyed the blog post and perusing the code. Great work!


Thanks! It was fun tinkering.

For the regex problem, because of the required backtracking, data usage is definitely not O(1), so the continuation data cannot live on the stack in an O(1)-stack, tail-call-only CPS version of regex. That context data for the continuations will need to go to the heap, and so we are basically back to how you'd avoid recursion to avoid unbounded stack: by using data structures on the heap. And at that point, you could just use more straightforward data structures than those needed for CPS here.

Anyway, the one-phase parsing+matching is still interesting, I think, and it is worth a try to get it to work in O(1) stack by storing the continuation data on the heap. So after fixing SIGSEGVs for infinite recursions, I will go back to fixing memory leaks. :-)


gcc supports auto-generating stack overflow checks.

You can implement your own handler for a stack overflow, allowing allocation of additional stack space which can be chained to the original stack. This is how some language runtimes handle [or used to handle] segmented stacks that grow on demand.
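A sketch of the "graceful recovery" half of this (POSIX only; everything here is my own illustrative code, and actually growing the stack is considerably more involved): run the SIGSEGV handler on an alternate stack via sigaltstack, and jump back to a recovery point instead of crashing.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>

static sigjmp_buf recover;

static void on_segv(int sig) {
    (void)sig;
    siglongjmp(recover, 1);            /* unwind to the guard point */
}

/* Burn at least 4 KiB of stack per call; the volatile frame is also
 * used AFTER the recursive call, so the compiler cannot turn the
 * recursion into a loop. */
static long deep(long n) {
    volatile char frame[4096];
    frame[0] = frame[4095] = (char)n;
    if (n <= 0) return frame[0];
    long r = deep(n - 1);
    frame[1] = (char)r;
    return r + frame[1];
}

/* Returns 1 if `depth` levels of recursion overflowed the stack and
 * were caught gracefully, 0 if the recursion completed normally. */
static int run_guarded(long depth) {
    stack_t ss = { .ss_sp = malloc(SIGSTKSZ), .ss_size = SIGSTKSZ };
    sigaltstack(&ss, NULL);            /* handler needs its own stack */
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_segv;
    sa.sa_flags = SA_ONSTACK;          /* run handler on the alt stack */
    sigaction(SIGSEGV, &sa, NULL);
    if (sigsetjmp(recover, 1))
        return 1;                      /* arrived here via the handler */
    deep(depth);
    return 0;
}
```

This only turns the overflow into a reportable error; resuming or growing the stack, as segmented-stack runtimes did, requires much more machinery.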


> gcc supports auto-generating stack overflow checks.

I don't know what that means. I don't see how auto-generated code would fix my problem. The regex parser should return an error to the caller if the regex exceeds some limit.

Dynamically growing the stack sounds like a cute hack. I doubt I'd consider it superior to just using the heap though.


You can use that to make stack overflow a (relatively) graceful error instead of a segfault.


Parser(s)? As in more than one? Hopefully you'd only ever need to write one (e.g. for Rust), but practically zero, since regexes are a solved problem.


Yes. I wrote Rust's regex crate. The parser has gone through a couple rewrites (and another one is ongoing), so that's where the plural came from. :)


Regexes can't be used for parsing (e.g. the language of balanced parentheses).


I think your parent post referred to using a parser to parse regexes, not using regexes to parse another language.


Regular languages cannot. Many (most?) implementations of “regular expressions” can express more complex grammars.

(My sibling is also correct)


Why was this flagged?


I don't understand what he is talking about. A very simple one-pass recursive regex matcher is trivial to write. I wrote the one for asterisk in 2003. No fancy CPS needed.

http://asteriskguru.com/archives/asterisk-dev-better-pattern...


Hmm, I have yet to see a trivial implementation for full-featured regex syntax, which was my goal.

The idea was not to parse a simple dialect of regex, but a full-featured one, including things like \d, \), [)] and multi-level (), e.g. (a|(b|c\)d[e)f])*g|h)+. So the simple approach of scanning for the closing ) when you find a ( does not really work. Just parsing such regexes requires a recursive scanner, and then a recursive data structure to store the result of the parse, and this runs counter to matching at the same time.
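To make that concrete, here is a hedged sketch (my own code, not the article's) of what merely finding a group's matching ')' already requires once escapes and character classes exist; POSIX corner cases like a literal ']' first in a class are deliberately ignored:

```c
#include <assert.h>

/* Hypothetical helper: index of the ')' matching the '(' at
 * pattern[start], honouring escapes and character classes, or -1 if
 * there is none.  A naive scan for the next ')' fails on inputs
 * like "(a\)b)" and "([)]c)". */
static long find_close(const char *pattern, long start) {
    long depth = 0;
    int in_class = 0;
    for (long i = start; pattern[i]; i++) {
        char c = pattern[i];
        if (c == '\\' && pattern[i + 1]) { i++; continue; }  /* escape */
        if (in_class) {
            if (c == ']') in_class = 0;   /* ')' inside [...] is literal */
        } else if (c == '[') {
            in_class = 1;
        } else if (c == '(') {
            depth++;
        } else if (c == ')') {
            if (--depth == 0) return i;
        }
    }
    return -1;
}
```

Even this fragment needs state for escapes, classes, and nesting; and it only locates the group, without building any structure for matching.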

The linked code you provide assumes that you can first parse the whole pattern (like '(...)'), interpret it, and then recurse to match it. There do not seem to be test cases in your code for the complex regexes cited above. So I don't think the trivially recursive approach works with a full-featured standard regex grammar.


I see, thanks. But practically such a parser explodes, and should not really be used. A simple one is still linear, and much faster than the current regcomp/regexec overkill.


If I am reading your code correctly, I don't see full backtracking search. The CPS version should implement this (though I find their particular CPS style rather difficult to skim, so I'm not confident without running it).


It seems that "fancy CPS" is exactly what he is talking about. A simple regex matcher makes for a good example IMHO.


For dang or another mod: this was probably flagged by accident.


Do C compilers do tail calls through a pointer?

(How does that work if the target function has a separate entry point for accepting a tail call?)

If you pile up stack frames while matching a regex, you will blow the stack on long matches, e.g. a+ matching a string with thousands or millions of a's.

This could be made not to rely on tail call support by using a:

   typedef struct cont {
     struct cont (*fn)(void *data);  /* returns the next continuation */
     void *data;
   } cont_t;
which is returned from each function. A dispatch loop invokes the tail call by calling fn in the returned structure, passing it data. Needless to say, data can't be the address of some local variable in the function which returns this structure. Obviously, non-tail invocations of the continuations sometimes occur, and those have to be handled differently.
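A minimal trampoline along those lines (names hypothetical) can, for instance, sum 1..n in O(1) stack, with each step returning the next continuation instead of calling it:

```c
#include <assert.h>
#include <stddef.h>

struct cont;
typedef struct cont (*step_fn)(void *data);

typedef struct cont {
    step_fn fn;        /* NULL means: computation finished */
    void *data;
} cont_t;

typedef struct { long n; long long acc; } env_t;

/* One step: add n to the accumulator, then "tail call" ourselves by
 * returning the next continuation rather than invoking it. */
static cont_t sum_step(void *data) {
    env_t *e = data;
    cont_t next = { NULL, NULL };
    if (e->n > 0) {
        e->acc += e->n--;
        next.fn = sum_step;
        next.data = e;
    }
    return next;
}

/* Dispatch loop: O(1) stack no matter how long the chain runs. */
static void trampoline(cont_t c) {
    while (c.fn)
        c = c.fn(c.data);
}
```

Calling it with a million steps never deepens the C stack:

   env_t e = { 1000000, 0 };
   trampoline((cont_t){ sum_step, &e });   /* e.acc == 500000500000 */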


gcc generates 'jmp *%eax' (and other computed branch instructions) for this code, so yes, it uses tail-calls for function pointers.


Interesting! Does anyone know how much less efficient this is in practice?


Typically regex parsing isn't benchmarked thoroughly, so you probably won't get any solid answers to your query. The reason why it isn't benchmarked thoroughly is because it's generally fast enough, and is typically dwarfed by other aspects of the regex compilation phase. (Which in turn varies with the regex engine, and depends on how much analysis it does up front. Some do very little, some do a lot.)


This is both a regex parser and a regex matcher, and I believe fao_ meant the matching part.


This is very close to logic programming, even down to the use of the return codes for "exception" handling (you can re-re-interpret the "return 0 means exception" as meaning logical negation-as-failure). Compare the Castor C++ logic programming library, which defines logic relations as functions which return true if they bind variables or false if they failed.

The use of continuation passing style is also apropos: compare the LC++ logic programming library which uses CPS to implement lazy streams for query evaluation. In LC++ an "OR" relation is implemented by calling the first argument with the continuation, and then the second argument with the continuation. An "AND" is similar, except it chains the continuations.

My personal interest in this topic is that I believe it should be possible to use continuation passing style and callbacks together to make an evaluation engine which "reorganizes" the C-stack to aid in debugging. It would work like this: the user code would utilize CPS, but the runtime engine, not the user code, has final say over where the continuation actually jumps. By priming the linked-list of continuations + context data the right way, the runtime could "stack the deck" so that when the user functions run, it creates a "fake" callstack that looks like a certain series of operations ran, which in reality may have actually been run a long time ago or may be composed of pieces of different chains of computation. In this way, the engine could communicate to the programmer + debugger a semantically meaningful summary of an otherwise convoluted / too detailed original computation. Or restart a computation which had to be previously abandoned due to needing data that wasn't available yet. I call this technique "stacking the stack" (in analogy to stacking the deck to cheat at a card game).

Also, although your technique of calling the continuation twice, as in "return cont(a * a, data) && cont(a * a * a, data);", is perfectly valid (in fact, this code is also a syntactically valid Castor logic expression), here's another way it could be done: convert the nondeterministic choice into a deterministic one by lifting the information as to which branch is being executed into a variable that can be passed in. It would then look like:

   int foo(int *next_part, int a, cont_t cont, data_t *data)
   {
     switch ((*next_part)++)
     {
     case 0:  return cont(a * a, data);
     case 1:  return cont(a * a * a, data);
     default: return 0; /* fail when exhausted */
     }
   }

You're already doing this, sort of, in the regex engine when you do "switch (*p)". The difference is, there, you're choosing between alternatives based on a concrete property of the input character. If you like, think of the "next_part" variable as an intangible or out-of-band property of the current input. Once you do that, you could even turn all of your "switch (*p)" statements into "switch ((*next_part)++)" and simply _assume_ *p matched that case (we'll just fail inside the step if it didn't match, and backtrack). Congratulations, now you're doing logic programming!!
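As a toy illustration of that last step (all names hypothetical), a driver can backtrack simply by re-entering the function, which tries its next alternative on each call until one succeeds or they are exhausted:

```c
#include <assert.h>

typedef struct { int want; } data_t;
typedef int (*cont_t)(int result, data_t *data);

/* continuation: succeed only if the produced value is the one sought */
static int check(int result, data_t *data) {
    return result == data->want;
}

/* the nondeterministic choice above, made deterministic via next_part */
static int foo(int *next_part, int a, cont_t cont, data_t *data) {
    switch ((*next_part)++) {
    case 0:  return cont(a * a, data);
    case 1:  return cont(a * a * a, data);
    default: return 0;                 /* fail when exhausted */
    }
}

/* backtracking driver: re-enter foo until an alternative succeeds */
static int solve(int a, data_t *data) {
    int next_part = 0;
    while (next_part <= 2)
        if (foo(&next_part, a, check, data))
            return 1;
    return 0;
}
```

Here solve(3, ...) finds 27 via the second alternative after the first (9) fails, which is exactly negation-free logic-programming search in miniature.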





