Looking at where these macros are used, they do not need to compare m[1] to c1, because at every call site it is already known that m[1] == 'O' (hence "str3O" in the name -- that's 3 and the letter O, not the number zero).
I guess c1 is included in the first case because it's faster to compare the whole 32-bit word than to mask out the known byte (and since c0, c1, c2 and c3 are compile-time constants, the whole right-hand side of the comparison will be optimized by constant folding anyway).
(It's ridiculous micro-optimization in my opinion, but at least it is correct, and it is limited to a single file.)
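To make the two flavours concrete, here is a minimal sketch of the idea, not the actual nginx macros. The names STR4_WORD, cmp_word, cmp_skip_second and host_is_little_endian are made up for illustration, and the packed constant assumes a little-endian machine. Note that the 32-bit value is just the four ASCII codes packed into one integer; it has nothing to do with memory addresses.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the word-compare flavour of the macro.
   Assumes little-endian layout: c0 is the lowest-addressed byte.
   With constant arguments the compiler folds this to one literal. */
#define STR4_WORD(c0, c1, c2, c3)                                   \
    (((uint32_t) (c3) << 24) | ((uint32_t) (c2) << 16)              \
     | ((uint32_t) (c1) << 8) | (uint32_t) (c0))

/* Returns 1 when the host stores the least significant byte first. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Byte-wise flavour: skips m[1], which the caller already knows. */
static int cmp_skip_second(const unsigned char *m, unsigned char c0,
                           unsigned char c2, unsigned char c3)
{
    return m[0] == c0 && m[2] == c2 && m[3] == c3;
}

/* Word flavour: one 32-bit load and one compare, m[1] included. */
static int cmp_word(const unsigned char *m, uint32_t expected)
{
    uint32_t w;

    memcpy(&w, m, sizeof w);  /* sidesteps alignment and aliasing issues */
    return w == expected;
}
```

On a little-endian host, cmp_word(m, STR4_WORD('P', 'O', 'S', 'T')) and cmp_skip_second(m, 'P', 'S', 'T') agree whenever m[1] is already known to be 'O'. The sketch uses memcpy instead of a pointer cast purely to stay within defined behaviour; compilers typically turn it into a single load.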
Mmm, I would say it's definitely formatted beautifully. But I agree with the poster above in saying that it's error prone.
I'm a big Nginx fan, but I think it's good by just sheer force of will... if you had like 10 people working on it it would probably fall apart. Something like SQLite at least has extensive automated tests, which makes me much more confident about its quality.
I spend most of my time in dynamic languages so I have to ask... is this so much faster than a good regex library that it warrants a hand-rolled state machine? How normal is something like this in a typical C/C++ codebase?
A "good" regex library will dynamically generate code; and if the regex is simple enough, generate code that implements a DFA rather than NFA (or PDA, for Perl regexes). So for a "good" library, no, it won't be much faster.
But very few regex libraries are that "good", because it's a combination of extreme speed with very low flexibility (DFA has exponential state explosion in worst cases compared to equivalent NFA, and isn't able to deal with e.g. backreferences). The vast majority of good regex libraries will be an order of magnitude slower. Average regex libraries included in most language distributions will be slower again.
The chief exception is compiler lexer generators like lex and flex. They produce code very similar to the state machine linked. And that's probably the most common place to see this kind of thing.
Encoding the state of the machine implicitly as the program counter, rather than an explicit state variable, often results in more readable code and is the more usual way to do it when writing it by hand. It also saves a register, important on some architectures. But the technique has slightly more limited expressiveness owing to needing to stick with structured programming constructs.
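A toy illustration of the difference, assuming a made-up grammary of one or more lowercase letters followed by one or more digits (roughly the regex [a-z]+[0-9]+ anchored at both ends). The function names are hypothetical and this is only a sketch of the two styles, not code from any real parser.

```c
static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Explicit state variable: one loop, dispatch on the current state. */
enum state { S_START, S_LETTERS, S_DIGITS };

static int match_explicit(const char *s)
{
    enum state st = S_START;

    for (; *s != '\0'; s++) {
        switch (st) {
        case S_START:
            if (!is_lower(*s)) return 0;
            st = S_LETTERS;
            break;
        case S_LETTERS:
            if (is_lower(*s)) break;               /* stay in S_LETTERS */
            if (!is_digit(*s)) return 0;
            st = S_DIGITS;
            break;
        case S_DIGITS:
            if (!is_digit(*s)) return 0;
            break;
        }
    }
    return st == S_DIGITS;                          /* must end inside digits */
}

/* Implicit state: "where we are in the function" is the state. */
static int match_implicit(const char *s)
{
    if (!is_lower(*s)) return 0;
    while (is_lower(*s)) s++;                       /* letters phase */

    if (!is_digit(*s)) return 0;
    while (is_digit(*s)) s++;                       /* digits phase */

    return *s == '\0';
}
```

Both functions accept "abc123" and reject "abc", "123" and "a1b". The implicit version reads top to bottom like the grammar, but it cannot easily stop in the middle and resume when more input arrives, which is one thing the explicit state variable buys you.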
People (should) only implement things in C/C++ after thoroughly profiling the dynamic language POC and being sure they need the performance that a C implementation affords. So yes, the C implementation is going to sacrifice readability for performance. Of course, there are better and worse ways of accomplishing the same thing.
I understand that they're doing string comparisons (of a kind), but what are those values defined in 'uint32_t'? Is it a translation of values into memory addresses?