int parse_until(FILE *src, char *ptr, ssize_t maxlen, const char *s) {
    int out = 0;
    while (out < maxlen) {
        *ptr = fgetc(src);
        if (*ptr == EOF) { // hit error while scanning
            *ptr = 0;
            return ferror(src) ? -out : out;
        } else if (strchr(s, *ptr)) {
            *ptr = 0;
            return out;
        }
        ptr++; out++;
    }
    // we only make it here if we hit maxlen
    (*--ptr) = 0;
    int skipped = parse_skipwhile(src, s);
    if (skipped > 0) {
        return out + skipped; // errors are negative, eof is ok
    }
    return ferror(src) ? (skipped - out) : (out - skipped);
}
Hint: it relies on implementation-defined behavior such that it (mostly) works on x86 but not on ARM.
(The fact that it silently clips the string if it's overlong is annoying, but it's not what I was thinking).
So EOF is defined as -1. Depending on how the environment chooses to define char, one of two things can happen. On systems like x86 where char is signed, it's impossible to read a byte value of 0xff, as it is confused with EOF. If instead you're on a system like ARM where char is unsigned, then you can never read EOF.
What's with this style of code that seems to be trying to save on local variables? It writes directly through the output buffer's pointer, immediately re-reads the value back from there, and then performs the relevant logic (which may include un-writing the value from the buffer). Memory reads aren't that easily optimized away, and this style easily leads to logic errors just like this one.
Also, I think you got the "what will work" question backwards with signed/unsigned. MSVC on x86/x64 has signed char and so, e.g., "isupper(c = fgetc(f))" segfaults on reading a non-ASCII char; and similarly in this case reading "я" from a Win-1251-encoded file (or "ÿ" from a Win-1252-encoded file) will be treated as (premature) EOF.
I dunno about this style of code, it looks hard to follow :/
> No newlines in keys, values, or section names. Empty values are not allowed. Comments only on their own lines (minus whitespace). Whitespace-insensitive (whitespace at the start of line, end of line, around the “=”, is all ignored). No need for a terminating newline either.
Oh that's more than most C ini parsers do? Isn't that convenient
Nonsense. My ini parser has fewer restrictions (only one, on line length) and looks nothing like that unreadable mess.
Maybe unrelated, but I'm pretty certain I've seen gcc warnings when assigning ints to chars, unsigned or otherwise
<source>:8:18: warning: comparison is always false due to limited range of data type [-Wtype-limits]
8 | if (*ptr == EOF) {
Of course this warning only happens when you target ARM, so I can imagine it's still hard to catch if you do all your development on x86, and only occasionally cross-compile to ARM without heeding compiler warnings.
The bug is that it's missing abstractions. 1) Reading directly from a FILE. 2) Saving local state through a pointer (ptr). 3) A weird way to report back the out state; in the case of EOF it can't distinguish whether ferror happened or not.
4) Probably it would be better to not check for errors right away -- just share a common path with EOF everywhere, and check for I/O errors only at the very end.
5) What is the use of this function? Not sure what's the point, it seems unergonomic. Probably better just code one loop for e.g. names ([_a-z][_a-z0-9]*) and integers / floats etc.
6) What's with the strange (*--ptr), overwriting previous work. Another bug waiting to happen.
7) Why are we returning out + skipped? Returning the consumed characters? That could just be coded in the parser input abstraction (where you would also keep track of the current file offset / possibly line and column (but you can compute those when you need them)).
8) What about the ferror(src) ? (skipped - out) : (out - skipped)?
9) 3 locations where a zero-terminator is written, when there should be _at most_ 1 location, but probably none.
If the code didn't do 2), the bug you describe would probably never have happened. The return value from fgetc() is int, not char. It has to be larger than char to be able to return any char value as well as EOF.
I would probably code something along the lines of
void identifier(Parser *parser)
{
    My_String_Builder *builder = get_string_builder(parser);
    reset_string_builder(builder);
    int c = parser_get(parser);  // current byte, already known to start an identifier
    for (;;)
    {
        string_builder_push(builder, c);
        if (! next_byte(parser))
            break;
        c = parser_get(parser);
        if (!(is_alpha(c) || c == '_' || ...))
            break;
    }
    String *string = string_builder_finalize(builder);
    // check string builder "overflow" / too long / further input sanitization
    push_string_token(parser, TOKEN_IDENTIFIER, string);
}
If my C isn’t too rusty, that means the code can’t discriminate between hitting end of file and reading some byte value from the files (AFAIK EOF is -1 on most systems, so that value would often be 0xFF)
EOF being -1 means that if (*ptr == EOF) can never fire if *ptr can only take on the values 0..255 [i.e., char is unsigned] as opposed to -128..127 [i.e., char is signed]. Whether or not char is signed or unsigned is implementation-defined.
When fgetc succeeds, the int it returns has a value based on interpreting the byte as an unsigned char. On a platform with an 8-bit char, that means it’s in the range 0 to 255. This is required by the C spec regardless of whether char is signed or unsigned. Meanwhile, EOF is required to be negative. Thus, you can always distinguish the cases as long as you look at the original int return value rather than casting to char.
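A minimal sketch of that idiom (hypothetical read_until helper, not the article's function): keep the fgetc() result in an int, compare that to EOF, and only then narrow to char:

#include <stdio.h>

/* Sketch: copy bytes into buf until a delimiter or EOF. The fgetc()
 * result lives in an int, so EOF (negative) can never be confused with
 * the byte value 0xFF (which arrives as 255). */
static int read_until(FILE *src, char *buf, size_t maxlen, int delim) {
    if (maxlen == 0)
        return 0;
    size_t n = 0;
    while (n + 1 < maxlen) {
        int c = fgetc(src);          /* int, not char */
        if (c == EOF || c == delim)
            break;
        buf[n++] = (char)c;          /* narrow only after the comparison */
    }
    buf[n] = '\0';
    return (int)n;
}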
Fun fact: This approach causes trouble on obscure embedded platforms where char and int are the same size (and therefore an unsigned char value can’t fit inside a signed int). Such platforms are allowed by the C standard as freestanding implementations that don’t implement the full standard library, but they can’t conformantly implement fgetc. https://stackoverflow.com/questions/3860943/can-sizeofint-ev...
Every byte in a file, as represented in an int, takes on a value of 0..255. fgetc doesn't return a char, it returns an int, which means it returns a value of -1..255 (i.e., 257 possible values). If you try to represent the return value of fgetc as a char, two of those values get the same representation, namely 255 and -1. The difference between ARM and x86 is that on ARM (unsigned char), the -1-as-255 is represented as 255 when cast back to an int for comparison, whereas on x86, the 255 when cast back to an int is -1, so it would return the same value as EOF (although it would also do so had the value in the file originally been -1).
Representing binary data as unsigned char (as opposed to char or signed char) is the norm however.
This (distinguishing -1 from a character value when char is unsigned) is one of the reasons character literals in C have type int instead of char (C++ changed them to char.) Note the return type of getchar().
One thing I'm seeing: fgetc returns an int, so on a big-endian architecture the `*ptr = fgetc(src);` might yield even more interesting results.
Then, that fgetc() returns EOF on either an EOF or error is its major weakness — mixing in data and control all in-band — one is supposed to check with feof() and ferror() which of those two have happened, and if neither is true, then it's a data byte that just happens to be equal to the EOF constant. (This makes it quite pessimal when you are reading in a file that happens to contain lots of bytes equal to EOF, though).
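In code, the check being described looks roughly like this (a sketch; it only becomes necessary once the result has already been narrowed into a char):

#include <stdio.h>

/* Sketch: once fgetc()'s result has been stored in a char, the value
 * alone can't tell you what happened; you have to ask the stream itself. */
static int classify(FILE *src) {
    char c = (char)fgetc(src);
    if (c == (char)EOF) {
        if (ferror(src))
            return -2;   /* read error */
        if (feof(src))
            return -1;   /* genuine end of file */
        /* neither: the byte just happens to equal (char)EOF, i.e. 0xFF */
    }
    return (unsigned char)c;
}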
> If instead you're on a system like ARM where char is unsigned, then you can never read EOF.
Then you should probably use an intermediate variable which is a proper int, gives you a surefire way to tell 0x000000ff and 0xffffffff apart.
But it's all gotchas like this which make me cringe every time someone suggests C is oh so very good and "simple", or even (gasp!) "convenient" to work with strings in general and text in particular. It's anything but. Hell, maybe it was better than alternatives back in some 1976, but then awk and Perl got invented, and Python followed soon after.
char cannot legally represent EOF (as you point out). This doesn't really have anything to do with ARM, though; for instance, the ARM compiler I use has command-line options and pragmas that control whether char is signed or unsigned. Char is just too small.
The type of 'out' (int) is different from the ssize_t of maxlen, which may not matter for the practical ranges involved in parsing a file, but it's certainly a bad sign.
Writes to memory it doesn't own if maxlen == 0 (unlikely, but who knows?)
Additionally: calling strchr on every input character is crazy. Frankly, the same applies to fgetc.
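One common alternative (a sketch, not from the article) is to precompute a 256-entry membership table for the delimiter set, so the per-character test is a single array index:

#include <string.h>

/* Build the table once per delimiter set... */
static void build_delim_table(const char *delims, unsigned char table[256]) {
    memset(table, 0, 256);
    for (; *delims; delims++)
        table[(unsigned char)*delims] = 1;
}

/* ...then inside the hot loop:
 *     if (table[(unsigned char)c]) { ... found a delimiter ... }
 * Similarly, fgetc() per byte can be replaced by fread() into a buffer. */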
The retval is an int. It only gets treated as a char if it isn't EOF. The caller is supposed to know this and do the right thing rather than blindly downcasting. This is more apparent with the wide char library where there is a special wint_t with the sole purpose of serving as a return value that represents a wchar_t or WEOF.
AFAICS this is actually potential UB on most implementations: not only does fgetc() return -1 as an int in the failure case, it returns the byte as unsigned char in the success case. So on an implementation that defines plain char to have the same size and representation as signed char (as x86 does), reading a byte beyond SCHAR_MAX causes UB through signed overflow on the first line of the loop.
(This is a close relative of the well-known footgun that the is*() functions from <ctype.h> accept an unsigned char value as an int, and it’s completely valid—though rare in practice—for them to blow up when passed, say, "\xFF"[0] instead of ((const unsigned char *)"\xFF")[0] on an implementation with CHAR_BIT==8 and SCHAR_MIN<0.)
Overflow from signed arithmetic always results in UB. But an unsigned-to-signed conversion outside the range of the target type just results in implementation-defined behavior; see C17 §6.3.1.3 ("Signed and unsigned integers"). And most people these days (including me) take the simple approach of not supporting any implementation that doesn't just perform two's-complement wrapping for all conversions.
> And most people these days (including me) take the simple approach of not supporting any implementation that doesn't just perform two's-complement wrapping for all conversions.
Not quite; C23 requires a two's-complement representation, but it doesn't require two's-complement wrapping for unsigned-to-signed conversions. §6.3.1.3 hasn't been changed, so it's still left to the implementation to guarantee that. The only easily-visible effects of requiring a two's-complement representation are that X_MIN == -X_MAX - 1 for all signed integer types, and in general that the object representations of x and -x - 1 differ by only 1 bit, if the integer type has no padding bits.
Ah, so the “two’s complement from outer space” option (with X_MIN == -X_MAX and a trap representation in place of -X_MAX - 1) is also finally gone? I did not notice that, thanks.
C18 6.2.6.2p3 had:
> [It] is implementation-defined [...] whether the value with sign bit 1 and all value bits zero [...] is a trap representation or a normal value [for two’s complement].
GPUs notionally - I can't speak for truly "modern" GPUs but I recall older GPUs having language level "char"/"byte" implemented as a float (presumably with some minimal support to get expected semantics like clamping?)
Another bug: if `maxlen <= 0`, then it will overwrite the character before the start of `ptr`. And then there's the infinite loop if `maxlen > INT_MAX` on 64-bit machines. And maybe `out + skipped` can wrap as well?
One more: if an error occurs while reading the first character, the function returns -out = 0, which violates the documented behavior that errors are negative, and means a caller can’t distinguish between a failing to read and reading an empty file.
maxlen is meant to be a constant provided by a #define, so I wouldn't be worried about things going haywire if maxlen is a weird value.
But C-like code in general makes me nervous in parsers, since there's many ways things can go wrong and C is inherently fail-deadly if you make a mistake.
Years ago I ported some software from IRIX to some Linux-based OS. There were a few differences in the result. I eventually tracked it down to how on IRIX char is unsigned while the Linux-based OS used signed char.
The program used a 'char' to store the value of the formal atomic charge of an atom, which is typically 0, but for the sorts of chemistry I deal with can be +1, +2, -1, or -2, so it makes sense to allocate only 8 bits to store the 'char'ge. :)
On IRIX, where they had deployed the code for years, the charges were actually being interpreted as 0, +1, +2, 255, and 254.
Going back to the code example you gave, isn't there also an issue if read failure occurs on the first byte? Then `out` is 0, returning -0, which is 0, giving no way for the caller to distinguish between EOF and read failure.
That's a simple case, I've seen much worse because of endianess and different archs have different conventions for whether char is signed or unsigned by default (common these days with arm/X86). Any C programmer who's coded for X86 and ARM will know what the rules are and how not to break them (I'd like to hope, anyway).
I almost got it. I was like “whoa wait, does not fgetc return int? You truncate it to char, can you even compare it to EOF constant now?”. I did not remember the actual EOF constant or that char is unsigned on ARM, but I think I would pay attention to the compiler throwing a warning at me (it does throw a warning, does it not?)
And rightfully so. People who aren't afraid of them generally fail to understand all of the ways in which parsing can show fractal complexity, and will mostly stick to toy examples like this INI parser to justify their positions.
If you're gonna argue that parsing is simple, the bare minimum I'd want to see implemented is a context-sensitive grammar with unbounded lookaheads (or at the very least, that is capable of handling more than one token of lookahead), with proper support for Unicode, and actual error resilience (not what this article calls error resilience)
If you manage to do all that and can still call what you did "simple" without having completely deluded yourself, congratulations, I hope to be on your level some day.
PS1: I won't even go into the plethora of security issues originating from crappy parsers, especially those written in C
PS2: Let's also leave aside any matters related to correctness and validation of parsers, which are notoriously not by any means "simple".
UTF-8 was designed so that you don't have to worry about it. Supporting UTF-8 in a parser is trivial, basically just parse as if it were ASCII but don't barf on the bytes >= 128.
As long as all your delimiter chars are ASCII, it just works.
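In a byte-oriented lexer that boils down to something like this (a sketch; it assumes identifiers may contain arbitrary non-ASCII bytes):

#include <ctype.h>

/* Any byte with the high bit set is part of a multi-byte UTF-8 sequence,
 * so just treat it as an ordinary "word" byte; all delimiters stay ASCII. */
static int is_word_byte(unsigned char b) {
    return b >= 0x80 || isalnum(b) || b == '_';
}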
Errors in C are usually because of missing abstractions or the wrong approach. C gives you data layout, flow control, and functions, you can go a long long way with just that.
> unbounded lookaheads
If you want to require that, you get what you deserve. But implementing it is just a matter of putting a queue of tokens in front of your parser that supports look(n) separately from consume().
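A rough sketch of that queue in C (hypothetical Token type and next_token() lexer function; swapping the fixed ring for a growable array would make the lookahead truly unbounded):

#include <stddef.h>

typedef struct { int kind; /* ... payload ... */ } Token;

Token next_token(void);                    /* assumed: provided by the lexer */

enum { LOOKAHEAD_CAP = 64 };               /* fixed cap for the sketch */

typedef struct {
    Token  buf[LOOKAHEAD_CAP];             /* ring buffer of pending tokens */
    size_t head, count;
} TokenQueue;

/* look(n): peek n tokens ahead without consuming them (n < LOOKAHEAD_CAP). */
static Token look(TokenQueue *q, size_t n) {
    while (q->count <= n)
        q->buf[(q->head + q->count++) % LOOKAHEAD_CAP] = next_token();
    return q->buf[(q->head + n) % LOOKAHEAD_CAP];
}

/* consume(): commit to the front token and drop it from the queue. */
static void consume(TokenQueue *q) {
    (void)look(q, 0);                      /* ensure there is one to drop */
    q->head = (q->head + 1) % LOOKAHEAD_CAP;
    q->count--;
}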
I see this type of sentiment a lot and I'm not sure why this exists. Maybe it's because there were a bunch of formats in the past and it made it more difficult? Idk.
Anyways, I finally decided to "bite the bullet" and prepared a solid week to finally do the "nitty gritty" of writing a UTF-8 validator/logging library. Turns out, it was super easy and took me like an hour to read through the RFC and maybe 2 more hours to write a simple implementation.
For anyone that's curious, give it a read here[0], it's surprisingly readable and the format is very simple and elegant. I don't say simple as in dumb either, I say simple as in they made the problem as simple as it needs to be with no unneeded complexity, and it's a breath of fresh air.
Also, it's written in such a way that any valid ASCII is valid UTF-8. So at the very least, you can just check if you encounter any bytes with the highest bit set in the string before parsing. If that's the case you can throw an error saying you don't support UTF-8 and avoid parsing potentially invalid data (not that it's particularly difficult to validate the UTF-8 if you want to).
Depends on where you draw the line of what a parser is:
- If the parser is "the thing that comes after the lexer" then all of this is abstracted away by the lexer and you can just treat it as a span of bytes;
- If the parser is "everything that needs to be implemented to correctly transduce the input sequence into a tree", then you need to implement this yourself or have a lexer that handles this for you, usually done by having a tiny UTF-8 codepoint recognizing FSM in your lexer (UTF-8 is a self-synchronizing code, which makes this part easier) and ignoring the existence of graphemes.
Most people, however, shy away from implementing a parser "all the way down to the bytes" and properly handling UTF-8 as a formal language. Most lean on a lexer abstracting this away. Ditto for context-sensitivity.
Recently Rust's regex engine underwent a major overhaul, and burntsushi wrote a blog post[0] about doing the "all the way to the bytes" thing in the new regex engine, I highly recommend the read:
Yeah I'm saying why does your lexer actually need to be UTF-8 aware? (An actual question, because maybe I'm not thinking of some obvious case.)
Most of the lexical/syntactic elements of languages are not in UTF-8. You're looking for things like semicolons and quotes and whitespace. If you don't change the language syntax/lexical elements so that those parts stay as the ASCII subset of UTF-8 then why does your lexer need to be aware of UTF-8? It can just accumulate everything else as bytes and it doesn't matter what format the bytes are. The parser and/or codegen will do equality checks for lookups later on but that doesn't need to be UTF-8 aware either?
If you don't want to error/warn on invalid UTF-8 but instead handle it with the "garbage in, garbage out" principle, then yes you're right, treating them pure byte streams works.
Yeah that makes sense. It doesn't really strike as the job of the compiler/parser to validate UTF-8. If you've got a messed up text editor/OS environment that's going to be a problem for lots of things.
What dezgeg said is pretty much spot on, and also I think what you're describing related to "compiling the codepoints down to bytes" is in many ways equivalent to handling the UTF-8.
My opinion, stated in a way that a TigerBeetler will resonate with ;), is that I want to be able to handle radioactive levels of corruption in my inputs, and still parse them without blowing up, issuing great error messages along the way.
Ok, one reason I can think of why you'd want to be UTF-8 aware is so that your error messages at any part of the parser could point to the exact column in the line of text. The line number you could get without being UTF-8 aware. But the column number you couldn't get without being UTF-8 aware.
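And that part is only a few lines (a sketch; it counts codepoints, not grapheme clusters or display width):

#include <stddef.h>

/* Column of `pos` within the line starting at `line`, counted in codepoints:
 * UTF-8 continuation bytes (0b10xxxxxx) never start a new column. */
static size_t column_of(const char *line, const char *pos) {
    size_t col = 1;
    for (const char *p = line; p < pos; p++)
        if (((unsigned char)*p & 0xC0) != 0x80)
            col++;
    return col;
}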
Yes whitespace in unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n which are expressible in ASCII.
Maybe I'm wrong though, just an assumption about what's common.
There are also other concerns depending on your threat model: if you're parsing user-generated strings you definitely want to be able to handle corrupted unicode, for security reasons, and in these scenarios the way you handle recovery if you choose to do so may aggravate exploitation.
Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?
So many ways to shoot yourself in the foot by "abstracting away" the formal semantics of your inputs, I think it's pretty much never worth it. (An interesting search term here is LangSec)
> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?
Maybe I'm misunderstanding you, but because of how UTF-8 is a superset of ASCII, I don't believe you can misrecognize ASCII characters if that's what you mean.
You are correct that this is a detectable and entirely preventable failure, however, this is the way in which this can manifest:
- UTF-8 is a prefix-free self-synchronizing code;
- If the first byte of a UTF-8 codepoint starts with 0b0??????? then it is ASCII, and all is well;
- If the leading byte of the codepoint is 0b110????? it means there is one continuation byte to follow. If it's 0b1110???? there are two to follow, and so on up to a maximum of three continuation bytes (four bytes total), which is the limit for UTF-8;
- All continuation bytes have the pattern 0b10?????? and UTF-8 self-synchronizes based on detecting the leading byte;
- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state-machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);
- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker makes the last codepoint one that expects a single continuation byte but does not include the continuation byte, the fancy pantsy "optimized" parser will skip over the closing quote and decohere the parse. This is only safe to do on pre-validated input, but even then it's kind of not worth it if you have access to a SIMD accelerated UTF-8 validator
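For what it's worth, the byte-at-a-time structural check is short (a sketch: it catches stray or missing continuation bytes and truncated sequences, but a full validator, like Lemire's SIMD ones, must also reject overlong encodings, surrogates, and codepoints above U+10FFFF):

#include <stddef.h>

/* Sketch: structural UTF-8 check. Returns 1 if every sequence has a valid
 * leading byte and the right number of 0b10xxxxxx continuation bytes, and
 * nothing is truncated at the end of the buffer. */
static int utf8_structurally_valid(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i++];
        size_t follow;
        if      (b < 0x80)           follow = 0; /* 0xxxxxxx: ASCII        */
        else if ((b & 0xE0) == 0xC0) follow = 1; /* 110xxxxx               */
        else if ((b & 0xF0) == 0xE0) follow = 2; /* 1110xxxx               */
        else if ((b & 0xF8) == 0xF0) follow = 3; /* 11110xxx               */
        else                         return 0;   /* stray continuation or
                                                    invalid leading byte   */
        if (follow > len - i)        return 0;   /* truncated sequence     */
        while (follow--)
            if ((buf[i++] & 0xC0) != 0x80)       /* must be 10xxxxxx       */
                return 0;
    }
    return 1;
}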
C programs start out in the "C" locale, so just using fgetwc() won't work out of the box (or won't do what you expect it to do). You'll need to call setlocale(LC_ALL, "") to get the expected behavior.
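Roughly (a sketch):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");         /* adopt the environment's locale       */
    wint_t wc;                     /* wint_t, for the same reason fgetc()
                                      returns int: it has to hold WEOF too */
    while ((wc = fgetwc(stdin)) != WEOF) {
        /* wc is a decoded wide character */
    }
    return 0;
}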
Agreed on all counts, except with the remark that even with sane grammars unbounded lookahead will still appear if you want to have IDE-grade error resilience.
But I wholeheartedly agree with the sentiment of "don't make the grammar look like Scala" <3
The “IDE-grade” error resilience can be approached in many ways, and unbounded lookahead is often unnecessary. The main problem you want to solve is the problem of recovering from an error, and parsing more of the file correctly after you encounter an error. One way you can do this is by finding statement or declaration boundaries, which can be done in surprisingly simple ways.
I mostly agree with this, but what I mean with the unbounded lookahead part of it is that bounding the amount of speculation (or lookaheads/backtracking) is equivalent to limiting the "size" of the error you can recover from.
You should definitely have bounds though, but the point is that if it's too low you might give up on the input too soon.
Sometimes you just haven't met the right abstraction yet. I'll be that guy talking about parser combinators hopefully before the rest of this thread fills up with them. I don't think my parser does everything in your bare minimum (I haven't really thought about utf-8!) but it does do some other pretty advanced stuff. For example it leans on white-space pretty hard to figure things out. No curly braces or semicolons, and parentheses are only for precedence, not function application.
What my parser does do:
* backtracking/alternatives
* Some context-sensitivity, in-so-far as it can tell a negate from a minus.
Where it got a little hard:
* I realised I was parsing division the wrong way. a/b/c/d became a/(b/(c/d)), not the other way around.
Where it got medium hard:
* Distinguishing unary minus from binary minus. I thought it would be really hard, but I only needed to look at the previous token to decide whether something was a TokNegate or a TokMinus.
Where it got hard:
* White-space/indentation sensitivity. I needed to first calculate the line-breaks and make that information (gathered during lexing) available during parsing.
Where it got really hard:
* LEARNING how to factor out the left-recursion. There were times when I literally thought it was impossible. I knew about 'precedence' in the back of my mind, but I didn't realise how the concept mapped to the code yet. By example: one sumExpression is (many or one productExpressions separated-by-'+'), and one productExpression is (many or one unaryExpression separated-by-'*'), and so on. You don't end up in an infinite-parse-loop if you try to parse the least-tightly-binding expressions first (which just seemed so counterintuitive that I guess I never tried?).
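The same layering in C-flavoured recursive descent looks roughly like this (a sketch with hypothetical peek()/advance()/parse_unary() helpers; each level parses the next-tighter one and then loops over its own operators, which also gives you the left-associative a/b/c == (a/b)/c for free):

int  peek(void);            /* assumed: next token kind, without consuming */
void advance(void);         /* assumed: consume it                         */
long parse_unary(void);     /* assumed: literals, unary minus, parens      */

/* product := unary (('*' | '/') unary)*  -- tighter binding */
static long parse_product(void) {
    long v = parse_unary();
    while (peek() == '*' || peek() == '/') {
        int op = peek();
        advance();
        long rhs = parse_unary();
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

/* sum := product (('+' | '-') product)*  -- loosest binding, parsed first */
static long parse_sum(void) {
    long v = parse_product();
    while (peek() == '+' || peek() == '-') {
        int op = peek();
        advance();
        long rhs = parse_product();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}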
But I've yet to say why I like parser combinators so much (and think they're at least the 'simplest' way to do things, if not 'simple'):
You get to write code which looks like the BNF definition!
Just like TFA, I'll take a Lua example[1]
var ::= Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name
I would code this something like:
var = name <|> case2 <|> case3
  where
    case2 = do
      pe <- prefixexp
      e  <- char '[' *> exp <* char ']'
      return (pe, e)
    case3 = do
      pe <- prefixexp
      _  <- char '.'
      n  <- name
      return (pe, n)
It more or less maps exactly onto the BNF, and in the above case, the extra complexity came from capturing the subexpressions and returning them to the caller. If I wrote a grammar to simply accept/deny its input (rather than trying to build an AST out of it), it could resemble the BNF even more:
Bnf definition vs. executable code:
var ::= Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name
var = name <|> (prefixexp >> char '[' >> exp >> char ']') <|> (prefixexp >> char '.' >> name)
I will say one other thing about the simplicity, which is - I didn't use an existing parser combinator library. They're simple enough to roll your own. There's only one trap which I can think of, which is where to draw the line on automatic-backtracking. I.e. Should the caller explicitly need to insert 'try's to enable backtracking.
> Cute, now do it with UTF-8 support.
Ironically I think this is the one feature where I'd prefer to be in C. C's approach with bytes is perfectly forward-compatible. A higher-level language might be more opinionated about its String type (restricting what you can or can't accept with your parser) or have funny definitions about length().
But this implementation of a reduced feature set version of .ini file parsing does not convince me that I should write my own parser instead of using one that implements a more full feature set
> No newlines in keys, values, or section names. Empty values are not allowed. Comments only on their own lines (minus whitespace). Whitespace-insensitive (whitespace at the start of line, end of line, around the “=”, is all ignored). No need for a terminating newline either.
I think it's reasonable that people want to use a parser that has better error handling and gives an idea of where the ini file may have parsing problems than just a barebones implementation such as this provides.
I also think that using a library for parsing instead of writing your own parser does not imply that you are scared of parsing.
> But this implementation of a reduced feature set version of .ini file parsing does not convince me that I should write my own parser instead of using one that implements a more full feature set
A performance comparison on a large ini file might.
You may want to take the time to watch this video[0]. In it, Andreas Fredriksson walks through his reasoning for writing his own parser instead of using the standard json parser.
I don't think I've ever seen ungetc() used in production code.
I appreciate the "do it yourself" thing - but in 2023 unless you're building a product for a known ascii system, you're setting yourself up for pain when your code is run in San José,
ungetc() exists for exactly this application. But when you want a more abstract input stream than FILE* you have to reimplement that yourself at a higher layer. Ironically, even the libc implementation needs a more generic abstraction to implement scanf() and sscanf() (i.e. you either have two separate implementations of these or don't internally use ungetc()).
This looks interesting. I created and maintain a library for INI parsing that got surprisingly popular -- it's tiny, so is good for embedded systems. This API has a very similar feel to mine, including the callback for every key/value pair with a void* userdata. https://github.com/benhoyt/inih
I don't really understand this. I'm comfortable enough programming in C, but I find it the least suited to parsing compared to pretty much every other higher-level language. In particular, I find parser combinators (be it in lisps, MLs, Haskell) to be infinitely more approachable and mentally manageable than anything easily doable in C.
(Yes, you can do parser combinators in C, but it's very very ugly unless you stick to clang and enable block support).
If anything I'd expect to be told to use a high(er) level language for parsing as a front end to C code if needed.
If you just jumped to the code, it's in the first 2.5 sentences:
> People are terrified of parsers and parsing. To the point of using magical libraries with custom syntaxes to learn just to get started. In the hopes of completely shattering this preconception
Seems mostly to me about making parsing less "magical" for people who don't understand what's really going on.
It doesn’t; it’s a convention that the programmer is expected to honor. You could pass in a different function pointer instead (or something that isn’t a function pointer at all), and the behavior would be undefined.
Total hogwash. It's extremely easy to write parsers in a safe way. I've done a few and can do it in my sleep. Reading chars one by one in the lexer, reading tokens one by one in the parser. Pushing stuff on stacks. The "dangerous" parts can be reduced to a dozen lines.
Coding in C doesn't mean that you can't code function abstractions. Instead of storing things through pointers with pointer arithmetic and indexing all over the place, and doing manual bounds checks everywhere, you code a few functions like push_char(), push_token() etc.
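For instance, a hedged sketch of what one of those helpers might look like (names made up):

#include <stddef.h>

typedef struct {
    char  *data;
    size_t len, cap;
    int    overflowed;   /* sticky error flag, checked once at the end */
} CharBuf;

/* The bounds check lives in exactly one place, instead of being repeated
 * (and occasionally forgotten) at every call site in the lexer. */
static void push_char(CharBuf *b, int c) {
    if (b->len + 1 < b->cap)
        b->data[b->len++] = (char)c;
    else
        b->overflowed = 1;
}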
True in theory, that is why given my Turbo Basic, Turbo Pascal, Modula-2 background, when I code in C, I do exactly that.
I follow a pseudo-ADT (Abstract Data Type) approach, with the classic modulename_function() pattern; everything that shouldn't be exposed is marked static in the implementation file, and for the few cases where using ADTs has a possible performance impact, there is one macro or another.
However, that is not how most C developers program, regardless of how many books, ACCU and The C Programmers Journal articles, and conference talks on how to write proper C have been written.
The difficulty of implementing parsers (for the majority of programmers or engineers) is one of the tragedies of computer science. It is a very well developed theory and apparently it is too hard to deliver in a usable form without becoming one of those "to bake an apple pie you need to invent the universe" experiences. I just wanted a pie! Like the html-ish parser in Graphviz:
The trick for “lets hand write a parser, it is simple” is in coming up with grammar that is either LL(1) or slight superset of that that can still be parsed by recursive descent with hand rolled deconflicting rules (prime example of that is if-else in C). The issue with that is you cannot just take an arbitrary context-free grammar with attached semantics and mechanically transform it into LL(1) form, because while doing that you are moving production rules around and thus you lose any kind of semantics that were originally attached to them. This is the reason why the parser generators tend to be (LA)LR, you can produce such parser for wide (albeit in practice ill-defined) subset of context-free grammars purely mechanically without having to change the meaning of the attached semantic rules (typically represented as a bunch of copied code).
The thing is that parsing is a solved problem and has been for decades. But because this sounds like you must be incompetent if you can't do it, people get very defensive if you say it's easy. Everyone who believes that parsing is terrifying is highly invested in that belief. And it does take time to learn how parsers work. Write a parser, a parser generator, a regex engine, and then tell everyone else "it's easy" and see how that goes.
> I will write a parser for the “ini” file format in about 150 lines of pure and readable ISO C99
Sounds like overkill. Most often you don't need a full-fledged ini format, but just a list of "KEY=value" pairs that can be parsed with a single call to scanf.
scanf() is probably not the right approach to that, but still you can implement a parser for the common INI syntax in about a third of the code with fgets(), strchr()/str(c)spn and a bunch of hand-rolled ad-hoc logic.
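Something shaped roughly like this (a sketch; whitespace trimming, [section] headers, and ';' comments left out, line length capped):

#include <stdio.h>
#include <string.h>

/* Sketch: read "KEY=value" lines with fgets() and split on the first '='. */
static void parse_lines(FILE *f,
                        void (*emit)(const char *key, const char *value)) {
    char line[256];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';   /* drop the newline          */
        char *eq = strchr(line, '=');
        if (!eq)
            continue;                         /* not a key=value line      */
        *eq = '\0';                           /* split in place            */
        emit(line, eq + 1);
    }
}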
Agreed in principle - we shouldn't be afraid of token streams. Thinking in terms of JSON:
1. Its portable - JSON token streams look pretty similar in every language.
2. You're in control. Switching parsing libraries is painful when it's baked into your code (I'm currently weeding out a now unsupported parsing lib and it's painful).
3. It's flexible - try parsing heterogeneous websocket JSON streams with e.g. Swift's Codable. Possible: yes. Easy: no.
4. It's really fast.
If you're in control of the source and sink, the above probably doesn't apply - then it's trivial to make struct-based parsing fast and easy.
Ignoring the numerous correctness issues people have pointed out, parsing .ini simply isn't interesting as a thing to parse. In the sense that more or less any real data/source that you have to parse is more complex and structured than an ini, so any "lesson" from parsing ini is inapplicable to anything else.
"Any more or less real programming language is more complex than an arithmetic expression language with only four operations so any 'lessons' from such a simple example is inapplicable to anything else."