int parse_until(FILE *src, char *ptr, ssize_t maxlen, const char *s) {
    int out = 0;
    while (out < maxlen) {
        *ptr = fgetc(src);
        if (*ptr == EOF) { // hit error while scanning
            *ptr = 0;
            return ferror(src) ? -out : out;
        } else if (strchr(s, *ptr)) {
            *ptr = 0;
            return out;
        }
        ptr++; out++;
    }
    // we only make it here if we hit maxlen
    (*--ptr) = 0;
    int skipped = parse_skipwhile(src, s);
    if (skipped > 0) {
        return out + skipped; // errors are negative, eof is ok
    }
    return ferror(src) ? (skipped - out) : (out - skipped);
}
Hint: it relies on implementation-defined behavior such that it (mostly) works on x86 but not on ARM.
(The fact that it silently clips the string if it's overlong is annoying, but it's not what I was thinking).
So EOF is defined as -1. Depending on how the environment chooses to define char, one of two things can happen. On systems like x86 where char is signed, it's impossible to read a byte value of 0xff, as it is confused with EOF. If instead you're on a system like ARM where char is unsigned, then you can never read EOF.
What's with this style of code that seems to be trying to save on local variables? It writes directly through the output buffer's pointer, immediately re-reads the value back from there, and then performs the relevant logic (which may include un-writing the value from the buffer). Memory reads aren't that easily optimized away, and this style easily leads to logic errors just like this one.
Also, I think you got the "what will work" question backwards with signed/unsigned. MSVC on x86/x64 has signed char and so, e.g., "isupper(c = fgetc(f))" segfaults on reading a non-ASCII char; and similarly in this case reading "я" from a Win-1251-encoded file (or "ÿ" from a Win-1252-encoded file) will be treated as (premature) EOF.
I dunno about this style of code, it looks hard to follow :/
> No newlines in keys, values, or section names. Empty values are not allowed. Comments only on their own lines (minus whitespace). Whitespace-insensitive (whitespace at the start of line, end of line, around the “=”, is all ignored). No need for a terminating newline either.
Oh that's more than most C ini parsers do? Isn't that convenient
Nonsense. My ini parser has fewer restrictions (only one, on line length) and looks nothing like that unreadable mess.
Maybe unrelated, but I'm pretty certain I've seen gcc warnings when assigning ints to chars, unsigned or otherwise
<source>:8:18: warning: comparison is always false due to limited range of data type [-Wtype-limits]
8 | if (*ptr == EOF) {
Of course this warning only happens when you target ARM, so I can imagine it's still hard to catch if you do all your development on x86, and only occasionally cross-compile to ARM without heeding compiler warnings.
The bug is that it's missing abstractions. 1) Reading directly from a FILE. 2) Saving local state through a pointer (ptr). 3) A weird way to report back the out state; in the case of EOF it can't distinguish whether ferror happened or not.
4) Probably it would be better to not check for errors right away -- just share a common path with EOF everywhere, and check for I/O errors only at the very end.
5) What is the use of this function? Not sure what's the point, it seems unergonomic. Probably better just code one loop for e.g. names ([_a-z][_a-z0-9]*) and integers / floats etc.
6) What's with the strange (*--ptr), overwriting previous work. Another bug waiting to happen.
7) Why are we returning out + skipped? Returning the consumed characters? That could just be coded in the parser input abstraction (where you would also keep track of the current file offset / possibly line and column (but you can compute those when you need them)).
8) What about the ferror(src) ? (skipped - out) : (out - skipped)?
9) 3 locations where a zero-terminator is written, when there should be _at most_ 1 location, but probably none.
If the code didn't do 2), the bug you describe would probably never have happened. The return value from fgetc() is int, not char. It has to be larger than char to be able to return any char value as well as EOF.
I would probably code something along the lines of
void identifier(Parser *parser)
{
    My_String_Builder *builder = get_string_builder(parser);
    reset_string_builder(builder);
    int c = parser_get(parser);  // current byte, already known to start an identifier
    for (;;)
    {
        string_builder_push(builder, c);
        if (! next_byte(parser))
            break;
        c = parser_get(parser);
        if (!(is_alpha(c) || c == '_' || ...))
            break;
    }
    String *string = string_builder_finalize(builder);
    // check string builder "overflow" / too long / further input sanitization
    push_string_token(parser, TOKEN_IDENTIFIER, string);
}
If my C isn’t too rusty, that means the code can’t discriminate between hitting end of file and reading some byte value from the files (AFAIK EOF is -1 on most systems, so that value would often be 0xFF)
EOF being -1 means that if (*ptr == EOF) can never fire if *ptr can only take on the values 0..255 [i.e., char is unsigned] as opposed to -128..127 [i.e., char is signed]. Whether or not char is signed or unsigned is implementation-defined.
When fgetc succeeds, the int it returns has a value based on interpreting the byte as an unsigned char. On a platform with an 8-bit char, that means it’s in the range 0 to 255. This is required by the C spec regardless of whether char is signed or unsigned. Meanwhile, EOF is required to be negative. Thus, you can always distinguish the cases as long as you look at the original int return value rather than casting to char.
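A minimal sketch of that idiom (hypothetical read_until helper, not the article's function): keep the fgetc() result in an int, compare that to EOF, and only then narrow to char:

#include <stdio.h>

/* Sketch: copy bytes into buf until a delimiter or EOF. The fgetc()
 * result lives in an int, so EOF (negative) can never be confused with
 * the byte value 0xFF (which arrives as 255). */
static int read_until(FILE *src, char *buf, size_t maxlen, int delim) {
    if (maxlen == 0)
        return 0;
    size_t n = 0;
    while (n + 1 < maxlen) {
        int c = fgetc(src);          /* int, not char */
        if (c == EOF || c == delim)
            break;
        buf[n++] = (char)c;          /* narrow only after the comparison */
    }
    buf[n] = '\0';
    return (int)n;
}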
Fun fact: This approach causes trouble on obscure embedded platforms where char and int are the same size (and therefore an unsigned char value can’t fit inside a signed int). Such platforms are allowed by the C standard as freestanding implementations that don’t implement the full standard library, but they can’t conformantly implement fgetc. https://stackoverflow.com/questions/3860943/can-sizeofint-ev...
Every byte in a file, as represented in an int, takes on a value of 0..255. fgetc doesn't return a char, it returns an int, which means it returns a value of -1..255 (i.e., 257 possible values). If you try to represent the return value of fgetc as a char, two of those values get the same representation, namely 255 and -1. The difference between ARM and x86 is that on ARM (unsigned char), the -1-as-255 is represented as 255 when cast back to an int for comparison, whereas on x86, the 255 when cast back to an int is -1, so it would return the same value as EOF (although it would also do so had the value in the file originally been -1).
Representing binary data as unsigned char (as opposed to char or signed char) is the norm however.
This (distinguishing -1 from a character value when char is unsigned) is one of the reasons character literals in C have type int instead of char (C++ changed them to char.) Note the return type of getchar().
One thing I'm seeing: fgetc returns an int, so on a big-endian architecture the `*ptr = fgetc(src);` might yield even more interesting results.
Then, that fgetc() returns EOF on either an EOF or error is its major weakness — mixing in data and control all in-band — one is supposed to check with feof() and ferror() which of those two have happened, and if neither is true, then it's a data byte that just happens to be equal to the EOF constant. (This makes it quite pessimal when you are reading in a file that happens to contain lots of bytes equal to EOF, though).
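In code, the check being described looks roughly like this (a sketch; it only becomes necessary once the result has already been narrowed into a char):

#include <stdio.h>

/* Sketch: once fgetc()'s result has been stored in a char, the value
 * alone can't tell you what happened; you have to ask the stream itself. */
static int classify(FILE *src) {
    char c = (char)fgetc(src);
    if (c == (char)EOF) {
        if (ferror(src))
            return -2;   /* read error */
        if (feof(src))
            return -1;   /* genuine end of file */
        /* neither: the byte just happens to equal (char)EOF, i.e. 0xFF */
    }
    return (unsigned char)c;
}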
> If instead you're on a system like ARM where char is unsigned, then you can never read EOF.
Then you should probably use an intermediate variable which is a proper int, gives you a surefire way to tell 0x000000ff and 0xffffffff apart.
But it's all gotchas like this which make me cringe every time someone suggests C is oh so very good and "simple", or even (gasp!) "convenient" to work with strings in general and text in particular. It's anything but. Hell, maybe it was better than alternatives back in some 1976, but then awk and Perl got invented, and Python followed soon after.
char cannot legally represent EOF (as you point out). This doesn't really have anything to do with ARM, though; for instance, the ARM compiler I use has command-line options and pragmas that control whether char is signed or unsigned. Char is just too small.
The type of 'out' (int) is different from the ssize_t of maxlen, which may not matter for the practical ranges involved in parsing a file, but it's certainly a bad sign.
Writes to memory it doesn't own if maxlen == 0 (unlikely, but who knows?)
Additionally: calling strchr on every input character is crazy. Frankly, the same applies to fgetc.
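One common alternative (a sketch, not from the article) is to precompute a 256-entry membership table for the delimiter set, so the per-character test is a single array index:

#include <string.h>

/* Build the table once per delimiter set... */
static void build_delim_table(const char *delims, unsigned char table[256]) {
    memset(table, 0, 256);
    for (; *delims; delims++)
        table[(unsigned char)*delims] = 1;
}

/* ...then inside the hot loop:
 *     if (table[(unsigned char)c]) { ... found a delimiter ... }
 * Similarly, fgetc() per byte can be replaced by fread() into a buffer. */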
The retval is an int. It only gets treated as a char if it isn't EOF. The caller is supposed to know this and do the right thing rather than blindly downcasting. This is more apparent with the wide char library where there is a special wint_t with the sole purpose of serving as a return value that represents a wchar_t or WEOF.
AFAICS this is actually potential UB on most implementations: not only does fgetc() return -1 as an int in the failure case, it returns the byte as unsigned char in the success case. So on an implementation that defines plain char to have the same size and representation as signed char (as x86 does), reading a byte beyond SCHAR_MAX causes UB through signed overflow on the first line of the loop.
(This is a close relative of the well-known footgun that the is*() functions from <ctype.h> accept an unsigned char value as an int, and it’s completely valid—though rare in practice—for them to blow up when passed, say, "\xFF"[0] instead of ((const unsigned char *)"\xFF")[0] on an implementation with CHAR_BIT==8 and SCHAR_MIN<0.)
Overflow from signed arithmetic always results in UB. But an unsigned-to-signed conversion outside the range of the target type just results in implementation-defined behavior; see C17 §6.3.1.3 ("Signed and unsigned integers"). And most people these days (including me) take the simple approach of not supporting any implementation that doesn't just perform two's-complement wrapping for all conversions.
> And most people these days (including me) take the simple approach of not supporting any implementation that doesn't just perform two's-complement wrapping for all conversions.
Not quite; C23 requires a two's-complement representation, but it doesn't require two's-complement wrapping for unsigned-to-signed conversions. §6.3.1.3 hasn't been changed, so it's still left to the implementation to guarantee that. The only easily-visible effects of requiring a two's-complement representation are that X_MIN == -X_MAX - 1 for all signed integer types, and in general that the object representations of x and -x - 1 differ by only 1 bit, if the integer type has no padding bits.
Ah, so the “two’s complement from outer space” option (with X_MIN == -X_MAX and a trap representation in place of -X_MAX - 1) is also finally gone? I did not notice that, thanks.
C18 6.2.6.2p3 had:
> [It] is implementation-defined [...] whether the value with sign bit 1 and all value bits zero [...] is a trap representation or a normal value [for two’s complement].
GPUs notionally - I can't speak for truly "modern" GPUs but I recall older GPUs having language level "char"/"byte" implemented as a float (presumably with some minimal support to get expected semantics like clamping?)
Another bug: if `maxlen <= 0`, then it will overwrite the character before the start of `ptr`. And then there's the infinite loop if `maxlen > INT_MAX` on 64-bit machines. And maybe `out + skipped` can wrap as well?
One more: if an error occurs while reading the first character, the function returns -out = 0, which violates the documented behavior that errors are negative, and means a caller can’t distinguish between a failing to read and reading an empty file.
maxlen is meant to be a constant provided by a #define, so I wouldn't be worried about things going haywire if maxlen is a weird value.
But C-like code in general makes me nervous in parsers, since there's many ways things can go wrong and C is inherently fail-deadly if you make a mistake.
Years ago I ported some software from IRIX to some Linux-based OS. There were a few differences in the result. I eventually tracked it down to how on IRIX char is unsigned while the Linux-based OS used signed char.
The program used a 'char' to store the value of the formal atomic charge of an atom, which is typically 0, but for the sorts of chemistry I deal with can be +1, +2, -1, or -2, so it makes sense to allocate only 8 bits to store the 'char'ge. :)
On IRIX, where they had deployed the code for years, the charges were actually being interpreted as 0, +1, +2, 255, and 254.
Going back to the code example you gave, isn't there also an issue if read failure occurs on the first byte? Then `out` is 0, returning -0, which is 0, giving no way for the caller to distinguish between EOF and read failure.
That's a simple case, I've seen much worse because of endianess and different archs have different conventions for whether char is signed or unsigned by default (common these days with arm/X86). Any C programmer who's coded for X86 and ARM will know what the rules are and how not to break them (I'd like to hope, anyway).
I almost got it. I was like “whoa wait, does not fgetc return int? You truncate it to char, can you even compare it to EOF constant now?”. I did not remember the actual EOF constant or that char is unsigned on ARM, but I think I would pay attention to the compiler throwing a warning at me (it does throw a warning, does it not?)
And rightfully so. People who aren't afraid of them generally fail to understand all of the ways in which parsing can show fractal complexity, and will mostly stick to toy examples like this INI parser to justify their positions.
If you're gonna argue that parsing is simple, the bare minimum I'd want to see implemented is a context-sensitive grammar with unbounded lookaheads (or at the very least, that is capable of handling more than one token of lookahead), with proper support for Unicode, and actual error resilience (not what this article calls error resilience)
If you manage to do all that and can still call what you did "simple" without having completely deluded yourself, congratulations, I hope to be on your level some day.
PS1: I won't even go into the plethora of security issues originating from crappy parsers, especially those written in C
PS2: Let's also leave aside any matters related to correctness and validation of parsers, which are notoriously not by any means "simple".
UTF-8 was designed so that you don't have to worry about it. Supporting UTF-8 in a parser is trivial, basically just parse as if it were ASCII but don't barf on the bytes >= 128.
As long as all your delimiter chars are ASCII, it just works.
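In a byte-oriented lexer that boils down to something like this (a sketch; it assumes identifiers may contain arbitrary non-ASCII bytes):

#include <ctype.h>

/* Any byte with the high bit set is part of a multi-byte UTF-8 sequence,
 * so just treat it as an ordinary "word" byte; all delimiters stay ASCII. */
static int is_word_byte(unsigned char b) {
    return b >= 0x80 || isalnum(b) || b == '_';
}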
Errors in C are usually because of missing abstractions or the wrong approach. C gives you data layout, flow control, and functions, you can go a long long way with just that.
> unbounded lookaheads
If you want to require that, you get what you deserve. But implementing it is just a matter of putting a queue of tokens in front of your parser that supports look(n) separately from consume().
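A rough sketch of that queue in C (hypothetical Token type and next_token() lexer function; swapping the fixed ring for a growable array would make the lookahead truly unbounded):

#include <stddef.h>

typedef struct { int kind; /* ... payload ... */ } Token;

Token next_token(void);                    /* assumed: provided by the lexer */

enum { LOOKAHEAD_CAP = 64 };               /* fixed cap for the sketch */

typedef struct {
    Token  buf[LOOKAHEAD_CAP];             /* ring buffer of pending tokens */
    size_t head, count;
} TokenQueue;

/* look(n): peek n tokens ahead without consuming them (n < LOOKAHEAD_CAP). */
static Token look(TokenQueue *q, size_t n) {
    while (q->count <= n)
        q->buf[(q->head + q->count++) % LOOKAHEAD_CAP] = next_token();
    return q->buf[(q->head + n) % LOOKAHEAD_CAP];
}

/* consume(): commit to the front token and drop it from the queue. */
static void consume(TokenQueue *q) {
    (void)look(q, 0);                      /* ensure there is one to drop */
    q->head = (q->head + 1) % LOOKAHEAD_CAP;
    q->count--;
}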
I see this type of sentiment a lot and I'm not sure why this exists. Maybe it's because there were a bunch of formats in the past and it made it more difficult? Idk.
Anyways, I finally decided to "bite the bullet" and prepared a solid week to finally do the "nitty gritty" of writing a UTF-8 validator/logging library. Turns out, it was super easy and took me like an hour to read through the RFC and maybe 2 more hours to write a simple implementation.
For anyone that's curious, give it a read here[0], it's surprisingly readable and the format is very simple and elegant. I don't say simple as in dumb either, I say simple as in they made the problem as simple as it needs to be with no unneeded complexity, and it's a breath of fresh air.
Also, it's written in such a way that any valid ASCII is valid UTF-8. So at the very least, you can just check if you encounter any bytes with the highest bit set in the string before parsing. If that's the case you can throw an error saying you don't support UTF-8 and avoid parsing potentially invalid data (not that it's particularly difficult to validate the UTF-8 if you want to).
Depends on where you draw the line of what a parser is:
- If the parser is "the thing that comes after the lexer" then all of this is abstracted away by the lexer and you can just treat it as a span of bytes;
- If the parser is "everything that needs to be implemented to correctly transduce the input sequence into a tree", then you need to implement this yourself or have a lexer that handles this for you, usually done by having a tiny UTF-8 codepoint recognizing FSM in your lexer (UTF-8 is a self-synchronizing code, which makes this part easier) and ignoring the existence of graphemes.
Most people, however, shy away from implementing a parser "all the way down to the bytes" and properly handling UTF-8 as a formal language. Most lean on a lexer abstracting this away. Ditto for context-sensitivity.
Recently Rust's regex engine underwent a major overhaul, and burntsushi wrote a blog post[0] about doing the "all the way to the bytes" thing in the new regex engine, I highly recommend the read:
Yeah I'm saying why does your lexer actually need to be UTF-8 aware? (An actual question, because maybe I'm not thinking of some obvious case.)
Most of the lexical/syntactic elements of languages are not in UTF-8. You're looking for things like semicolons and quotes and whitespace. If you don't change the language syntax/lexical elements so that those parts stay as the ASCII subset of UTF-8 then why does your lexer need to be aware of UTF-8? It can just accumulate everything else as bytes and it doesn't matter what format the bytes are. The parser and/or codegen will do equality checks for lookups later on but that doesn't need to be UTF-8 aware either?
If you don't want to error/warn on invalid UTF-8 but instead handle it with the "garbage in, garbage out" principle, then yes you're right, treating them pure byte streams works.
Yeah that makes sense. It doesn't really strike as the job of the compiler/parser to validate UTF-8. If you've got a messed up text editor/OS environment that's going to be a problem for lots of things.
What dezgeg said is pretty much spot on, and also I think what you're describing related to "compiling the codepoints down to bytes" is in many ways equivalent to handling the UTF-8.
My opinion, stated in a way that a TigerBeetler will resonate with ;), is that I want to be able to handle radioactive levels of corruption in my inputs, and still parse them without blowing up, issuing great error messages along the way.
Ok, one reason I can think of why you'd want to be UTF-8 aware is so that your error messages at any part of the parser could point to the exact column in the line of text. The line number you could get without being UTF-8 aware. But the column number you couldn't get without being UTF-8 aware.
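And that part is only a few lines (a sketch; it counts codepoints, not grapheme clusters or display width):

#include <stddef.h>

/* Column of `pos` within the line starting at `line`, counted in codepoints:
 * UTF-8 continuation bytes (0b10xxxxxx) never start a new column. */
static size_t column_of(const char *line, const char *pos) {
    size_t col = 1;
    for (const char *p = line; p < pos; p++)
        if (((unsigned char)*p & 0xC0) != 0x80)
            col++;
    return col;
}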
Yes whitespace in unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n which are expressible in ASCII.
Maybe I'm wrong though, just an assumption about what's common.
There are also other concerns depending on your threat model: if you're parsing user-generated strings you definitely want to be able to handle corrupted unicode, for security reasons, and in these scenarios the way you handle recovery if you choose to do so may aggravate exploitation.
Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?
So many ways to shoot yourself in the foot by "abstracting away" the formal semantics of your inputs, I think it's pretty much never worth it. (An interesting search term here is LangSec)
> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?
Maybe I'm misunderstanding you, but because of how UTF-8 is a superset of ASCII, I don't believe you can misrecognize ASCII characters if that's what you mean.
You are correct that this is a detectable and entirely preventable failure, however, this is the way in which this can manifest:
- UTF-8 is a prefix-free self-synchronizing code;
- If the first byte of a UTF-8 codepoint starts with 0b0??????? then it is ASCII, and all is well;
- If the leading byte of the codepoint is 0b110????? it means there is one continuation byte to follow. If it's 0b1110???? there are two to follow, and so on up to a maximum of three continuation bytes (four bytes total), which is the limit for UTF-8;
- All continuation bytes have the pattern 0b10?????? and UTF-8 self-synchronizes based on detecting the leading byte;
- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state-machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);
- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker makes the last codepoint one that expects a single continuation byte but does not include the continuation byte, the fancy pantsy "optimized" parser will skip over the closing quote and decohere the parse. This is only safe to do on pre-validated input, but even then it's kind of not worth it if you have access to a SIMD accelerated UTF-8 validator
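For what it's worth, the byte-at-a-time structural check is short (a sketch: it catches stray or missing continuation bytes and truncated sequences, but a full validator, like Lemire's SIMD ones, must also reject overlong encodings, surrogates, and codepoints above U+10FFFF):

#include <stddef.h>

/* Sketch: structural UTF-8 check. Returns 1 if every sequence has a valid
 * leading byte and the right number of 0b10xxxxxx continuation bytes, and
 * nothing is truncated at the end of the buffer. */
static int utf8_structurally_valid(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i++];
        size_t follow;
        if      (b < 0x80)           follow = 0; /* 0xxxxxxx: ASCII        */
        else if ((b & 0xE0) == 0xC0) follow = 1; /* 110xxxxx               */
        else if ((b & 0xF0) == 0xE0) follow = 2; /* 1110xxxx               */
        else if ((b & 0xF8) == 0xF0) follow = 3; /* 11110xxx               */
        else                         return 0;   /* stray continuation or
                                                    invalid leading byte   */
        if (follow > len - i)        return 0;   /* truncated sequence     */
        while (follow--)
            if ((buf[i++] & 0xC0) != 0x80)       /* must be 10xxxxxx       */
                return 0;
    }
    return 1;
}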
C programs start out in the "C" locale, so just using fgetwc() won't work out of the box (or won't do what you expect it to do). You'll need to call setlocale(LC_ALL, "") to get the expected behavior.
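Roughly (a sketch):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");         /* adopt the environment's locale       */
    wint_t wc;                     /* wint_t, for the same reason fgetc()
                                      returns int: it has to hold WEOF too */
    while ((wc = fgetwc(stdin)) != WEOF) {
        /* wc is a decoded wide character */
    }
    return 0;
}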
Agreed on all counts, except with the remark that even with sane grammars unbounded lookahead will still appear if you want to have IDE-grade error resilience.
But I wholeheartedly agree with the sentiment of "don't make the grammar look like Scala" <3
The “IDE-grade” error resilience can be approached in many ways, and unbounded lookahead is often unnecessary. The main problem you want to solve is the problem of recovering from an error, and parsing more of the file correctly after you encounter an error. One way you can do this is by finding statement or declaration boundaries, which can be done in surprisingly simple ways.
I mostly agree with this, but what I mean with the unbounded lookahead part of it is that bounding the amount of speculation (or lookaheads/backtracking) is equivalent to limiting the "size" of the error you can recover from.
You should definitely have bounds though, but the point is that if it's too low you might give up on the input too soon.
Sometimes you just haven't met the right abstraction yet. I'll be that guy talking about parser combinators hopefully before the rest of this thread fills up with them. I don't think my parser does everything in your bare minimum (I haven't really thought about utf-8!) but it does do some other pretty advanced stuff. For example it leans on white-space pretty hard to figure things out. No curly braces or semicolons, and parentheses are only for precedence, not function application.
What my parser does do:
* backtracking/alternatives
* Some context-sensitivity, in-so-far as it can tell a negate from a minus.
Where it got a little hard:
* I realised I was parsing division the wrong way. a/b/c/d became a/(b/(c/d)), not the other way around.
Where it got medium hard:
* Distinguishing unary minus from binary minus. I thought it would be really hard, but I only needed to look at the previous token to decide whether something was a TokNegate or a TokMinus.
Where it got hard:
* White-space/indentation sensitivity. I needed to first calculate the line-breaks and make that information (gathered during lexing) available during parsing.
Where it got really hard:
* LEARNING how to factor out the left-recursion. There were times when I literally thought it was impossible. I knew about 'precedence' in the back of my mind, but I didn't realise how the concept mapped to the code yet. By example: one sumExpression is (many or one productExpressions separated-by-'+'), and one productExpression is (many or one unaryExpression separated-by-'*'), and so on. You don't end up in an infinite-parse-loop if you try to parse the least-tightly-binding expressions first (which just seemed so counterintuitive that I guess I never tried?).
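The same layering in C-flavoured recursive descent looks roughly like this (a sketch with hypothetical peek()/advance()/parse_unary() helpers; each level parses the next-tighter one and then loops over its own operators, which also gives you the left-associative a/b/c == (a/b)/c for free):

int  peek(void);            /* assumed: next token kind, without consuming */
void advance(void);         /* assumed: consume it                         */
long parse_unary(void);     /* assumed: literals, unary minus, parens      */

/* product := unary (('*' | '/') unary)*  -- tighter binding */
static long parse_product(void) {
    long v = parse_unary();
    while (peek() == '*' || peek() == '/') {
        int op = peek();
        advance();
        long rhs = parse_unary();
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

/* sum := product (('+' | '-') product)*  -- loosest binding, parsed first */
static long parse_sum(void) {
    long v = parse_product();
    while (peek() == '+' || peek() == '-') {
        int op = peek();
        advance();
        long rhs = parse_product();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}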
But I've yet to say why I like parser combinators so much (and think they're at least the 'simplest' way to do things, if not 'simple'):
You get to write code which looks like the BNF definition!
Just like TFA, I'll take a Lua example[1]
var ::= Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name
I would code this something like:
var = name <|> case2 <|> case3
  where
    case2 = do
      pe <- prefixexp
      e  <- char '[' *> exp <* char ']'
      return (pe, e)
    case3 = do
      pe <- prefixexp
      _  <- char '.'
      n  <- name
      return (pe, n)
It more or less maps exactly onto the BNF, and in the above case, the extra complexity came from capturing the subexpressions and returning them to the caller. If I wrote a grammar to simply accept/deny its input (rather than trying to build an AST out of it), it could resemble the BNF even more:
Bnf definition vs. executable code:
var ::= Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name
var = name <|> (prefixexp >> char '[' >> exp >> char ']') <|> (prefixexp >> char '.' >> name)
I will say one other thing about the simplicity, which is - I didn't use an existing parser combinator library. They're simple enough to roll your own. There's only one trap which I can think of, which is where to draw the line on automatic-backtracking. I.e. Should the caller explicitly need to insert 'try's to enable backtracking.
> Cute, now do it with UTF-8 support.
Ironically I think this is the one feature where I'd prefer to be in C. C's approach with bytes is perfectly forward-compatible. A higher-level language might be more opinionated about its String type (restricting what you can or can't accept with your parser) or have funny definitions about length().
But this implementation of a reduced feature set version of .ini file parsing does not convince me that I should write my own parser instead of using one that implements a more full feature set
> No newlines in keys, values, or section names. Empty values are not allowed. Comments only on their own lines (minus whitespace). Whitespace-insensitive (whitespace at the start of line, end of line, around the “=”, is all ignored). No need for a terminating newline either.
I think it's reasonable that people want to use a parser that has better error handling and gives an idea of where the ini file may have parsing problems than just a barebones implementation such as this provides.
I also think that using a library for parsing instead of writing your own parser does not imply that you are scared of parsing.
> But this implementation of a reduced feature set version of .ini file parsing does not convince me that I should write my own parser instead of using one that implements a more full feature set
A performance comparison on a large ini file might.
You may want to take the time to watch this video[0]. In it, Andreas Fredriksson walks through his reasoning for writing his own parser instead of using the standard json parser.
I don't think I've ever seen ungetc() used in production code.
I appreciate the "do it yourself" thing - but in 2023 unless you're building a product for a known ascii system, you're setting yourself up for pain when your code is run in San José,
ungetc() exists for exactly this application. But when you want a more abstract input stream than FILE* you have to reimplement that yourself at a higher layer. Ironically, even the libc implementation needs a more generic abstraction to implement scanf() and sscanf() (i.e. you either have two separate implementations of these or don't internally use ungetc()).
This looks interesting. I created and maintain a library for INI parsing that got surprisingly popular -- it's tiny, so is good for embedded systems. This API has a very similar feel to mine, including the callback for every key/value pair with a void* userdata. https://github.com/benhoyt/inih
I don't really understand this. I'm comfortable enough programming in C, but I find it the least suited to parsing compared to pretty much every other higher-level language. In particular, I find parser combinators (be it in lisps, MLs, Haskell) to be infinitely more approachable and mentally manageable than anything easily doable in C.
(Yes, you can do parser combinators in C, but it's very very ugly unless you stick to clang and enable block support).
If anything I'd expect to be told to use a high(er) level language for parsing as a front end to C code if needed.
If you just jumped to the code, it's in the first 2.5 sentences:
> People are terrified of parsers and parsing. To the point of using magical libraries with custom syntaxes to learn just to get started. In the hopes of completely shattering this preconception
Seems mostly to me about making parsing less "magical" for people who don't understand what's really going on.
It doesn’t; it’s a convention that the programmer is expected to honor. You could pass in a different function pointer instead (or something that isn’t a function pointer at all), and the behavior would be undefined.
Total hogwash. It's extremely easy to write parsers in a safe way. I've done a few and can do it in my sleep. Reading chars one by one in the lexer, reading tokens one by one in the parser. Pushing stuff on stacks. The "dangerous" parts can be reduced to a dozen lines.
Coding in C doesn't mean that you can't code function abstractions. Instead of storing things through pointers with pointer arithmetic and indexing all over the place, and doing manual bounds checks everywhere, you code a few functions like push_char(), push_token() etc.
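For instance, a hedged sketch of what one of those helpers might look like (names made up):

#include <stddef.h>

typedef struct {
    char  *data;
    size_t len, cap;
    int    overflowed;   /* sticky error flag, checked once at the end */
} CharBuf;

/* The bounds check lives in exactly one place, instead of being repeated
 * (and occasionally forgotten) at every call site in the lexer. */
static void push_char(CharBuf *b, int c) {
    if (b->len + 1 < b->cap)
        b->data[b->len++] = (char)c;
    else
        b->overflowed = 1;
}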
True in theory, that is why given my Turbo Basic, Turbo Pascal, Modula-2 background, when I code in C, I do exactly that.
I follow a pseudo-ADT (Abstract Data Type) approach, with the classic modulename_function() pattern; everything that shouldn't be exposed is marked static in the implementation file, and for the few cases where using ADTs has a possible performance impact, there is one macro or another.
However, that is not how most C developers program, regardless of how many books, ACCU and The C Programmers Journal articles, and conference talks on how to write proper C have been written.
The difficulty of implementing parsers (for the majority of programmers or engineers) is one of the tragedies of computer science. It is a very well developed theory and apparently it is too hard to deliver in a usable form without becoming one of those "to bake an apple pie you need to invent the universe" experiences. I just wanted a pie! Like the html-ish parser in Graphviz:
The trick for “lets hand write a parser, it is simple” is in coming up with grammar that is either LL(1) or slight superset of that that can still be parsed by recursive descent with hand rolled deconflicting rules (prime example of that is if-else in C). The issue with that is you cannot just take an arbitrary context-free grammar with attached semantics and mechanically transform it into LL(1) form, because while doing that you are moving production rules around and thus you lose any kind of semantics that were originally attached to them. This is the reason why the parser generators tend to be (LA)LR, you can produce such parser for wide (albeit in practice ill-defined) subset of context-free grammars purely mechanically without having to change the meaning of the attached semantic rules (typically represented as a bunch of copied code).
The thing is that parsing is a solved problem and has been for decades. But because this sounds like you must be incompetent if you can't do it, people get very defensive if you say it's easy. Everyone who believes that parsing is terrifying is highly invested in that belief. And it does take time to learn how parsers work. Write a parser, a parser generator, a regex engine, and then tell everyone else "it's easy" and see how that goes.
> I will write a parser for the “ini” file format in about 150 lines of pure and readable ISO C99
Sounds like overkill. Most often you don't need a full-fledged ini format, but just a list of "KEY=value" pairs that can be parsed with a single call to scanf.
scanf() is probably not the right approach to that, but still you can implement a parser for the common INI syntax in about a third of the code with fgets(), strchr()/str(c)spn and a bunch of hand-rolled ad-hoc logic.
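Something shaped roughly like this (a sketch; whitespace trimming, [section] headers, and ';' comments left out, line length capped):

#include <stdio.h>
#include <string.h>

/* Sketch: read "KEY=value" lines with fgets() and split on the first '='. */
static void parse_lines(FILE *f,
                        void (*emit)(const char *key, const char *value)) {
    char line[256];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';   /* drop the newline          */
        char *eq = strchr(line, '=');
        if (!eq)
            continue;                         /* not a key=value line      */
        *eq = '\0';                           /* split in place            */
        emit(line, eq + 1);
    }
}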
Agreed in principle - we shouldn't be afraid of token streams. Thinking in terms of JSON:
1. Its portable - JSON token streams look pretty similar in every language.
2. You're in control. Switching parsing libraries is painful when it's baked into your code (I'm currently weeding out a now unsupported parsing lib and it's painful).
3. It's flexible - try parsing heterogeneous websocket JSON streams with e.g. Swift's Codable. Possible: yes. Easy: no.
4. It's really fast.
If you're in control of the source and sink, the above probably doesn't apply - then it's trivial to make struct-based parsing fast and easy.
Ignoring the numerous correctness issues people have pointed out, parsing .ini simply isn't interesting as a thing to parse. In the sense that more or less any real data/source that you have to parse is more complex and structured than an ini, so any "lesson" from parsing ini is inapplicable to anything else.
"Any more or less real programming language is more complex than an arithmetic expression language with only four operations so any 'lessons' from such a simple example is inapplicable to anything else."