White space does matter in C23 (gustedt.wordpress.com)
103 points by ingve on Jan 18, 2024 | 52 comments



Whitespace in this context is used as a token separator. Where you place the space is of course important. --i is very different from - -i, but - -i is the same as -   -i.

When people say "whitespace matters", what it means is whether the meaning changes when you have more than 1 not more than 0.
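
A minimal C sketch of that distinction (the function names are made up for illustration): `--` is one decrement token, `- -` is two unary minus tokens, and widening the gap between them changes nothing.

```c
/* Hypothetical helper names; a sketch, not taken from the article. */

/* "--" is a single token: pre-decrement, yields the value minus one. */
int decrement(int i) {
    return --i;
}

/* "- -" is two unary minus tokens: negated twice, value unchanged. */
int double_negate(int i) {
    return - -i;
}

/* More whitespace between the two tokens does not change the meaning. */
int double_negate_wide(int i) {
    return -     -i;
}
```

So going from zero spaces to one changes the tokens, while going from one space to many does not.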


Yeah, by TFA's definition whitespace matters in K&R C because of things like:

inta;

vs

int a;

To which I can just say "duh!"


Right, or various pathological examples involving comment delimiters being mistaken for divide and multiply operators
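
One such case, sketched in C with made-up function names: dividing by a dereferenced pointer only works when a space keeps the slash and the star apart.

```c
/* Hypothetical helper names; a sketch of the pitfall, not real-world code. */

/* With the space, "/" is division and "*" dereferences p. */
int divide_by_pointee(int a, const int *p) {
    return a / *p;
}

/* Without the space, slash followed by star opens a comment instead. */
int comment_instead(int a, const int *p) {
    (void)p; /* p ends up unused here */
    return a/*p; everything up to the closing delimiter is a comment */;
}
```

With `a = 12` and `*p = 4`, the first function divides and the second just returns `a`.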


Yeah, this rage-inducing headline is completely unwarranted.


It’s not a token separator in the "#define foo(x)" example. You never get a "foo(" token, regardless of whitespace, but whitespace decides whether it’s a function macro definition or a constant definition. The standard needed to invent a new token called lparen, which is a left round bracket that is not preceded by whitespace, to make it make any sense.
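
A small sketch of that difference, assuming two made-up macros `TWICE` and `NEXT`: the only thing separating a function-like definition from an object-like one is whether whitespace precedes the opening parenthesis.

```c
/* Hypothetical macro and function names, for illustration only. */

/* No space before "(": function-like macro with one parameter. */
#define TWICE(x) ((x) + (x))

/* Space before "(": object-like macro whose replacement list is "(x) + 1". */
#define NEXT (x) + 1

int use_function_like(void) {
    return TWICE(21);   /* expands to ((21) + (21)) */
}

int use_object_like(void) {
    int x = 41;
    return NEXT;        /* expands to (x) + 1, picking up the local x */
}
```

Both functions happen to return 42, but for entirely different reasons.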


Oh yeah, the C preprocessor is extremely mediocre, even compared to other text-based macro preprocessors


> When people say "whitespace matters", what it means is whether the meaning changes when you have more than 1 not more than 0.

I agree. After reading the blog post, it's clear that the author should have called it "syntax errors do matter in C23"


I think most programming languages have syntactically significant whitespace. I believe Fortran doesn't (or didn't), which helped a bug fly under the radar at NASA:

  DO 10 I=1.10
Which got interpreted as:

  DO10I = 1.10
Whereas the programmer wanted:

  DO 10 I=1,10
For a DO loop. Conversely, with SSW a language will evaluate these two expressions differently:

  inta = 10;
  int a = 10;


Algol 60 also allowed whitespace in variable names. But they had a solution to avoid Fortran's confusion: keywords had to be specially marked. https://en.wikipedia.org/wiki/Stropping_(syntax)


White space in variable names is a bad idea.

And not everything that is possible is worth doing; e.g. I once designed a language ("Leazy") where keywords don't have to be declared, and can be used as variable names, just to show I could still write an LL(1) recursive descent parser for it. You don't want that in anything for daily use, as it introduces confusion.


Even more fun is zero length names. In SQLite they didn't require table and column names to be at least one character, so you can do this:

    CREATE TABLE []([] []);
Which will create a table with zero length name containing one column with a zero length name and zero length type. And yes you can do all the regular SQL against them providing you quote the zero length name.


> where keywords […] can be used as variable names

PL/I has that, too, because the designers thought you couldn’t expect programmers to know all keywords. For PL/I, that’s a correct assumption. Implementations can have hundreds of keywords, and some of them are single-letter (http://bitsavers.trailing-edge.com/pdf/ibm/series1/GC34-0084... pages 19-25 mentions A, B, E, F, P, R, S, V and X)


Meanwhile both SQL (and more recently Python) have tokens that are keywords in certain contexts and regular identifiers in others.


Java too:

> A further ten character sequences are restricted keywords: open, module, requires, transitive, exports, opens, to, uses, provides, and with. These character sequences are tokenized as keywords solely where they appear as terminals in the ModuleDeclaration, ModuleDirective, and RequiresModifier productions (§7.7).

https://docs.oracle.com/javase/specs/jls/se12/html/jls-3.htm...


PL/I was infamous for it being possible to express valid code that read "IF IF THEN THEN ELSE ELSE".


C++ has some contextual keywords as well:

  final (C++11)
  override (C++11)
  import (C++20)
  module (C++20)


C# is probably the king of contextual keywords, they have tens of them. They started I believe with `yield return` and `yield break` and then added all of the LINQ ones (select, from, where etc), and kept adding others (notnull, record, required, etc).


But did they make those contextual in order to avoid conflicts with code that had existing identifiers before the new keyword was specified?


Yes.


I think the way F# implemented it is pretty good: if you want to use a keyword or whitespace in a variable or function name, it has to be enclosed in double backticks.

e.g:

  [<Property>]
  let ``Reverse of reverse of a list is the original list`` (xs:list<int>) =
      List.rev(List.rev xs) = xs


I'm not sure if F#'s implementation is good enough. The specification [1] suggests that it doesn't do any additional normalization to the resulting identifier, so otherwise identical identifiers with a single space, a single tab and two spaces would be different. I would expect them to collapse into a single space character.

[1] https://fsharp.org/specs/language-spec/4.1/FSharpSpec-4.1-la...


Why would you expect that? I would explicitly expect not that. Unicode normalisation (typically to NFC) maybe, but nothing more.


Mostly because it's rare to see identifiers with whitespace allowed after all. So this has to fulfill some specific needs.

The main use case here seems like self-describing properties that will be hardly referenced elsewhere. In principle this doesn't really need any new syntax, as you can put the description into the attribute (`[<Property>]` here) and the name itself can remain arbitrary or even be made anonymous. But we've got textual identifiers instead. So I guess that some code does refer to those textual identifiers, and if that's the case, being able to ignore invisible differences when comparing textual identifiers looks like a good idea as well.


We’d have long debates about spaces vs. tabs in identifiers. ;)


> White space in variable names is a bad idea.

In many contexts yes but as a blanket statement no. Calca allows whitespace in names and it is a delight to write as a result


> White space in variable names is a bad idea.

Pff. You can have your cake and eat it too: disallow whitespace in variable names except no-break space. ;)


And the same thing for filenames!

Writing shell script under the assumption that filenames do not contain spaces is a liberating experience. I want more of that! And it is nearly possible, by tr ' ' 0x00A0'ing every call to fopen, (probably as an option for mount).


Pfft, just make your syntax a prefix-free code.


Fortran '90 and later has some requirements for blanks, but a parser that also needs to be able to parse F'77 can't rely on them -- so I have to go out of the way to detect missing blanks and complain about them.

This feature makes some tokenization ambiguous without context -- is MODULEPROCEDUREFOO to be interpreted as "MODULE PROCEDUREFOO" or "MODULE PROCEDURE FOO"? But tokenization without any reserved words is a tricky problem anyway.


FORTRAN also had (has?) significant columns:

https://web.stanford.edu/class/me200c/tutorial_77/03_basics....


Would syntax highlighting have saved them here? That's how I usually spot any subtle typo bugs I write. Never seriously used fortran, but a line looking a little bit off usually draws my eye to it.


Hard to highlight a punched card.


R"(x)" literals are neat not just because whitespaces matter, but also because they are tokenized before macro expansion. Thus you can write a C23 detector like this:

   #include<stdio.h>

   #define r(R) R"()"

   int main()
   {
      puts(r()[0] ? "C99"  /* r() evaluates to "()" */
                  : "C23"  /* r() evaluates to "" */);
   }
Output: https://gcc.godbolt.org/z/Wj3s6KEGK

I have used that trick here:

https://www.ioccc.org/years.html#2015_yang

(C23 wasn't a thing back then, but the same trick can be used to differentiate C++11 from C++98).


This is checking for the presence of raw string literals (A GNU C extension) not C23. If you compile with `-std=gnu99` instead of `-std=c99` you'll get "C23" as output.


My bad, I just saw "R()" in the linked blog and thought the feature made it to C23, but looks like it's not standard.

https://en.cppreference.com/w/c/23

On the plus side, I now have a GNU extension detector.


The context is different standard versions. Random extensions don't count. C23 has raw string literals, C before 23 doesn't.


No it doesn't. If you don't specify a standard for GCC it uses GNU extensions by default.


Indeed, my bad :/


> C23 has raw string literals

Are you sure about that? I only see u, u8, U, and L defined as encoding-prefixes.


Eep, sorry. I misread the article.


> Thus you can write a C23 detector

Can't you just check `__STDC_VERSION__` ?

(Or, is your way a "But, where's the fun in that?" exercise? Actually, if you're an ioccc entrant, "where's the fun in that?" does suddenly become a leading hypothesis ;-)
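
For reference, a sketch of the boring way, assuming a made-up helper `c_standard_name`. The constants are the values each published standard assigns to `__STDC_VERSION__`; compilers in a pre-release mode may report something in between.

```c
/* Hypothetical helper; maps __STDC_VERSION__ to a readable name. */
const char *c_standard_name(void) {
#if !defined(__STDC_VERSION__)
    return "C89/C90";   /* the macro did not exist yet */
#elif __STDC_VERSION__ >= 202311L
    return "C23";
#elif __STDC_VERSION__ >= 201710L
    return "C17";
#elif __STDC_VERSION__ >= 201112L
    return "C11";
#elif __STDC_VERSION__ >= 199901L
    return "C99";
#else
    return "C95";       /* 199409L, from Amendment 1 */
#endif
}
```

Unlike the string-literal trick, this reports what the compiler claims to implement rather than probing how it actually tokenizes.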


And that manages to be the most intelligible part of prog.c


Another easily recognizable part is:

    x*=02//* */2
...which was used to differentiate C89 and anything later.

(More detailed explanation, if you can speak Japanese or tolerate machine translation, is available on: https://mame.github.io/ioccc-ja-spoilers/2015/yang.html )
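
Spelled out as a sketch (the function name is made up, and the statement is split so the semicolon survives either reading): a compiler that knows `//` comments sees `x *= 02`, while a strict C89 compiler sees `x *= 02 / 2` with a block comment in the middle.

```c
/* Hypothetical wrapper around the trick quoted above. */
int comment_style_factor(void) {
    int x = 1;
    /* With line comments: "x *= 02" and the rest of the line is ignored.
       Without them: "x *= 02 / 2", the block comment sits between "/" and "2". */
    x *= 02//* */2
    ;
    return x;
}
```

This should return 2 wherever `//` comments are recognized and 1 in a strict C89 mode (for GCC that would presumably mean `-std=c89`, since its GNU modes accept `//` as an extension).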


That's clever!


> Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of programming text

I can think of only one exception. In function-like macro definitions, the opening parenthesis `(` must directly follow the identifier. Though I guess the newline is significant in macro definitions in general, too.

Are there other places where white space matters?


Yes, in quite a few places.

Most pervasively, between any two letters and numbers. `unsignedint` is different from `unsigned int`, as is `inta` from `int a`. Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.

`&&`, `||`, `<<`, `>>`, `++`, `--`, `//`, `/*`, `*/`, and all the `sign=` operators also don't allow whitespace.

New lines are significant for single-line comments (and macro definitions as you said).

Spaces are significant in escape sequences as well - "\n" is different from "\ n". Also, newlines are not allowed inside string or char literals at all.

Spaces are also significant for an obscure feature which was officially removed in C23: digraphs and trigraphs. Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters. For example, this was a valid program:

  int main() ??<
    return 0;
  %>
This also could be detected, as `??/` represents the `\` character, which, if it appears at the end of a single line comment, makes the next line part of the comment.

So, the following function will tell you at runtime if the compiler supported trigraphs:

  int supportsTrigraphs() {
    // detector ??/
    return 0;
    return 1;
  }


There are some more examples:

• In the stringification via `#`, whitespaces are significant but any run will be converted to a single canonical space character.

• Macros can only be redefined when conflicting definitions are identical, where all whitespace separations are considered identical and trailing whitespaces are ignored. So you can put multiple copies of `#define FOO 3 + 4` with differing number of whitespaces, but `#define FOO 3+4` is not considered identical.
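
Both points can be sketched in a few lines of C. `STR`, `FOO` and the function names are made up for illustration; the conflicting redefinition is left in a comment because it is a constraint violation.

```c
/* Hypothetical names; a sketch of the two bullet points above. */
#define STR(x) #x

/* Benign redefinition: same tokens, whitespace in the same places,
   only the amount of whitespace differs. */
#define FOO 3 + 4
#define FOO 3   +   4
/* "#define FOO 3+4" would not count as identical: the separation differs. */

/* Runs of whitespace inside the argument collapse to one space,
   leading and trailing whitespace is dropped. */
const char *spaced_wide(void)   { return STR(a   +   b); }
const char *spaced_padded(void) { return STR(  a + b  ); }
const char *unspaced(void)      { return STR(a+b); }
int foo_value(void)             { return FOO; }
```

The first two functions yield the same string `a + b`, while the third yields `a+b`, so stringification can observe whether whitespace was present but not how much.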

> Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.

There is some interesting surprise here because the tokenization specifically looks for the preprocessing number (the `pp-number` non-terminal), which is a superset of both integers and floating points. There exists a set of preprocessing-only numerals, like `0xe+4`, which become a syntax error once they get past the preprocessor.
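
A quick sketch of the pp-number surprise, with made-up function names. The failing case stays in a comment since it does not compile.

```c
/* Hypothetical names; illustrates how pp-number tokenization bites. */

/* Three tokens: 0xe, +, 4. */
int with_spaces(void) { return 0xe + 4; }

/* Still three tokens: "f" followed by a sign does not extend a pp-number. */
int no_spaces_f(void) { return 0xf+4; }

/* "e" followed by a sign does extend it, so 0xe+4 is one pp-number
   that is not a valid constant. This line would be rejected:
   int no_spaces_e(void) { return 0xe+4; }
*/
```

So `0xe + 4` is 18 and `0xf+4` is 19, but `0xe+4` needs the spaces to mean anything.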

> Spaces are significant in escape sequences as well - "\n" is different from "\ n".

This is most visible when the space is between `\` and a newline. GCC and Clang accept this as an extension but will issue the "backslash and newline separated by space" warning.

> Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters.

Note that digraphs (`%>` here) are still allowed. Trigraphs were problematic as they were very early textual replacements (even earlier than the tokenization!), while digraphs are just separate tokens with the same semantics.
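
For illustration, a sketch of a function written entirely with digraphs (the function name is made up). Since `<%`, `%>`, `<:` and `:>` are ordinary tokens, this should be accepted by any compiler from C95 onward.

```c
/* Hypothetical example: <% %> stand for { }, and <: :> for [ ]. */
int second_element(void) <%
    int values<:3:> = <% 10, 20, 30 %>;
    return values<:1:>;
%>
```

Splitting a digraph with a space, as in `< %`, gives two unrelated tokens instead.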


Link gives me JSON, not HTML?

The JSON appears to mention that this is a regression affecting `U"string"` where U is a macro (that expands to a string literal).

Obviously there are numerous examples of where whitespace always mattered even in prior versions.


... puts the C in Cthulhu.


You're damn straight I do!


Not in else case.


White Space Matters



