Whitespace in this context is used as a token separator. Where you place the space is of course important. --i is very different from - -i but the same as -- i.
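A quick C illustration (variable names are mine):

#include <stdio.h>

int main(void)
{
    int i = 5, j = 5, k = 5;
    printf("%d\n", --i);   /* pre-decrement: prints 4 */
    printf("%d\n", - -j);  /* two unary minuses: prints 5, j unchanged */
    printf("%d\n", -- k);  /* still pre-decrement despite the space: prints 4 */
    return 0;
}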
When people say "whitespace matters", what they mean is that the meaning changes when you have more than one whitespace character, not merely more than zero.
It’s not a token separator in the "#define foo(x)" example. You never get a "foo(" token, regardless of whitespace, but whitespace decides whether it’s a function-like macro definition or a constant definition. The standard needed to invent a new token called lparen, which is a left parenthesis not immediately preceded by whitespace, to make this make any sense.
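A minimal sketch of the difference (the macro names are mine):

#include <stdio.h>

/* '(' immediately follows the name: a function-like macro with one parameter */
#define TWICE(x) ((x) + (x))

/* a space before '(': an object-like macro whose replacement text
   starts with "(y)" and which takes no arguments at all */
#define TWICE_OBJ (y) ((y) + (y))

int main(void)
{
    printf("%d\n", TWICE(3));   /* expands to ((3) + (3)) and prints 6 */
    /* TWICE_OBJ(3) would expand to (y) ((y) + (y))(3), which doesn't compile */
    return 0;
}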
I think most programming languages have syntactically significant whitespace. I believe Fortran doesn't (or didn't), which helped a bug fly under the radar at NASA:
DO 10 I=1.10
Which got interpreted as:
DO10I = 1.10
Whereas the programmer wanted:
DO 10 I=1,10
For a DO loop. Conversely, with syntactically significant whitespace, a language will evaluate these two expressions differently:
And not everything that is possible is worth doing; e.g. I once designed a language ("Leazy") where keywords aren't reserved and can be used as variable names, just to show I could still write an LL(1) recursive descent parser for it. You don't want that in anything for daily use, as it introduces confusion.
Even more fun are zero-length names. SQLite doesn't require table and column names to be at least one character, so you can do this:
CREATE TABLE []([] []);
Which will create a table with a zero-length name containing one column with a zero-length name and a zero-length type. And yes, you can run all the regular SQL against them, provided you quote the zero-length name.
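A small sketch against the SQLite C API, assuming the bracket quoting shown above is accepted everywhere an identifier can appear:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;

    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE []([] []);", 0, 0, 0);    /* zero-length table, column and type names */
    sqlite3_exec(db, "INSERT INTO [] VALUES (42);", 0, 0, 0);

    /* regular SQL works as long as the zero-length names stay quoted */
    if (sqlite3_prepare_v2(db, "SELECT [] FROM [];", -1, &stmt, 0) == SQLITE_OK) {
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%d\n", sqlite3_column_int(stmt, 0));
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
    return 0;
}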
> where keywords […] can be used as variable names
PL/I has that, too, because the designers thought you couldn’t expect programmers to know all keywords. For PL/I, that’s a correct assumption. Implementations can have hundreds of keywords, and some of them are single-letter (http://bitsavers.trailing-edge.com/pdf/ibm/series1/GC34-0084... pages 19-25 mention A, B, E, F, P, R, S, V and X)
> A further ten character sequences are restricted keywords: open, module, requires, transitive, exports, opens, to, uses, provides, and with. These character sequences are tokenized as keywords solely where they appear as terminals in the ModuleDeclaration, ModuleDirective, and RequiresModifier productions (§7.7).
C# is probably the king of contextual keywords; it has dozens of them. They started, I believe, with `yield return` and `yield break`, then added all of the LINQ ones (select, from, where, etc.), and kept adding others (notnull, record, required, etc.).
I think the way F# implemented it is pretty good: if you want to use a keyword or whitespace in a variable or function name, it has to be enclosed in double backticks.
e.g.:
[<Property>]
let ``Reverse of reverse of a list is the original list`` (xs:list<int>) =
    List.rev(List.rev xs) = xs
I'm not sure if F#'s implementation is good enough. The specification [1] suggests that it doesn't do any additional normalization to the resulting identifier, so otherwise identical identifiers with a single space, a single tab and two spaces would all be different. I would expect them to collapse into a single space character.
Mostly because it's rare to see identifiers that allow whitespace at all, so this feature has to fulfill some specific needs.
The main use case here seems to be self-describing properties that will hardly be referenced elsewhere. In principle this doesn't really need any new syntax, as you can put the description into the attribute (`[<Property>]` here) and the name itself can remain arbitrary or even be made anonymous. But we've got textual identifiers instead. So I guess some code does refer to those textual identifiers, and if that's the case, being able to ignore invisible differences when comparing textual identifiers looks like a good idea as well.
Writing shell scripts under the assumption that filenames do not contain spaces is a liberating experience. I want more of that! And it is nearly possible, by tr ' ' 0x00A0'ing every call to fopen (probably as an option for mount).
Fortran '90 and later have some requirements for blanks, but a parser that also needs to be able to parse F'77 can't rely on them -- so I have to go out of my way to detect missing blanks and complain about them.
This feature makes some tokenization ambiguous without context -- is MODULEPROCEDUREFOO to be interpreted as "MODULE PROCEDUREFOO" or "MODULE PROCEDURE FOO"? But tokenization without any reserved words is a tricky problem anyway.
Would syntax highlighting have saved them here? That's how I usually spot any subtle typo bugs I write. I've never seriously used Fortran, but a line looking a little bit off usually draws my eye to it.
R"(x)" literals are neat not just because whitespaces matter, but also because they are tokenized before macro expansion. Thus you can write a C23 detector like this:
#include <stdio.h>
#define r(R) R"()"
int main()
{
    puts(r()[0] ? "C99" /* r() evaluates to "()" */
                : "C23" /* r() evaluates to "" */);
}
This is checking for the presence of raw string literals (a GNU C extension), not C23. If you compile with `-std=gnu99` instead of `-std=c99`, you'll get "C23" as output.
(Or, is your way a "But, where's the fun in that?" exercise? Actually, if you're an ioccc entrant, "where's the fun in that?" does suddenly become a leading hypothesis ;-)
> Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of programming text
I can think of only one exception. In function-like macro definitions, the opening parenthesis `(` must directly follow the identifier. Though I guess the newline is significant in macro definitions in general, too.
Most pervasively, between any two letters or digits. `unsignedint` is different from `unsigned int`, as is `inta` from `int a`. Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.
`&&`, `||`, `<<`, `>>`, `++`, `--`, `//`, `/*`, `*/`, and all the compound-assignment (`op=`) operators also don't allow whitespace inside them.
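A small demonstration of how this interacts with greedy (maximal-munch) tokenization; the variable names are mine:

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;
    int x = a+++b;   /* tokenized greedily as a++ + b: x is 3, a becomes 2 */
    int y = a + +b;  /* with a space: a plus unary-plus b, so y is 4 */
    printf("%d %d\n", x, y);
    return 0;
}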
New lines are significant for single-line comments (and macro definitions as you said).
Spaces are significant in escape sequences as well - "\n" is different from "\ n". Also, in string and char literals, unescaped newlines are not allowed at all.
Spaces are also significant for an obscure feature which was officially removed in C23: digraphs and trigraphs. Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters. For example, this was a valid program:
int main() ??<
return 0;
%>
This also could be detected, as `??/` represents the `\` character, which, if it appears at the end of a single line comment, makes the next line part of the comment.
So, the following function will tell you at runtime whether the compiler supports trigraphs:
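Something along these lines (a minimal sketch; the function name is made up):

#include <stdio.h>

int trigraphs_supported(void)
{
    // if trigraphs are replaced, the next line joins this comment ??/
    return 0;
    return 1;
}

int main(void)
{
    puts(trigraphs_supported() ? "trigraphs supported" : "no trigraphs");
    return 0;
}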
• In stringification via `#`, whitespace is significant, but any run of it is converted to a single canonical space character (see the sketch after this list).
• Macros can only be redefined when the conflicting definitions are identical, where all whitespace separations are considered identical and leading and trailing whitespace is ignored. So you can have multiple copies of `#define FOO 3 + 4` with differing amounts of whitespace, but `#define FOO 3+4` is not considered identical.
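A sketch of the first point (the macro name is mine):

#include <stdio.h>

#define STR(x) #x

int main(void)
{
    puts(STR(3    +    4));  /* interior runs collapse: prints "3 + 4" */
    puts(STR(3+4));          /* no whitespace to begin with: prints "3+4" */
    return 0;
}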
> Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.
There is an interesting surprise here, because tokenization specifically looks for a preprocessing number (the `pp-number` non-terminal), which is a superset of both integer and floating-point constants. There is a set of preprocessing-only numerals, like `0xe+4`, which become a syntax error once they get past the preprocessor.
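A quick illustration (the second declaration is deliberately left commented out):

#include <stdio.h>

int main(void)
{
    int a = 0xe + 4;   /* three tokens: 0xe, +, 4 -> 18 */
    /* int b = 0xe+4;     one pp-number that never becomes a valid token,
                          so this line would not compile */
    printf("%d\n", a);
    return 0;
}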
> Spaces are significant in escape sequences as well - "\n" is different from "\ n".
This is most visible when the space is between `\` and a newline. GCC and Clang accept this as an extension but will issue a "backslash and newline separated by space" warning.
> Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters.
Note that digraphs (`%>` here) are still allowed. Trigraphs were problematic as they were very early textual replacements (even earlier than the tokenization!), while digraphs are just separate tokens with the same semantics.
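For example, this is still a valid standard C program using only digraphs:

%:include <stdio.h>

int main(void) <%
    int a<:1:> = <% 42 %>;   /* <: :> stand for [ ], <% %> for { }, %: for # */
    printf("%d\n", a<:0:>);
    return 0;
%>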