Whitespace in this context is used as a token separator. Where you place the space is of course important. --i is very different from - -i but the same as -- i.
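A quick C illustration (variable names are mine):

#include <stdio.h>

int main(void)
{
    int i = 5, j = 5, k = 5;
    printf("%d\n", --i);   /* pre-decrement: prints 4 */
    printf("%d\n", - -j);  /* two unary minuses: prints 5, j unchanged */
    printf("%d\n", -- k);  /* still pre-decrement despite the space: prints 4 */
    return 0;
}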
When people say "whitespace matters", what they mean is that the meaning changes when you have more than one whitespace character, not merely more than zero.
It’s not a token separator in the "#define foo(x)" example. You never get a "foo(" token, regardless of whitespace, but whitespace decides whether it’s a function-like macro definition or a constant definition. The standard needed to invent a new token called lparen, which is a left parenthesis not immediately preceded by whitespace, to make this make any sense.
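A minimal sketch of the difference (the macro names are mine):

#include <stdio.h>

/* '(' immediately follows the name: a function-like macro with one parameter */
#define TWICE(x) ((x) + (x))

/* a space before '(': an object-like macro whose replacement text
   starts with "(y)" and which takes no arguments at all */
#define TWICE_OBJ (y) ((y) + (y))

int main(void)
{
    printf("%d\n", TWICE(3));   /* expands to ((3) + (3)) and prints 6 */
    /* TWICE_OBJ(3) would expand to (y) ((y) + (y))(3), which doesn't compile */
    return 0;
}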
I think most programming languages have syntactically significant whitespace. I believe Fortran doesn't (or didn't), which helped a bug fly under the radar at NASA:
DO 10 I=1.10
Which got interpreted as:
DO10I = 1.10
Whereas the programmer wanted:
DO 10 I=1,10
For a DO loop. Conversely, with syntactically significant whitespace, a language will evaluate these two expressions differently:
And not everything that is possible is worth doing; e.g. I once designed a language ("Leazy") where keywords aren't reserved and can be used as variable names, just to show I could still write an LL(1) recursive descent parser for it. You don't want that in anything for daily use, as it introduces confusion.
Even more fun are zero-length names. SQLite doesn't require table and column names to be at least one character, so you can do this:
CREATE TABLE []([] []);
Which will create a table with a zero-length name containing one column with a zero-length name and a zero-length type. And yes, you can run all the regular SQL against them, provided you quote the zero-length name.
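A small sketch against the SQLite C API, assuming the bracket quoting shown above is accepted everywhere an identifier can appear:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;

    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE []([] []);", 0, 0, 0);    /* zero-length table, column and type names */
    sqlite3_exec(db, "INSERT INTO [] VALUES (42);", 0, 0, 0);

    /* regular SQL works as long as the zero-length names stay quoted */
    if (sqlite3_prepare_v2(db, "SELECT [] FROM [];", -1, &stmt, 0) == SQLITE_OK) {
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%d\n", sqlite3_column_int(stmt, 0));
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
    return 0;
}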
> where keywords […] can be used as variable names
PL/I has that, too, because the designers thought you couldn’t expect programmers to know all keywords. For PL/I, that’s a correct assumption. Implementations can have hundreds of keywords, and some of them are single-letter (http://bitsavers.trailing-edge.com/pdf/ibm/series1/GC34-0084... pages 19-25 mention A, B, E, F, P, R, S, V and X)
> A further ten character sequences are restricted keywords: open, module, requires, transitive, exports, opens, to, uses, provides, and with. These character sequences are tokenized as keywords solely where they appear as terminals in the ModuleDeclaration, ModuleDirective, and RequiresModifier productions (§7.7).
C# is probably the king of contextual keywords; it has dozens of them. They started, I believe, with `yield return` and `yield break`, then added all of the LINQ ones (select, from, where, etc.), and kept adding others (notnull, record, required, etc.).
I think the way F# implemented it is pretty good: if you want to use a keyword or whitespace in a variable or function name, it has to be enclosed in double backticks.
e.g.:
[<Property>]
let ``Reverse of reverse of a list is the original list`` (xs:list<int>) =
    List.rev(List.rev xs) = xs
I'm not sure if F#'s implementation is good enough. The specification [1] suggests that it doesn't do any additional normalization to the resulting identifier, so otherwise identical identifiers with a single space, a single tab and two spaces would all be different. I would expect them to collapse into a single space character.
Mostly because it's rare to see identifiers that allow whitespace at all, so this feature has to fulfill some specific needs.
The main use case here seems to be self-describing properties that will hardly be referenced elsewhere. In principle this doesn't really need any new syntax, as you can put the description into the attribute (`[<Property>]` here) and the name itself can remain arbitrary or even be made anonymous. But we've got textual identifiers instead. So I guess some code does refer to those textual identifiers, and if that's the case, being able to ignore invisible differences when comparing textual identifiers looks like a good idea as well.
Writing shell scripts under the assumption that filenames do not contain spaces is a liberating experience. I want more of that! And it is nearly possible, by tr ' ' 0x00A0'ing every call to fopen (probably as an option for mount).
Fortran '90 and later have some requirements for blanks, but a parser that also needs to be able to parse F'77 can't rely on them -- so I have to go out of my way to detect missing blanks and complain about them.
This feature makes some tokenization ambiguous without context -- is MODULEPROCEDUREFOO to be interpreted as "MODULE PROCEDUREFOO" or "MODULE PROCEDURE FOO"? But tokenization without any reserved words is a tricky problem anyway.
Would syntax highlighting have saved them here? That's how I usually spot any subtle typo bugs I write. I've never seriously used Fortran, but a line looking a little bit off usually draws my eye to it.
R"(x)" literals are neat not just because whitespaces matter, but also because they are tokenized before macro expansion. Thus you can write a C23 detector like this:
#include <stdio.h>
#define r(R) R"()"
int main()
{
    puts(r()[0] ? "C99" /* r() evaluates to "()" */
                : "C23" /* r() evaluates to "" */);
}
This is checking for the presence of raw string literals (a GNU C extension), not C23. If you compile with `-std=gnu99` instead of `-std=c99`, you'll get "C23" as output.
(Or, is your way a "But, where's the fun in that?" exercise? Actually, if you're an ioccc entrant, "where's the fun in that?" does suddenly become a leading hypothesis ;-)
> Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of programming text
I can think of only one exception. In function-like macro definitions, the opening parenthesis `(` must directly follow the identifier. Though I guess the newline is significant in macro definitions in general, too.
Most pervasively, between any two letters or digits. `unsignedint` is different from `unsigned int`, as is `inta` from `int a`. Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.
`&&`, `||`, `<<`, `>>`, `++`, `--`, `//`, `/*`, `*/`, and all the compound-assignment (`op=`) operators also don't allow whitespace inside them.
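A small demonstration of how this interacts with greedy (maximal-munch) tokenization; the variable names are mine:

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;
    int x = a+++b;   /* tokenized greedily as a++ + b: x is 3, a becomes 2 */
    int y = a + +b;  /* with a space: a plus unary-plus b, so y is 4 */
    printf("%d %d\n", x, y);
    return 0;
}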
New lines are significant for single-line comments (and macro definitions as you said).
Spaces are significant in escape sequences as well - "\n" is different from "\ n". Also, in string and char literals, unescaped newlines are not allowed at all.
Spaces are also significant for an obscure feature which was officially removed in C23: digraphs and trigraphs. Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters. For example, this was a valid program:
int main() ??<
return 0;
%>
This also could be detected, as `??/` represents the `\` character, which, if it appears at the end of a single line comment, makes the next line part of the comment.
So, the following function will tell you at runtime whether the compiler supports trigraphs:
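Something along these lines (a minimal sketch; the function name is made up):

#include <stdio.h>

int trigraphs_supported(void)
{
    // if trigraphs are replaced, the next line joins this comment ??/
    return 0;
    return 1;
}

int main(void)
{
    puts(trigraphs_supported() ? "trigraphs supported" : "no trigraphs");
    return 0;
}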
• In stringification via `#`, whitespace is significant, but any run of it is converted to a single canonical space character (see the sketch after this list).
• Macros can only be redefined when the conflicting definitions are identical, where all whitespace separations are considered identical and leading and trailing whitespace is ignored. So you can have multiple copies of `#define FOO 3 + 4` with differing amounts of whitespace, but `#define FOO 3+4` is not considered identical.
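A sketch of the first point (the macro name is mine):

#include <stdio.h>

#define STR(x) #x

int main(void)
{
    puts(STR(3    +    4));  /* interior runs collapse: prints "3 + 4" */
    puts(STR(3+4));          /* no whitespace to begin with: prints "3+4" */
    return 0;
}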
> Similarly, `1 1` is not the same as `11`, `0x 1` is not `0x1`, and `0. 0` is not `0.0`.
There is an interesting surprise here, because tokenization specifically looks for a preprocessing number (the `pp-number` non-terminal), which is a superset of both integer and floating-point constants. There is a set of preprocessing-only numerals, like `0xe+4`, which become a syntax error once they get past the preprocessor.
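A quick illustration (the second declaration is deliberately left commented out):

#include <stdio.h>

int main(void)
{
    int a = 0xe + 4;   /* three tokens: 0xe, +, 4 -> 18 */
    /* int b = 0xe+4;     one pp-number that never becomes a valid token,
                          so this line would not compile */
    printf("%d\n", a);
    return 0;
}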
> Spaces are significant in escape sequences as well - "\n" is different from "\ n".
This is most visible when the space is between `\` and a newline. GCC and Clang accept this as an extension but will issue a "backslash and newline separated by space" warning.
> Before C23, sequences like `??/` (but not `? ? /`) were alternative ways of spelling many other characters.
Note that digraphs (`%>` here) are still allowed. Trigraphs were problematic as they were very early textual replacements (even earlier than the tokenization!), while digraphs are just separate tokens with the same semantics.
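For example, this is still a valid standard C program using only digraphs:

%:include <stdio.h>

int main(void) <%
    int a<:1:> = <% 42 %>;   /* <: :> stand for [ ], <% %> for { }, %: for # */
    printf("%d\n", a<:0:>);
    return 0;
%>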