We seem to be talking past one another. What Id like to see banned is non-ascii ...

lifthrasiir · on Nov 10, 2021

Okay, I think I see where you got confused. There are multiple levels of Unicode identifier support and you are probably not aware of all possible levels. Those levels are:

1. Identifiers can contain any octet with the highest bit set. Different octet sequences denote different names.

2. Identifiers can contain any Unicode code point (or scalar value, the fine distinction is not required here) above U+007F. Different (but possibly same-looking) code point sequences denote different names.

3. Identifiers can contain any Unicode code point in a predefined set, or two if the first character and subsequent characters are distinguished. Different code point sequences denote different names.

4. Same to 3, but these predefined sets derive from the Unicode Identifier and Pattern Syntax specification [1]---namely (X)ID_Start/Continue.

5. Same to 4, but now identifiers are normalized according to one of the Unicode normalization algorithms. So some different code point sequences now map to the same name, but only if they are semantically same according to Unicode.

6. Same to 5, but also has a rule to reduce unwanted identifiers. This may include confusable characters, virtually indistinguishable names and names with multiple unrelated scripts. Unicode itself provides many guidelines in the Unicode Security Mechanisms standard [2].

Levels 3, 4 and 5 are most common choices in programming languages. In particular emojis are not allowed for 4, so your example wouldn't work in such languages. For example JavaScript is one of them so `eval('var \u{1f600} = 42')` doesn't work (where U+1F600 is a smiling face). Both Python and Rust are at the level 5. Possibly unexpectedly, both C and C++ are at the level 3. Levels 1 and 2 are rare especially in modern languages; PHP is a famous example of the level 1.

Level 6 is a complex topic and there are varying degrees of implementations (for example Rust partially supports the level 6 via lints), but there is a notable example outside of programming languages: the Internationalized Domain Names. They have very strong constraints because any pair of confusable labels is a security problem. It seems that they have been successful in keeping the security of non-ASCII domains on par with ASCII-only domains, that is, not fully satisfactory but reasonable enough. (If you don't see the security issues of ASCII-only domains, PaypaI and rnastercard are examples of problematic ASCII labels that were never forbidden.)

I argue that the level 3+ is necessary and the level 5+ is desirable for international audiences. The level 5 would for example mean that `var 안녕하세요 = "annyonghaseyo";` (Korean) is allowed but `var (emoji) = "oh no";` is forbidden. I have outlined why the former is required in the last paragraph of [3]. Does my clarified stance make sense to you?

[1] https://unicode.org/reports/tr31/

[2] https://unicode.org/reports/tr39/

[3] https://news.ycombinator.com/item?id=29170954

josteink · on Nov 10, 2021

To be clear I’m completely oblivious to what Unicode identifiers are. As such I’m not talking about them, and they are out of scope wrt to my point.

What I am advocating is that identifiers used for symbols in the programming language (variables-names, function-names, class-names, etc), should be strictly ASCII-based.

That’s simple, understandable and should be a sane default anywhere.

My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.

Sure. Allow it if you must. But you must opt in to use it. It should be a non-default feature everywhere where it’s available.

lifthrasiir · on Nov 10, 2021

> That’s simple, understandable and should be a sane default anywhere.

This is an usual canned reason given to reject any internationalization efforts, and it is likely only "simple, understandable" and "a sane default anywhere" to people like you. As you didn't give why they are simple, understandable in general, I don't see how your arguments are universally applicable.

> My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.

That can be said for about all security issues, not just Unicode. That doesn't make you to avoid writing anything, does it? For the record, it is a valid choice to not write anything, but we normally exclude that choice when we are talking about the technology. And the "bewildering Unicode rule-set" is an one-off thing, as it is not like that Unicode produces incompatible standards every year. (Python 3 adopted Unicode identifiers 14 years ago [1] and implementations never changed, only underlying databases have been updated.)

[1] https://www.python.org/dev/peps/pep-3131/