This demands the question - why allow control characters in file names at all? N...

fnj · on March 3, 2017

There simply is no good reason. I have had this discussion, and there is a group of people who consider it "unclean" to rule out any characters.

There is an excellent discussion of the topic[0]. I find it utterly definitive in the way it relentlessly shows how you can't "fix" this issue completely any other way than by ruling out the bad characters by making the kernel disallow them.

[0] https://www.dwheeler.com/essays/fixing-unix-linux-filenames....

eriknstr · on March 3, 2017

Ruling out "bad" characters is bound to affect internationalization negatively.

IMO the best approach would be to separate between file name and the file object. When I edit a file with vim, should vim really need to know the name of the file? No. Likewise for a lot of other utilities as well. If instead of being so focused on file names and paths everywhere and we operated instead mainly on inodes then I think much would have been won. Now in some instances the file name is of interest to the program itself, for example if you attach a file to an e-mail, upload it with a web browser, tar a directory, etc. but in all of these instances I think that the file name should be more separate and even for most programs that want the file name they should just treat the file name as a collection of bytes that have close to no meaning.

In other words, I would want to translate paths and file names into inodes in just a very select few places and then keep them separate.

This is what I am going to do in my Unix-derived operating system. I will get around to implementing said system probably never but you know, one can dream.

int_19h · on March 3, 2017

"Bad" characters in this context is control characters. So no, it would not affect internationalization at all.

webmaven · on March 5, 2017

> "Bad" characters in this context is control characters.

For sufficiently narrow definitions of "bad", sure.

It is probably a bad idea to allow mixed character-set filenames, as that allows homograph attacks[0], and there are other non-control characters like the zero-width space and it's brethren that should be disallowed across the board.

In English you probably also want to disallow ligature characters like ﬁ, ﬄ.[1]

There are other "good idea" limitations that may affect internationalization of various languages (not in terms of making it difficult, just constraining it, as in the above English ligature example).

For example, it is probably a good idea to disallow Hebrew diacritic symbols[2] like niqqud in filenames.

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack

[1] https://en.wikipedia.org/wiki/Typographic_ligature#Ligatures...

[2] https://en.wikipedia.org/wiki/Hebrew_diacritics

int_19h · on March 5, 2017

> It is probably a bad idea to allow mixed character-set filenames

As someone who would be affected by this directly, I can tell you right away this rule would be a no-go. I plainly need the ability to mix Latin and Cyrillic characters in my filenames. A filesystem or OS that wouldn't let me do so wouldn't even be considered.

A very simple rule of thumb is, if it is a title of a book (or a song, or a film etc), it should also be a valid filename.

webmaven · on March 5, 2017

Interesting. Are those filenames used in contexts where homographs could be more than a minor annoyance?

int_19h · on March 5, 2017

BTW, whether a character is or isn't a homograph depends very much on the font. For example, Cyrillic letter 'и' has no obvious visual counterpart in Latin... but as soon as you use cursive, it becomes 'и'. Which is visually indistinguishable from cursive 'u': 'u'.

Same thing with the letter 'т': in cursive, it becomes 'т', which in many (but not all) fonts looks the same as cursive 'm'.

int_19h · on March 5, 2017

I guess? If a malicious attacker could gain access to my FS and create homographs, figuring out which is which while browsing the filesystem would be non-trivial.

But I find it an unlikely attack vector to begin with. The main concern with homographs is in URLs and other external resources.

webmaven · on March 9, 2017

I guess I am thinking of contexts like uploaded files, or shared network drives.

eriknstr · on March 3, 2017

I missed the portion where that was said. I read through all of the comments about this and replied to the last one thinking we were still talking about things like spaces and newlines. My comment would have been better placed as a reply elsewhere. Sorry about that.

Still though, even if we only block some control characters, doing so could lead to problems with future character encodings.

Personally I hope UTF-8 / UTF-16 / UTF-32 is the final set of character encodings but we can't know that it will be.