Hacker News

The wchar_t thing is made much worse by disagreements on what type that actually is. On Win32, it's a 16-bit type holding UTF-16 code units (characters outside the BMP take two units, a surrogate pair). But on other compilers and operating systems, such as Linux and macOS, wchar_t is typically a 32-bit type holding a full code point.
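One quick way to observe the difference, assuming a Python environment where ctypes reflects the platform C library's wchar_t:

```python
import ctypes

# sizeof(wchar_t) as seen through the platform's C ABI:
# 2 bytes on Windows (UTF-16 code units), typically 4 on
# Linux/macOS (UTF-32 code points).
size = ctypes.sizeof(ctypes.c_wchar)
print(size)
assert size in (2, 4)
```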

Another problem with UTF-16 on Windows is that it does not enforce that surrogate pairs are properly matched. You can have valid filenames or passwords that cannot be encoded in UTF-8. The solution was to create another encoding system called "WTF-8" that allows unmatched surrogate pairs to survive a round trip to and from UTF-16.
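A sketch of the round-trip problem in Python, whose 'surrogatepass' codec error handler gives roughly the WTF-8 behavior of letting lone surrogates through:

```python
lone = "\ud800"  # an unpaired high surrogate, legal in Windows filenames

# Strict UTF-8 refuses to encode it...
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 refuses lone surrogates")

# ...but 'surrogatepass' encodes it the way WTF-8 does, so it survives
# a round trip to and from (possibly ill-formed) UTF-16.
wtf8 = lone.encode("utf-8", "surrogatepass")
print(wtf8)  # b'\xed\xa0\x80'
utf16 = lone.encode("utf-16-le", "surrogatepass")
back = utf16.decode("utf-16-le", "surrogatepass")
assert back.encode("utf-8", "surrogatepass") == wtf8
```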



WTF-8 barely qualifies as "another encoding system" - it's a trivial superset of UTF-8 that omits the rule forbidding the encoding of surrogate code points.

Imo that artificial restriction in UTF-8 is the problem.
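The superset relationship is easy to check in Python, using 'surrogatepass' as a stand-in for the WTF-8 rule: for any well-formed string the two encodings are byte-identical, and only lone surrogates exercise the relaxation.

```python
text = "héllo, 世界"
# Identical bytes for valid Unicode; the encodings only diverge
# when lone surrogates appear in the input.
assert text.encode("utf-8", "surrogatepass") == text.encode("utf-8")
print("identical for valid Unicode")
```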


I think the problem is believing that one character set or character encoding is suitable for everything, and that it has one definition. Neither is true.

Sometimes the restriction is appropriate, but sometimes a variant without the restriction is appropriate, and sometimes Unicode is not appropriate at all. The "artificial restriction" in UTF-8 is legitimate (surrogate code points are not valid Unicode scalar values), but it should not apply to all kinds of uses; the problem is programs that enforce it where it shouldn't be enforced, because of limitations in their design.

I think that treating file names and passwords as sequences of bytes is better, and that making file names and passwords case-sensitive is also better.
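Python's 'surrogateescape' error handler (PEP 383) is one way to treat file names as opaque byte sequences while still passing them through string APIs; this sketch shows an arbitrary non-UTF-8 byte surviving a round trip:

```python
raw = b"report-\xff.txt"  # not valid UTF-8

# Decoding smuggles the bad byte into a lone surrogate, U+DCFF...
name = raw.decode("utf-8", "surrogateescape")
print(name)  # 'report-\udcff.txt'

# ...and encoding restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw
```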

However, I think "WTF-8" specifically means that mismatched surrogates can be encoded, in case you want to convert to/from invalid UTF-16. Sometimes you might use a different variant of UTF-8, one that can go beyond the Unicode range, or encode null characters without null bytes, etc. Sometimes it is better to use different Unicode encodings, or different non-Unicode encodings (which cannot necessarily be converted to Unicode; don't assume that you can or should convert them). And sometimes it is enough to care only that the text is ASCII (or some extension of ASCII, without caring which one), or to not care about character encoding at all.
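One such variant is Java's "Modified UTF-8", which encodes U+0000 as the overlong pair 0xC0 0x80 so that encoded strings never contain a null byte. A minimal sketch (the helper names are mine, not from any library):

```python
def encode_modified_utf8(s: str) -> bytes:
    # 0xC0 never occurs in well-formed UTF-8, so this substitution
    # is unambiguous and reversible.
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_modified_utf8(b: bytes) -> str:
    return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

data = encode_modified_utf8("a\x00b")
print(data)  # b'a\xc0\x80b'
assert b"\x00" not in data
assert decode_modified_utf8(data) == "a\x00b"

# Strict UTF-8 rejects the overlong form:
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("strict UTF-8 rejects the overlong NUL")
```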


Is this just a really good joke, or something real? I enjoyed it, regardless!


It was previously discussed at https://news.ycombinator.com/item?id=9611710



