
> It should be noted that Unicode uses 2 bytes for each character.

But you're programming "with Ubuntu", not Windows. IMHO you could safely assume/recommend UTF-8.

Just a reminder BTW that since version 2.0 (1996), Unicode is not an encoding scheme but a character set (I avoid the confusing “charset” word on purpose). Therefore, Unicode does not use any number of bytes: it only assigns code points to characters.
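
A quick Python sketch to illustrate (standard library only): a character has one code point, but how many bytes it occupies depends entirely on which encoding you choose.

    # One code point, different byte counts under different encodings.
    s = "é"                               # code point U+00E9
    print(hex(ord(s)))                    # 0xe9 — the code point itself, no bytes involved
    print(len(s.encode("utf-8")))         # 2 bytes
    print(len(s.encode("utf-16-le")))     # 2 bytes
    print(len(s.encode("utf-32-le")))     # 4 bytes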

Windows used to use the UCS-2 encoding scheme, which indeed used 2 bytes for each character, but since Windows 2000 it has used UTF-16 instead, which, like UTF-8, uses a variable number of bytes per character.
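
You can see UTF-16's variable width directly with a quick Python check ('𝄞' is U+1D11E MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane):

    print(len("A".encode("utf-16-le")))   # 2 bytes: one 16-bit code unit
    print(len("𝄞".encode("utf-16-le")))   # 4 bytes: U+1D11E needs a surrogate pair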


Indeed. "Unicode" is an abstract character set; it doesn't "use" any bytes. A specific encoding does.


Even with UTF-16, that quote is incorrect because of surrogate pairs. It's only correct for UCS-2, and even then only if you take 'characters' to mean 'code points' and 'Unicode' to mean 'a specific Unicode encoding'.
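
To make that concrete, here's a small Python sketch that exposes the two raw UTF-16 code units behind U+1D11E:

    import struct

    data = "𝄞".encode("utf-16-le")        # U+1D11E encoded as 4 bytes
    units = struct.unpack("<2H", data)    # two little-endian 16-bit code units
    print([hex(u) for u in units])        # ['0xd834', '0xdd1e'] — high and low surrogates

One code point, two code units, four bytes — so "2 bytes per character" fails on every count.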
