It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++, I don't need more syntax and more complexity. What I do need is more standard library functions that solve these ordinary real-world programming problems.
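For reference, the ICU route from the quote looks roughly like this - a minimal sketch using the C API from `unicode/ustring.h` (buffer sizes and error handling are simplified, and linking details such as `-licuuc` vary by platform):

```cpp
#include <unicode/ustring.h>   // u_strToUpper, u_strToUTF8
#include <unicode/utypes.h>    // UErrorCode, U_SUCCESS
#include <cstdio>

int main() {
    // ICU's C API works on UTF-16 code units (UChar is char16_t in modern ICU).
    const UChar src[] = u"straße";
    UChar upper[64];
    UErrorCode status = U_ZERO_ERROR;

    // "de" selects German casing rules; srcLength of -1 means NUL-terminated.
    u_strToUpper(upper, 64, src, -1, "de", &status);

    // Convert back to UTF-8 so plain stdio can print it.
    char utf8[64];
    u_strToUTF8(utf8, 64, nullptr, upper, -1, &status);

    if (U_SUCCESS(status)) {
        std::puts(utf8);       // STRASSE
    }
}
```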
I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.
Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size by now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features are added on top of its already impressive and very long list of features and capabilities.
You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.
So, C++ is doing fine. It's not that they omitted Unicode during the design phase; Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.
> You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....
But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
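A small illustration of that last point, in case it sounds surprising:

```cpp
#include <type_traits>

// `char` is a distinct type from both `signed char` and `unsigned char`:
static_assert(!std::is_same_v<char, signed char>);
static_assert(!std::is_same_v<char, unsigned char>);

// ...and whether `char` itself is signed is implementation-defined, so this
// may or may not hold depending on the target ABI:
// static_assert(std::is_signed_v<char>);
```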
Being developed in, and having to stay compatible with, ancient times is a real problem of C++.
The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.
Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.
That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).
Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.
Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally (they can use UCS-2 for strings whose code points fit in that range; and while a string might store code points from the surrogate-pair range, those are never interpreted as surrogate pairs, but as an error encoding so that e.g. invalid UTF-8 can be round-tripped), so they never have to worry about surrogate pairs. Python also knows a few things about localized text casing.
C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
Well, languages and conventions change. The € sign was added not that long ago and it was somewhat painful. Chinese uses a single character to refer to each chemical element, so when IUPAC names new elements, new characters have to be invented. Etc.
There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase it should be converted to SZ or the new uppercase ß... which my iPhone doesn't have... and converting anything to uppercase SS isn't something Germany wants...
> There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps.
Allcaps (and smallcaps) have always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and before that, just whatever you could draw. Historically, language was not as standardized as it is today.
[Notice that this is in fact entirely impossible with the naive character-by-character strategy, since Greek cares about the position of symbols: a lowercase sigma is σ in the middle of a word but ς at the end.]
Some of the latter examples aren't cases where a programming language or library should just "do the right thing", but cases of ambiguity where you need locale information to decide what's appropriate -- which isn't "just as wrong as the C++ version", it's a whole other problem. It isn't wrong to capitalise a-acute as a capital A-acute; it's just not always appropriate, depending on the locale.
For display it doesn't matter, but most other applications really want some kind of normalization, which does much, much more; so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.
That doesn’t prevent adding a new function that converts an entire string to upper or lowercase in a Unicode aware way.
What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.
That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?
The reason that wasn't done is that Unicode is not really in older C++ standards. I think it may have been added in C++23, but I am not familiar with that. There are many partial solutions in older C++, but if you want to do it well then you need to get a library for it from somewhere, or else (possibly) wait for a new standard.
Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.
I politely disagree. None of the programming languages that integrated Unicode early were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.
C++ has a far larger target area than most other programming languages. There are widely used libraries that compile correctly on PDP-11s, even though they are updated constantly.
You can't just say "I'll make everything Unicode-aware, backwards compatibility be damned, eh".
But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.
But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.
> Converting one Unicode string to another is a purely in-memory, in-CPU operation.
...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you would with the ASCII table or any other simple encoding.
German has its ß to SS (or to a capital ẞ, depending on the year), Turkish has the ı/I and i/İ pairs, and tons of other languages have other rules.
Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, or how many workarounds I have implemented in my systems.
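For illustration, here's roughly what handling that pair correctly looks like with ICU's C++ API - a sketch using the locale-aware `UnicodeString::toUpper` (linkage details vary by platform):

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <unicode/locid.h>    // icu::Locale
#include <iostream>
#include <string>

int main() {
    std::string out;

    // Default/English rules: "i" uppercases to dotless "I".
    icu::UnicodeString en = icu::UnicodeString::fromUTF8("i");
    en.toUpper(icu::Locale("en"));
    en.toUTF8String(out);              // note: toUTF8String appends
    std::cout << out << '\n';          // I

    out.clear();

    // Turkish rules: "i" uppercases to dotted "İ" (U+0130).
    icu::UnicodeString tr = icu::UnicodeString::fromUTF8("i");
    tr.toUpper(icu::Locale("tr"));
    tr.toUTF8String(out);
    std::cout << out << '\n';          // İ
}
```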
Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that I read you even need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (in complexity, I mean; I guess they have their reasons).
> Unicode is such a complicated system that I read you even need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (in complexity, I mean; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It's because Unicode doesn't allow for language switching.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (Ideographic Variation Sequences) - and I don't think there's any font that actually supports this.
AFAICS (as far as I can search), Simplified (PRC) and Traditional (Taiwan) Chinese encodings are respectively called GB2312 and Big5, and they're both two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed to be used as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, while actually improving language support.
The number of characters is not the problem; the mess due to legacy compatibility is - case folding and normalization could be much simpler if the codepoints were laid out with that in mind. There's also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I do know about emoji chaining for skin color, etc.).
So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)
Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
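A quick way to see that in code (C++20 for the `char8_t` literal; counts exclude the terminating NUL):

```cpp
#include <iostream>

int main() {
    // U+1F600 (😀) lies outside the Basic Multilingual Plane.
    const char8_t  utf8[]  = u8"\U0001F600";  // char8_t literal needs C++20
    const char16_t utf16[] = u"\U0001F600";

    std::cout << sizeof(utf8) - 1 << '\n';                        // 4 UTF-8 code units
    std::cout << sizeof(utf16) / sizeof(char16_t) - 1 << '\n';    // 2 UTF-16 code units (a surrogate pair)
}
```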
Java was built from scratch as a heavy language with a whole portability layer that C++ does not have. Also, libraries have been around to do this stuff in C++, but presumably some people thought it better not to require C++ to support Unicode.
Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In fact it hasn't fully done so yet - GB2312 and TRON are still locally prevalent, and IBM still jogs along with EBCDIC. But in its early days nobody was reasonably sure, and Java's attempt could have failed as well. (All the more so since Java's UCS-2 approach was wrong - as already commented nearby.)
Indeed, ICU as well, and then they all moved to UTF-16, which, again, in the long term lost to UTF-8. My point is that committing to a specific Unicode design 30 years ago was not, in retrospect, necessarily a good idea.
By not committing to UCS-2 early, C++ left the road open to UTF-8. I'll concede that UTF-8 has been the clear winner for more than a decade and that C++ is well past the point where it should have at least basic built-in support. The problem is that there is at least one important C++ platform that only very recently added full support for the encoding in its native API.
"no excuse" -- I would respectfully disagree here. There are lots of very smart people who have worked on Qt. Really, some insanely good C++ programmers have worked on that project. I have no doubt that they have discussed changing class QString to use UTF-8 internally. To be clear, probably QChar would also need to change, or a new class (QChar8?) would be needed, in parallel to QChar. I guess they concluded the API breakage would be too severe. I assume Java and Win32/DotNet decided the same. Finally, you can Google for old mailing list discussions about QString using UTF-16. Many before have asked "can we just change to UTF-8?".
Java embraced Unicode, and ended up with a mess as Unicode changed underneath it.
You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, but it's pretty much required in Java.
Java has 16-bit character types. It is in no way better at modern Unicode than C++ while being needlessly less efficient for mostly-ASCII text like XML-like markup.
> Any tool which is old enough will have a thousand ways to do something.
Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.
Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.
So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?
You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.
> There are so many ways to do something and every way is freaking wrong!
That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.
JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).
Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.
Well, the only time doing str lower will run into Unicode locale-awareness problems is when you do it on user input, like names.
How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.
Yes, except when it is not your choice. If the requirements are to display some strings in lower/uppercase then you need to find a way to do that. That doesn't have to be using the standard library though.
If anything, it should be harder to add things to the language. Too many new additions have been half-arsed and needed to be changed or deprecated soon after.
Yes, significantly smaller libraries have had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.
> Makes you wonder why this isn't part of the C++ standard library itself.
Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.
Libraries are fine, not everything needs to be defined by the language itself.
> Makes you wonder why this isn't part of the C++ standard library itself.
There's plainly no need if there is a separate, easily attachable library (with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.
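To make that concrete - the standard gives you the types but not the algorithms (a small sketch; `char8_t` needs C++20):

```cpp
#include <string>

// The standard gives you the storage types...
std::u8string  a = u8"grüß dich";  // UTF-8 code units (char8_t)
std::u16string b = u"grüß dich";   // UTF-16 code units
std::u32string c = U"grüß dich";   // one char32_t per code point here

// ...but no Unicode-aware case mapping, collation or normalization over
// them; for that you still reach for a library such as ICU.
```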
As a C++ dev, I have never run into the problem the post is describing. Upper and lowercase conversion has always worked just fine. Though then again, I don't fiddle with mixed unicode and non-unicode situations.
That is neither up-casing nor down-casing, but (de)capitalization, which is a significantly more complex task (one that ultimately requires up- or down-casing, but a whole lot more before then).
I am not aware of a Unicode concept of "the Latin letter o followed by an apostrophe followed by another Latin letter". Unicode would identify the glyphs for such a concept, but I don't see how Unicode is involved in any way in the process of deciding what "capitalized o'reilly" means.
The fact that the standard library works against you doesn't help (tolower takes an int, but only really has defined behavior for unsigned char values, and wchar_t is implicitly promoted to int).
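The usual workaround for that pitfall in the byte-wise/ASCII-only case is to go through `unsigned char` explicitly - a minimal sketch:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercases bytes only; correct for ASCII, not Unicode-aware.
std::string lower_bytes(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```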
tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefix headers to use it, which already hints to them that 'here be dragons'.
But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
> Is the alternative that it should be made unusable, and more existing code broken?
It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.