It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++, I don't need more syntax and more complexity. What I do need is more standard library functions that solve these ordinary real-world programming problems.
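For reference, the ICU route from the quote looks roughly like this - a minimal sketch using the C API from `unicode/ustring.h` (buffer sizes and error handling are simplified, and linking details such as `-licuuc` vary by platform):

```cpp
#include <unicode/ustring.h>   // u_strToUpper, u_strToUTF8
#include <unicode/utypes.h>    // UErrorCode, U_SUCCESS
#include <cstdio>

int main() {
    // ICU's C API works on UTF-16 code units (UChar is char16_t in modern ICU).
    const UChar src[] = u"straße";
    UChar upper[64];
    UErrorCode status = U_ZERO_ERROR;

    // "de" selects German casing rules; srcLength of -1 means NUL-terminated.
    u_strToUpper(upper, 64, src, -1, "de", &status);

    // Convert back to UTF-8 so plain stdio can print it.
    char utf8[64];
    u_strToUTF8(utf8, 64, nullptr, upper, -1, &status);

    if (U_SUCCESS(status)) {
        std::puts(utf8);       // STRASSE
    }
}
```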
I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.
Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size by now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features are added on top of its already impressive and very long list of features and capabilities.
You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.
So, C++ is doing fine. It's not that they omitted Unicode during the design phase; Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.
> You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....
But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
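A small illustration of that last point, in case it sounds surprising:

```cpp
#include <type_traits>

// `char` is a distinct type from both `signed char` and `unsigned char`:
static_assert(!std::is_same_v<char, signed char>);
static_assert(!std::is_same_v<char, unsigned char>);

// ...and whether `char` itself is signed is implementation-defined, so this
// may or may not hold depending on the target ABI:
// static_assert(std::is_signed_v<char>);
```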
Being developed in, and having to stay compatible with, ancient times is a real problem of C++.
The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.
Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.
That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).
Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.
Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally (they can use UCS-2 for strings whose code points fit in that range; and while a string might store code points from the surrogate-pair range, those are never interpreted as surrogate pairs, but as an error encoding so that e.g. invalid UTF-8 can be round-tripped), so they never have to worry about surrogate pairs. Python also knows a few things about localized text casing.
C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
Well, languages and conventions change. The € sign was added not that long ago and it was somewhat painful. Chinese uses a single character to refer to each chemical element, so when IUPAC names new elements, new characters have to be invented. Etc.
There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase it should be converted to SZ or the new uppercase ß... which my iPhone doesn't have... and converting anything to uppercase SS isn't something Germany wants...
> There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps.
Allcaps (and smallcaps) have always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and before that, just whatever you could draw. Historically, language was not as standardized as it is today.
[Notice that this is in fact entirely impossible with the naive character-by-character strategy, since Greek cares about the position of symbols: a lowercase sigma is σ in the middle of a word but ς at the end.]
Some of the latter examples aren't cases where a programming language or library should just "do the right thing", but cases of ambiguity where you need locale information to decide what's appropriate -- which isn't "just as wrong as the C++ version", it's a whole other problem. It isn't wrong to capitalise a-acute as a capital A-acute; it's just not always appropriate, depending on the locale.
For display it doesn't matter, but most other applications really want some kind of normalization, which does much, much more; so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.
That doesn’t prevent adding a new function that converts an entire string to upper or lowercase in a Unicode aware way.
What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.
That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?
The reason that wasn't done is that Unicode is not really in older C++ standards. I think it may have been added in C++23, but I am not familiar with that. There are many partial solutions in older C++, but if you want to do it well then you need to get a library for it from somewhere, or else (possibly) wait for a new standard.
Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.
I politely disagree. None of the programming languages that integrated Unicode early were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.
C++ has a far larger target area than most other programming languages. There are widely used libraries that compile correctly on PDP-11s, even though they are updated constantly.
You can't just say "I'll make everything Unicode-aware, backwards compatibility be damned, eh".
But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.
But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.
> Converting one Unicode string to another is a purely in-memory, in-CPU operation.
...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you would with the ASCII table or any other simple encoding.
German has its ß to SS (or to a capital ẞ, depending on the year), Turkish has the ı/I and i/İ pairs, and tons of other languages have other rules.
Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, or how many workarounds I have implemented in my systems.
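For illustration, here's roughly what handling that pair correctly looks like with ICU's C++ API - a sketch using the locale-aware `UnicodeString::toUpper` (linkage details vary by platform):

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <unicode/locid.h>    // icu::Locale
#include <iostream>
#include <string>

int main() {
    std::string out;

    // Default/English rules: "i" uppercases to dotless "I".
    icu::UnicodeString en = icu::UnicodeString::fromUTF8("i");
    en.toUpper(icu::Locale("en"));
    en.toUTF8String(out);              // note: toUTF8String appends
    std::cout << out << '\n';          // I

    out.clear();

    // Turkish rules: "i" uppercases to dotted "İ" (U+0130).
    icu::UnicodeString tr = icu::UnicodeString::fromUTF8("i");
    tr.toUpper(icu::Locale("tr"));
    tr.toUTF8String(out);
    std::cout << out << '\n';          // İ
}
```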
Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that I read you even need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (in complexity, I mean; I guess they have their reasons).
> Unicode is such a complicated system that I read you even need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (in complexity, I mean; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It's because Unicode doesn't allow for language switching.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (Ideographic Variation Sequences) - and I don't think there's any font that actually supports this.
AFAICS (as far as I can search), Simplified (PRC) and Traditional (Taiwan) Chinese encodings are respectively called GB2312 and Big5, and they're both two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed to be used as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, while actually improving language support.
The number of characters is not the problem; the mess due to legacy compatibility is - case folding and normalization could be much simpler if the codepoints were laid out with that in mind. There's also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I do know about emoji chaining for skin color, etc.).
So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)
Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
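A quick way to see that in code (C++20 for the `char8_t` literal; counts exclude the terminating NUL):

```cpp
#include <iostream>

int main() {
    // U+1F600 (😀) lies outside the Basic Multilingual Plane.
    const char8_t  utf8[]  = u8"\U0001F600";  // char8_t literal needs C++20
    const char16_t utf16[] = u"\U0001F600";

    std::cout << sizeof(utf8) - 1 << '\n';                        // 4 UTF-8 code units
    std::cout << sizeof(utf16) / sizeof(char16_t) - 1 << '\n';    // 2 UTF-16 code units (a surrogate pair)
}
```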
Java was built from scratch as a heavy language with a whole portability layer that C++ does not have. Also, libraries have been around to do this stuff in C++, but presumably some people thought it better not to require C++ to support Unicode.
Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In fact it hasn't fully done so yet - GB2312 and TRON are still locally prevalent, and IBM still jogs along with EBCDIC. But in its early days nobody was reasonably sure, and Java's attempt could have failed as well. (All the more so since Java's UCS-2 approach was wrong - as already commented nearby.)
Indeed, ICU as well, and then they all moved to UTF-16, which, again, in the long term lost to UTF-8. My point is that committing to a specific Unicode design 30 years ago was not, in retrospect, necessarily a good idea.
By not committing to UCS-2 early, C++ left the road open to UTF-8. I'll concede that UTF-8 has been the clear winner for more than a decade and that C++ is well past the point where it should have at least basic built-in support. The problem is that there is at least one important C++ platform that only very recently added full support for the encoding in its native API.
"no excuse" -- I would respectfully disagree here. There are lots of very smart people who have worked on Qt. Really, some insanely good C++ programmers have worked on that project. I have no doubt that they have discussed changing class QString to use UTF-8 internally. To be clear, probably QChar would also need to change, or a new class (QChar8?) would be needed, in parallel to QChar. I guess they concluded the API breakage would be too severe. I assume Java and Win32/DotNet decided the same. Finally, you can Google for old mailing list discussions about QString using UTF-16. Many before have asked "can we just change to UTF-8?".
Java embraced Unicode, and ended up with a mess as Unicode changed underneath it.
You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, but it's pretty much required in Java.
Java has 16-bit character types. It is in no way better at modern Unicode than C++ while being needlessly less efficient for mostly-ASCII text like XML-like markup.
> Any tool which is old enough will have a thousand ways to do something.
Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.
Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.
So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?
You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.
> There are so many ways to do something and every way is freaking wrong!
That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.
JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).
Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.
Well, the only time doing str lower will run into Unicode locale-awareness problems is when you do it on user input, like names.
How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.
Yes, except when it is not your choice. If the requirements are to display some strings in lower/uppercase then you need to find a way to do that. That doesn't have to be using the standard library though.
If anything, it should be harder to add things to the language. Too many new additions have been half-arsed and needed to be changed or deprecated soon after.
Yes, significantly smaller libraries have had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.
> Makes you wonder why this isn't part of the C++ standard library itself.
Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.
Libraries are fine, not everything needs to be defined by the language itself.
> Makes you wonder why this isn't part of the C++ standard library itself.
There's plainly no need if there is a separate, easily attachable library (with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.
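To make that concrete - the standard gives you the types but not the algorithms (a small sketch; `char8_t` needs C++20):

```cpp
#include <string>

// The standard gives you the storage types...
std::u8string  a = u8"grüß dich";  // UTF-8 code units (char8_t)
std::u16string b = u"grüß dich";   // UTF-16 code units
std::u32string c = U"grüß dich";   // one char32_t per code point here

// ...but no Unicode-aware case mapping, collation or normalization over
// them; for that you still reach for a library such as ICU.
```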
As a C++ dev, I have never run into the problem the post is describing. Upper and lowercase conversion has always worked just fine. Though then again, I don't fiddle with mixed unicode and non-unicode situations.
That is neither up-casing nor down-casing, but (de)capitalization, which is a significantly more complex task (one that ultimately requires up- or down-casing, but a whole lot more before then).
I am not aware of a Unicode concept of "the Latin letter o followed by an apostrophe followed by another Latin letter". Unicode would identify the glyphs for such a concept, but I don't see how Unicode is involved in any way in the process of deciding what "capitalized o'reilly" means.
The fact that the standard library works against you doesn't help (tolower takes an int, but only really has defined behavior for unsigned char values, and wchar_t is implicitly promoted to int).
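The usual workaround for that pitfall in the byte-wise/ASCII-only case is to go through `unsigned char` explicitly - a minimal sketch:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercases bytes only; correct for ASCII, not Unicode-aware.
std::string lower_bytes(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```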
tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefix headers to use it, which already hints to them that 'here be dragons'.
But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
> Is the alternative that it should be made unusable, and more existing code broken?
It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.