
>Once ppl learn about localization the questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?



>What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?

Of course! String manipulation on user-entered attributes like display names or chat messages is 1 millimeter away from the good old SQL 'Bobby; drop table students'. Never ever do that if you can avoid it. Every time someone 'just concatenates' two strings, e.g. to add a 'symbol that represents an input button', the programmer creates a bug that will be both annoying and wrong. Games should use substitution patterns guided by the translation team, because there is no ASCII culture among the 15 or so typically supported by big publishers.

There are exceptions, like platform-provided services to filter banned words in chat. And even there you don't have to do 'things with ASCII characters'. Yeah, players will input unsupported symbols everywhere they can, so you need good replacement characters for those and regular fixes to support popular emojis. That is expected by communities now.


> They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 unicode characters outside the basic multilingual plane (BMP).


I'm referring to the people who call case conversion in general "a simple text operation". Say you have an std::string and you want to make it lower case. If you assume it contains just ASCII, that's a simpler operation than if you assume it contains UTF-8, but C++ doesn't provide a single function that does either of them. A person can rightly complain that the former is basic functionality that the language should include; personally, I would agree. And you could say "wow, doesn't this person realize that case conversion in Unicode is actually complicated? They must be really inexperienced." It could be that the other person really doesn't know about Unicode, or it could mean that you and they are thinking about entirely different problems and you're being judgemental a bit too eagerly.


For ascii in C++ isn't there std::tolower / std::toupper? If you're not dealing with unsigned char types there isn't a simple case conversion function, but that's for a good reason as the article lays out.


Those functions take and return single characters. What's missing is functions that operate on strings. You can use them in combination with std::transform(), but as the article points out, even if you're just dealing with ASCII you can easily do it wrong. I've been using C++ for over 20 years and I didn't know tolower() and toupper() were non-addressable. There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
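For what it's worth, here is a minimal sketch of the std::transform() approach mentioned above, assuming the string holds only ASCII and the default "C" locale is active. The name to_lower_in_place is made up for illustration; it is not a standard function.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of the standard library): lower-cases a
// std::string in place, one byte at a time, using the current C locale.
inline void to_lower_in_place(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   // Wrap std::tolower in a lambda instead of taking its address,
                   // and receive each byte as unsigned char so a negative char
                   // value is never passed to it (that would be undefined behavior).
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
}
```

Under those assumptions, calling to_lower_in_place(s) on "Hello, WORLD 123" leaves s as "hello, world 123". The lambda is the part that is easy to get wrong: it avoids both the address-of problem and the signed char problem.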


std::transform() seems like overkill when you can just iterate over the string and modify it in place. And in my opinion, transform is way less readable than seeing a loop over some array with a single operation inside.

The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.

If you are operating on wide strings, there is no suitable single solution, partly because wstring is a terrible type. It's different widths on different platforms, and no string encoding format uses a generalized wstring; they all have mandatory min/max character byte widths. So a wstring tells you nothing about the encoding or the semantics of the string it actually contains.

The C++ stdlib could include a fully unicode aware string type set, and surrounding library. But personally I think C++ isn't the kind of language to provide an opinionated stdlib module for such a complex task. And there's no way to implement such a module without being very opinionated about something.


> The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.

Since you mention narrow strings in the context of wstring, just to make sure... you can't convert a UTF-8 std::string character by character, in-place (in case that's what you meant).

7-bit ASCII code points are fine, but outside that it's not guaranteed that one UTF-8 byte converts into exactly one UTF-8 byte when converting case.
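One concrete pair that illustrates the size change, as a small sketch: U+023A (Ⱥ) takes two bytes in UTF-8, while its lowercase form U+2C65 (ⱥ) takes three. The function names below are hypothetical, and the byte sequences are written out by hand rather than produced by any case-conversion library.

```cpp
#include <cassert>
#include <string>

// U+023A LATIN CAPITAL LETTER A WITH STROKE, hand-encoded as UTF-8 (2 bytes).
inline std::string upper_a_with_stroke() { return "\xC8\xBA"; }

// U+2C65 LATIN SMALL LETTER A WITH STROKE, its lowercase form (3 bytes).
inline std::string lower_a_with_stroke() { return "\xE2\xB1\xA5"; }

// Hypothetical check: an in-place, byte-for-byte rewrite is only possible
// when the converted text occupies exactly as many bytes as the original.
inline bool fits_in_place(const std::string& from, const std::string& to) {
    return from.size() == to.size();
}
```

Here fits_in_place(upper_a_with_stroke(), lower_a_with_stroke()) is false, so lowering this character in place would have to grow the string by one byte.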


Yeah, if you're using narrow strings for UTF8 you're making a mistake. wstrings are also not a good representation because of the platform differences, unless you don't care about Windows, in which case it's fine but still not great semantically.

In most type definitions you cannot convert UTF8 via simple iteration because the type generally represents a code point and not a character.

You can have a library where UTF8 characters are a native type and code points are a mostly-hidden internal element. But again, that's highly opinionated for C++.


I'm not 100% sure what you mean by narrow string, but if you refer to std::string vs std::wstring, then std::string is perfectly fine for encoding UTF8, as that uses 8 bit code units which are guaranteed to fit in a char. On the other hand, std::wstring would be a bizarre choice for UTF8 on any platform.


It's not guaranteed for 7-bit ASCII either, because tolower/toupper are locale-dependent, and with the tr_TR locale the lowercase of I (U+0049) is ı (U+0131, aka dotless i), which encodes as two bytes in UTF-8.


That's not ascii then. It's byte width compatible (to a certain degree as you point out). But it's not ascii. ascii defines 128 code points and the handling of an escape character. It doesn't handle locales.


ASCII is an encoding, it doesn't say anything about locale. The point is that tolower/toupper is not guaranteed to be safe even if the input is 7-bit.


I don't think there is any possibility of doing locale-specific lower/upper casing in ASCII. It is really designed for (a subset of) American English.


std::u8string, std::u16string and std::u32string are supposed to be the portable unicode string types, but a lot of machinery is missing and some that has been added has since been deprecated.

> there's no way to implement such a module without being very opinionated about something.

indeed! Boost.Nowide[1] is such an opinionated library.

[1] https://www.boost.org/doc/libs/master/libs/nowide/doc/html/i...


Yep, there's also ICU and utf8cpp, and many others. They all have trade-offs. So I just don't think the stdlib should cover this because there is no objectively best way to handle it.


I know I can simply iterate. The point is that it's a function that should be included, not that it's impossible without it. It's one of the most common string operations.


To me that feels like the JS community asking for left-pad or is-even in a module. Why have a dedicated function for 2 lines of code?

And it's a huge footgun. There is no ascii type in C++. People will use the generalized tolower for UTF8 encoded in narrow strings and have issues.

You could say the generalized tolower should support all the different width/encoding combinations and sort it out. But that's still highly opinionated as far as performance is concerned.

Generalized string conversion is a very complex problem and you really cannot simplify it in a way that will satisfy most C++ users. Just use ICU or utf8cpp if you want to do string operations and don't care what's going on under the hood. But even then I can't recommend just 1 library, because no perfect 3rd party library exists. A perfect first party library definitely could not exist.


>Why have a dedicated function for 2 lines of code?

Then why does std::max() exist?

>People will use the generalized tolower for UTF8 encoded in narrow strings and have issues.

tolower() and toupper() work correctly on UTF-8 strings, because UTF-8 was specifically designed so that non-ASCII characters were represented by sequences of purely non-ASCII bytes.

>Generalized string conversion is a very complex

Hence why people who say C++ should have a tolower() that operates on strings are not asking for more complex Unicode support.
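The UTF-8 point above can be checked with a rough sketch. It uses a hand-written, locale-independent loop instead of tolower(), to sidestep the locale concerns raised elsewhere in the thread; ascii_lower_copy is a made-up name, not a standard function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns a copy of s with only the bytes 'A'..'Z'
// lower-cased. Every other byte, including all bytes of multi-byte UTF-8
// sequences (which are always >= 0x80), is copied through unchanged.
inline std::string ascii_lower_copy(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return s;
}
```

Given the UTF-8 bytes for "CAFÉ OLÉ", this returns the bytes for "cafÉ olÉ": the ASCII letters are lowered and the two-byte É sequences survive intact. The string stays valid UTF-8, but É is not lowered, which is exactly the ASCII-only behavior being asked for.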


> There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.

Could not agree more. Any time I touch a C I want to scoop my brain out of my ear. So many simple unbelievably common operations have fifty "best" ways to do them, when they should have one happy path 99% of usecases require baked in. Nobody should ever have to seriously consider something as ridiculous as "is tolower addressable?".


std::tolower / std::toupper are rubbish functions that can't do proper Unicode but still pull in the bloated locale machinery for what should be a simple conditional integer addition if all you care about is ASCII. Both have no valid use case and should be marked [[deprecated]] and erased from all teaching materials.


> What if your game needs to talk to a server and do some string manipulation in between requests?

What conceivable reason would there be to ever need to do that? If the server takes commands in upper case, then have them in upper case from the start. If the server takes commands in lower case, have them in lower case from the start. If the server specifies that you need to invert the case of its response to use in the next request, find a server developed by someone not crazy.


Case conversion is not the only string manipulation that's locale sensitive.


No reasonable server API should require locale sensitive string manipulation.


Word censoring? Ease of use? Console commands (i.e. from Quake to minecraft)?


Those sound exactly like the newcomer detectors GP was referring to. What you want is a case-insensitive string comparison, and outside ASCII that's not equivalent to just turning both strings to lowercase and checking equality (or doing a substring search or whatever the task requires)
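For the ASCII-only case, a comparison can be written without converting either string first. This is only a sketch: ascii_fold and ascii_iequals are hypothetical names, and as noted above it does nothing useful for non-ASCII text, where a real case-folding library would be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps 'A'..'Z' to 'a'..'z' with a conditional
// integer addition and leaves every other byte untouched.
inline char ascii_fold(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Hypothetical helper: byte-wise equality that ignores case for ASCII
// letters only. No copies of the inputs are made.
inline bool ascii_iequals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_fold(a[i]) != ascii_fold(b[i])) return false;
    }
    return true;
}
```

With this, "Quit" and "qUIT" compare equal, while "STRASSE" and the UTF-8 bytes for "straße" do not, even though a Unicode-aware case-insensitive match would treat them as the same word.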


Exactly and where you want case-insensitive comparison you almost always also want other kinds of Unicode normalization.


> Word censoring?

Should only ever be needed for text from the user, and in that case, as GP said, find a way to examine it as-is, don't "convert".

> Ease of use?

What ease of use? When has futzing around with case ever made anything easier?

> Console commands (i.e. from Quake to minecraft)?

Why would those necessitate changing case?


Nobody is thinking about converting the case of ASCII characters. To be thinking that, they are explicitly excluding most of the world's cultures from entering common names correctly. Restricting thought to ASCII is a lack of thought, not an active thought.



