I would argue that for most programs when you're doing string manipulation you'r...

pistoleer · 2024-10-08T09:23:04 1728379384

> I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure this capitalization is correct and not "Van Den Berg". There is no built in standard library function in any language that does any of that. It's so much larger than ascii and even larger than unicode.

Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.

daemin · 2024-10-08T10:12:48 1728382368

I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.

Further I do say that if you're creating text for presenting to the user then the most common operation would be replacement of some field in pre-defined text.

In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.

pistoleer · 2024-10-08T11:14:11 1728386051

I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.

And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".

heisenzombie · 2024-10-08T09:26:55 1728379615

File paths? I think filesystem paths are generally “bags of bytes” that the OS might interpret as UTF-16 (Windows) or UTF-8 (macOS, Linux).

For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization

vardump · 2024-10-08T09:28:41 1728379721

File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.

So use standard string processing libraries on path names at your own peril.

It's a good idea to consider file paths as a bag of bytes.

netsharc · 2024-10-08T10:00:48 1728381648

IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.

I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption), trying to delete that file didn't work because it would be trying to delete "FOO", and couldn't find it because the file was named "FOo".

Someone · 2024-10-08T09:51:34 1728381094

> It's a good idea to consider file paths as a bag of bytes

(Nitpick: sequence of bytes)

Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)

Unix, from the start, claimed file names where byte sequences, yet assumed many of those to encode ascii.

That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)

daemin · 2024-10-08T09:53:20 1728381200

That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).

numpad0 · 2024-10-09T00:47:23 1728434843

Just using UTF-8 for username at all is problematic. That has been a major PSA item for Windows users in my language literally since 90s and still is. Microsoft switched home folder names from Microsoft Account username to shortened user email for that reason.

account42 · 2024-10-09T15:04:58 1728486298

Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.

BoringTimesGang · 2024-10-08T09:24:02 1728379442

Now double all of that effort, so you can get it to work with Windows' UTF-16 wstrings.

account42 · 2024-10-09T15:10:44 1728486644

Better to just convert WTF-16 (Windows filenames re not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.