Which is why we have languages like Go where we can put those types of developers. Incidentally, Go uses UTF-8. Higher-level languages like Go, Python, etc. were designed so newbie and/or ignorant programmers could do less damage.
When I was working on a project before Unicode, we would switch our dev PCs to the other languages we supported. What a pain that was. The only issues we had were when a translated string was much longer than the screen space allocated to it; I believe Swedish was the main culprit. No problems with simplified and traditional Chinese, as those were more compact. I have no sympathy for dev shops that can't get internationalization right. As with everything else in the corporate dev world, management doesn't seem to want to hire/retain the more experienced programmers.
I think you have a gripe with my argument because you may be missing my point. If a high-level language chooses to let a programmer index into a UTF-8 string at the byte level (for performance and other reasons), it's very easy for it to prevent the programmer from slicing in the middle of a code unit.
The reason is that the language's function to slice a Unicode string would either throw an exception or just advance to the next valid index. There would be no way for the programmer to slice a Unicode string in the middle of a code unit.
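A minimal sketch of the advance-to-the-next-valid-index approach in Python (the helper names are my own; a real language runtime would do this inside its slicing primitive). UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so a valid boundary can be found by skipping past them:

```python
def snap_forward(data: bytes, i: int) -> int:
    """Advance i past any UTF-8 continuation bytes (0b10xxxxxx)."""
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

def safe_slice(data: bytes, start: int, end: int) -> bytes:
    """Slice at byte indices, but never in the middle of a code point."""
    return data[snap_forward(data, start):snap_forward(data, end)]

raw = "héllo".encode("utf-8")                 # b'h\xc3\xa9llo' - 'é' is two bytes
print(safe_slice(raw, 0, 2).decode("utf-8"))  # 'hé' - end index 2 advanced to 3
```

The same check could just as easily raise an exception instead of advancing; either way the invalid byte index never reaches the programmer.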
> I think you have a gripe with my argument because you may be missing my point
I get your point; it just doesn't apply to many real-world situations I've seen, where you don't have the luxury of using a higher-level language or a library that takes care of all these things, or of keeping programmers who don't understand what they are doing away from that sort of thing.
The most egregious example that I've personally seen was a developer working on a legacy Cobol banking program that needed Chinese support retro-fitted to it.
The app was originally only developed with ASCII in mind and so sliced through strings willy-nilly, which naturally caused problems with Chinese text.
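To illustrate the failure mode (my own example, not the actual banking code): each of these Chinese characters is three bytes in UTF-8, so any ASCII-era byte-oriented split that ignores character boundaries lands mid-character.

```python
text = "银行账户"                  # "bank account": 3 bytes per character in UTF-8
raw = text.encode("utf-8")        # 12 bytes total
chunk = raw[:4]                   # naive ASCII-style slice cuts the second character
print(chunk.decode("utf-8", errors="replace"))  # '银�' - trailing byte is garbage
```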
The developer working on the "fix" before me was calling out to ICU through the C API of the version of Cobol that we used, and was still messing things up - he'd actually modified ICU in some custom way to prevent the bug from crashing the program, but it was still corrupting text.
I basically undid all his changes and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary. Much simpler, and it removed an unnecessary dependency on ICU.
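The wrapper described above might look something like this (a Python sketch of the idea, not the actual COBOL code): it trims a leading run of continuation bytes and a trailing incomplete multi-byte sequence, so whatever survives decodes cleanly. It only repairs truncation at the edges; it doesn't validate the interior of the chunk.

```python
def trim_to_boundaries(data: bytes) -> bytes:
    """Drop partial UTF-8 sequences at either end of a byte-sliced chunk."""
    start = 0
    # A continuation byte matches 10xxxxxx; a chunk can't begin with one.
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    end = len(data)
    # Walk back over trailing continuation bytes to find the last lead byte.
    i = end
    while i > start and (data[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i > start and data[i - 1] >= 0xC0:        # multi-byte lead byte
        lead = data[i - 1]
        expected = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        if end - (i - 1) < expected:             # sequence was cut short
            end = i - 1                          # drop the partial sequence
    return data[start:end]

raw = "银行".encode("utf-8")  # 6 bytes, 3 per character
# raw[2:6] starts with the last byte of '银'; that stray byte gets dropped
print(trim_to_boundaries(raw[2:6]).decode("utf-8"))  # '行'
```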
This bug had been outstanding for several months when I first joined that company, and it was the first one I was assigned to work on - and luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.
> it's very easy for it to prevent the programmer from slicing in the middle of a code unit.
Okay, but even you made a mistake in your first example of what to do. That's the sort of code someone who knows what they are doing could write, and it will seem to work under the conditions in which it was tested ("works on my machine, ship it!"), but it will cause seemingly random problems once it hits users.
> I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things
No, I still think you're missing some of it. I am not advocating that what I said is the solution for everything.
Someone said that slicing UTF-8 strings leads to string corruption and endorsed the Python 3 Frankenstein unicode type as a way to avoid it. I just gave a way of preventing that.
Now you've argued that a novice programmer would fail to implement it properly. So you're comparing my method implemented by a novice programmer to a method implemented by professional compiler writers. That hardly seems fair. :)
So my argument is that if my method were implemented by professional compiler writers, it would prevent corrupted strings while still using UTF-8 as the internal representation.
> I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary.
> luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.
So an expert programmer implemented a string splitting function that didn't corrupt strings. :D
> but even you made a mistake in your first example of what to do
I'm writing this on an iPad while watching TV and playing a game on another Android tablet, while looking at the Wikipedia UTF-8 article on a tiny phone screen, while a little white dog is trying to bite my fingers (wish I was making this up). Not exactly my usual programming environment. ;)