The funniest story about mojibake is the one about a letter sent from France to a Russian address, where the sender wrote out the mojibake as the address, and the Russian postal service actually worked out what each character meant and decoded it in the right charset.
One of the great parts of the Python ecosystem for data processing is https://ftfy.readthedocs.io/en/latest/ which can handle mojibake and many other unicode-related translation problems.
But seriously, I'm always a little upset when data vendors/customers/etc don't specify the encoding they are using. You'd be surprised how many official or unique sources still use weird encodings in the name of compatibility.
Mojibake is annoying, but it's also an interesting peek into the inner workings of how computers store and handle data (in this case, characters). It's one of the more mundane examples of "computers store data in 1s and 0s" in action.
Incidentally, a lot of Japanese software is still written for Shift-JIS, so it's still fairly tedious trying to run it in an environment that's not set to Shift-JIS. I wonder if there's an AppLocale equivalent for Windows 11...
I’ve been using Japanese on computers on a daily basis since the mid-1990s. Mojibaké used to be a regular headache, but fortunately I rarely encounter it now.
Most of the mojibaké I do see appears when staff at the Japanese university where I teach send around zipped folders of files with Japanese names. When unzipped, the files often have mojibakéd names. I haven’t yet found a way to repair them. (The contents of the files are fine.)
Almost all of the staff use Windows computers, while I and most of the rest of the faculty use Macs.
ZIP famously doesn't specify any character encoding for its file names (a later version of APPNOTE introduced an additional bit to signal that UTF-8 is in use, which to my knowledge has not really taken off). Many archivers therefore assumed that names are in the active code page, which meant you could experience mojibake even on Windows, and in fact it was a major pain when dealing with ZIP files originating from other East Asian countries. Later archivers generally have an option to set or guess the character encoding; if you still have those files, try them.
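If no archiver with that option is at hand, the repair can also be tried by hand. A minimal Python sketch, assuming the archive came from a Japanese Windows machine (so the names are really in code page 932) and relying on the fact that Python's zipfile module decodes names as CP437 when the UTF-8 flag is not set. The helper names repair_zip_name and repaired_names are made up for illustration.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8


def repair_zip_name(name: str, actual_encoding: str = "cp932") -> str:
    """Undo a CP437 mis-decoding, assuming the bytes were really actual_encoding."""
    try:
        # Get the original bytes back, then decode them with the guessed encoding.
        return name.encode("cp437").decode(actual_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not CP437 mojibake, or the guess was wrong: leave the name alone.
        return name


def repaired_names(zip_path: str, actual_encoding: str = "cp932"):
    """Yield (stored_name, repaired_name) pairs for every entry in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                # The archiver declared UTF-8, so the name is already correct.
                yield info.filename, info.filename
            else:
                yield info.filename, repair_zip_name(info.filename, actual_encoding)
```

On Python 3.11 or later, passing metadata_encoding="cp932" to zipfile.ZipFile when reading should do the same job without the round trip. The guess of cp932 is only right for Japanese archives; other regions need a different code page.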
In Russian this phenomenon is called "бНОПНЯ" (read "b-nop-nya"), which comes from taking the word "Вопрос" (meaning: "question") in Windows-1251 encoding and reading it as if it were in KOI-8 encoding.
Also this is called "крокозябры" (read: kro-ko-zya-bry, nonsense word, no translation) especially when reading a binary file in a text viewer.
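The "бНОПНЯ" example is easy to reproduce. A small Python sketch, assuming the KOI-8 variant in question is KOI8-R (Python's koi8_r codec); the function names are just for illustration.

```python
def garble(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one charset and decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


def ungarble(text: str, written_as: str, read_as: str) -> str:
    """Reverse garble(): recover the bytes, then decode them correctly."""
    return text.encode(read_as).decode(written_as)


garbled = garble("Вопрос", written_as="cp1251", read_as="koi8_r")
print(garbled)  # бНОПНЯ
print(ungarble(garbled, written_as="cp1251", read_as="koi8_r"))  # Вопрос
```

The reverse step only works because every byte involved maps to some character in both charsets; with lossy decoders (replacement characters, stripped bytes) the original cannot be recovered this way.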
The font issue is so challenging. Even Chinese developers (who obviously don’t think ASCII is fine) sometimes won’t understand why using a Chinese font to render kanji is an issue.
As text processing has moved away from encodings like Windows-1252 and Shift-JIS to UTF-8, mojibake has become much less of an issue. It was a frequent mess in the 1990s though.
This is a legitimate issue when solving merge conflicts via Azure's built-in conflict manager: it will mess up your files no matter what if you have any funny punctuation going on.
My previous gig used to have an obscenely contrived scheme of multiple dummy "conflict" branches to solve issues locally whenever a conflict would arise due to that.
So back at one of my first jobs I worked a lot with XML. As a dev you often forget to test some of the odder corner cases, but this had come up somewhere and I decided to test it... and lo and behold, we failed horribly. Ever since then, any time I'm using any sort of serialization format I add mojibake to my tests. My usual sequence these days is either Japanese/Chinese or <string of emoji that hacker news removes> or both. The amount of software that claims to respect encodings but doesn't is quite amazing. Many times they'll include things like the XML declaration and then completely ignore it. Ditto HTML and encoding headers and tags, also byte order marks.
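A rough sketch of what such a test can look like in Python, using only the standard library. The sample strings and function names here are my own placeholders, not anything from a particular test suite; the idea is just to push non-ASCII text through encode and decode and check it comes back unchanged.

```python
import json
import xml.etree.ElementTree as ET

# Japanese, Chinese, accented Latin, and emoji outside the Basic Multilingual Plane.
SAMPLES = ["日本語のテキスト", "中文文本", "naïve café", "\U0001F600\U0001F680"]


def json_round_trip(text: str) -> str:
    """Serialize to UTF-8 JSON bytes and parse them back."""
    payload = json.dumps({"value": text}, ensure_ascii=False).encode("utf-8")
    return json.loads(payload.decode("utf-8"))["value"]


def xml_round_trip(text: str) -> str:
    """Serialize to UTF-8 XML bytes with an explicit declaration and parse them back."""
    elem = ET.Element("value")
    elem.text = text
    payload = ET.tostring(elem, encoding="utf-8", xml_declaration=True)
    return ET.fromstring(payload).text


for sample in SAMPLES:
    assert json_round_trip(sample) == sample
    assert xml_round_trip(sample) == sample
```

The interesting failures usually show up when the bytes cross a boundary you don't control (a database column, an HTTP header, another team's parser), so the round trip is worth running through the real pipeline rather than only in memory like this.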
Note that "bake*" in Japanese also means "monster/ghost". I am not sure if it's intended or not, but I can def see this being a magnificent pun in the language, since that alt translation would be "character monster".
* Note: not sure if this is actually an official alt meaning in Japanese, an intended pun, or none of the above; these are just my notes on why this could be a magnificent pun. It uses a different kanji, so it would only be possible if this were regularly written in katakana or only spoken.
It's the same word: 化け(る) bake(ru) means "change, transform, alter, corrupt". So a monster is ''o-bake'', "something which has been changed [into a monster]".
That said, no, most Japanese would not associate ''mojibake'' with "character monsters", it's just "altered characters".
Bakemono means something more like "changeling", e.g., a tanuki who is assuming human form. The bake in mojibake has more to do with this concept of changing than with ghosts or monsters.
As someone who works on a sensor platform that connects to a seriously arbitrary list of weird things out in the real world to pull data from them, some of which are so poorly documented that the only way to work out what baud rate they use over UART is to brute force it til it looks right: yeah, it's very similar. And I need to go fight with this flow-meter now (which is Modbus, but with some of the weirdest data type encodings across registers I've ever seen), wish me luck :')
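To make the parallel concrete, here is a hedged Python sketch of the same "guess the encoding until it looks right" game for a 32-bit float spread across two 16-bit Modbus registers. The four layouts tried are the common word-order and byte-order combinations, not anything specific to a particular flow-meter, and the function names are invented for illustration.

```python
import struct
from itertools import product


def decode_float32(regs, word_swap=False, byte_swap=False):
    """Decode two 16-bit register values into an IEEE 754 float32.

    regs is the pair of register values in the order they were read.
    word_swap reverses the two registers; byte_swap swaps the bytes in each.
    """
    words = list(regs)
    if word_swap:
        words.reverse()
    fmt = "<H" if byte_swap else ">H"
    raw = b"".join(struct.pack(fmt, w) for w in words)
    return struct.unpack(">f", raw)[0]


def candidate_decodings(regs):
    """Try every word/byte order combination and return all four readings."""
    return {
        (word_swap, byte_swap): decode_float32(regs, word_swap, byte_swap)
        for word_swap, byte_swap in product((False, True), repeat=2)
    }


# 12.5 is 0x41480000 as a big-endian float32.
print(decode_float32((0x4148, 0x0000)))                  # 12.5
print(decode_float32((0x0000, 0x4148), word_swap=True))  # 12.5
```

As with text, only one of the candidates usually looks plausible for the quantity being measured, which is how you end up picking the layout when the datasheet doesn't say.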
https://unicodebook.readthedocs.io/definitions.html#mojibake