Mojibake (wikipedia.org)
95 points by BerislavLopac on Oct 4, 2022 | 34 comments



The funniest story about mojibake is the one about a letter sent by a French person to a Russian address with the mojibake written out as the address; the Russian postal service actually understood what each character meant and decoded it with the right charset.

https://unicodebook.readthedocs.io/definitions.html#mojibake


One of the great parts of the Python ecosystem for data processing is https://ftfy.readthedocs.io/en/latest/ which can handle mojibake and many other Unicode-related translation problems.

But seriously, I'm always a little upset when data vendors/customers/etc don't specify the encoding they are using. You'd be surprised how many official or unique sources still use weird encodings in the name of compatibility.
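For the most common case (UTF-8 bytes that were decoded as Windows-1252), the underlying repair is a single re-encode and decode. This is only a stdlib sketch of that one idea, not how ftfy is implemented; ftfy detects and handles many more cases. The function name `undo_utf8_as_cp1252` is made up for illustration.

```python
def undo_utf8_as_cp1252(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes decoded as Windows-1252'."""
    try:
        # Recover the original bytes, then decode them the way they were meant.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean), so leave it alone.
        return text


garbled = "café".encode("utf-8").decode("cp1252")
print(garbled)                       # cafÃ©
print(undo_utf8_as_cp1252(garbled))  # café
```

The `try`/`except` is what keeps the function from damaging text that was fine to begin with, which is the hard part once you don't know the encoding a vendor used.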


If you want to test ftfy online it's available here:

https://ftfy.vercel.app/


FTFY is amazing. Really useful for processing Excel-generated CSVs.


Funny, I have used it for the same use case (and a sad reminder of how horrific Excel's handling of UTF-8 in CSV files can be...)
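One common workaround when you control the export, assuming the problem is Excel guessing a legacy code page for a CSV that has no byte order mark, is to write the file as UTF-8 with a BOM. A minimal sketch; `csv_bytes_for_excel` is a hypothetical helper name.

```python
import csv
import io


def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows to CSV as UTF-8 with a leading byte order mark."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    # The "utf-8-sig" codec prepends the BOM bytes EF BB BF when encoding.
    return buf.getvalue().encode("utf-8-sig")


data = csv_bytes_for_excel([["name", "city"], ["José", "東京"]])
print(data[:3])  # b'\xef\xbb\xbf'
```

Reading such a file back with `encoding="utf-8-sig"` strips the BOM again, so it does not end up glued to the first header name.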


Mojibake is annoying, but it's also an interesting peek into the inner workings of how computers store and handle data (in this case, characters). It's one of the more mundane examples of "computers store data in 1s and 0s" in action.

Incidentally, a lot of Japanese software is still written for Shift-JIS, so it's still fairly tedious trying to run it in an environment that's not set to Shift-JIS. I wonder if there's an AppLocale equivalent for Windows 11...


I’ve been using Japanese on computers on a daily basis since the mid-1990s. Mojibaké used to be a regular headache, but fortunately I rarely encounter it now.

Most of the mojibaké I do see appears when staff at the Japanese university where I teach send around zipped folders of files with Japanese names. When unzipped, the files often have mojibakéd names. I haven’t yet found a way to repair them. (The contents of the files are fine.)

Almost all of the staff use Windows computers, while I and most of the rest of the faculty use Macs.


ZIP famously doesn't specify any character encoding for its file names (a later version of APPNOTE introduced an additional flag bit to signal that UTF-8 is in use, which to my knowledge has never really taken off). Many archivers therefore assumed that names are in the active code page, which meant you could get mojibake even on Windows, and it was in fact a major pain when dealing with ZIP files originating from other East Asian countries. Later archivers generally have an option to set or guess the character encoding. If you still have those files, give one of those archivers a try.
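The same repair can be sketched with Python's `zipfile` module. When the UTF-8 flag (bit 11, `0x800`) is not set, `zipfile` decodes entry names as CP437 by default, so re-encoding to CP437 recovers the raw bytes, which can then be decoded with the code page you believe the archiver used. `repair_zip_name` and `list_repaired_names` are hypothetical helper names, and `cp932` (Windows Shift-JIS) is only an assumed default.

```python
import zipfile


def repair_zip_name(name: str, flag_bits: int, encoding: str = "cp932") -> str:
    """Re-decode a ZIP entry name that zipfile decoded as CP437."""
    if flag_bits & 0x800:
        # UTF-8 flag is set: the name was already decoded correctly.
        return name
    try:
        return name.encode("cp437").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The guess was wrong for this entry; keep the name as it is.
        return name


def list_repaired_names(path, encoding: str = "cp932") -> list:
    """Return the entry names of an archive with the repair applied."""
    with zipfile.ZipFile(path) as zf:
        return [repair_zip_name(i.filename, i.flag_bits, encoding)
                for i in zf.infolist()]


# Simulate what zipfile would hand back for a Shift-JIS name without the flag.
broken = "日本語.txt".encode("cp932").decode("cp437")
print(repair_zip_name(broken, flag_bits=0))  # 日本語.txt
```

This only fixes the names as reported; extracting with corrected names still means reading each entry and writing it out under the repaired name yourself.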


WinRAR (GUI): "Options -> Name encoding -> Japanese Shift-JIS".

7-zip (CLI only): 7za.exe x -mcp=932 file.zip

There are also online tools [1] that handle this.

[1] https://ianharmon.github.io/mojibake-fixer/


The Unarchiver tries to guess the correct encoding. GUI is for Mac only, but the CLI is cross-platform.

https://theunarchiver.com/


In Russian this phenomenon is called "бНОПНЯ" (read "b-nop-nya") and was caused by taking the word "Вопрос" (meaning: "question") in win-1251 encoding and reading it as if it was in KOI-8 encoding.

Also this is called "крокозябры" (read: kro-ko-zya-bry, nonsense word, no translation) especially when reading a binary file in a text viewer.
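The бНОПНЯ example can be reproduced with Python's built-in codecs. This assumes the "KOI-8" above means KOI8-R, which is the `koi8_r` codec in Python.

```python
original = "Вопрос"

# Bytes written as Windows-1251, then read as if they were KOI8-R.
garbled = original.encode("cp1251").decode("koi8_r")
print(garbled)   # бНОПНЯ

# Both are single-byte encodings, so the damage is fully reversible.
restored = garbled.encode("koi8_r").decode("cp1251")
print(restored)  # Вопрос
```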


> Also this is called "крокозябры" (read: kro-ko-zya-bry

In Esperanto there's krokodili[0] (literally "to crocodile"), which is used to describe speaking a language other than Esperanto among Esperantists.

This was further adapted into Toki Pona as "kokosila".

I found it funny how similar they are.

[0] https://en.m.wiktionary.org/wiki/krokodili


It's really difficult to explain the concept of mojibake to a software developer who still believes that ASCII is fine.

It's also difficult to explain that they are using the wrong font to render kanji because of Han unification.


The font issue is so challenging. Even Chinese developers (who obviously don’t think ASCII is fine) sometimes won’t understand why using a Chinese font to render kanji is an issue.


There's an inside joke among programmers in Mainland China where the GBK encoding is used.

锟斤拷 (which doesn't mean anything) is the result of interpreting two consecutive UTF-8-encoded replacement characters [1] as GBK.

烫 (hot, scorching) is what you get from interpreting a pair of 0xCC bytes as GBK. In debug mode, Visual Studio will initialize unused memory with 0xCC.

The inside joke is: 手持两把锟斤拷,口中疾呼烫烫烫 (holding a 锟斤拷 in each hand and screaming "hot hot hot").

[1]: https://www.fileformat.info/info/unicode/char/fffd/index.htm
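Both byte patterns can be checked with Python's `gbk` codec. A small sketch; the variable names are just for illustration.

```python
# U+FFFD is EF BF BD in UTF-8, so two of them give six bytes,
# which GBK reads as three two-byte characters.
two_replacements = "\ufffd\ufffd".encode("utf-8")
kun_jin_kao = two_replacements.decode("gbk")
print(kun_jin_kao)  # 锟斤拷

# Debug-fill bytes: each pair of 0xCC bytes is one GBK character.
tang = b"\xcc\xcc".decode("gbk")
print(tang * 3)     # 烫烫烫
```

Unlike the single-byte cases, 锟斤拷 cannot be repaired: the original bytes were already replaced by U+FFFD before the GBK misreading happened.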


As text processing has moved away from encodings like Windows 1252 and Shift-JIS to UTF-8, mojibake has become much less of an issue. It was a frequent mess in the 1990s though.


I remember frequently having to manually select the correct encoding while browsing in the 90s and early 00s. I’m glad it’s gone.


I still regularly see mojibake from Japanese ZIP files. The worst has indeed passed, but it will remain a lingering problem for decades.


This is a legitimate issue when solving merge conflicts via Azure's built-in conflict manager: it will muck you up no matter what if you have any funny punctuation going on.

My previous gig used to have an obscenely contrived scheme of multiple dummy "conflict" branches to solve issues locally whenever a conflict would arise due to that.

Really glad to be off the Microsoft stack today.


Back at one of my first jobs I worked a lot with XML. As a dev you often forget to test some of the odder corner cases, but this one had come up somewhere, so I decided to test it... and lo and behold, we failed horribly. Ever since then, any time I'm using any sort of serialization format I add mojibake to my tests. My usual sequence these days is either Japanese/Chinese or <string of emoji that Hacker News removes> or both. The amount of software that claims to respect encodings but doesn't is quite amazing. Many times it will include things like the XML declaration and then completely ignore it. Ditto for HTML encoding headers and tags, and for byte order marks.
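A round-trip test of that kind might look like the sketch below, using the standard library's `xml.etree.ElementTree`. The sample strings and the `xml_roundtrip` helper are assumptions for illustration, not the commenter's actual test suite. Serializing as ISO-8859-1 forces the parser to honor the XML declaration, because "é" becomes the single byte 0xE9, which is not valid UTF-8 on its own.

```python
import xml.etree.ElementTree as ET

# Accented Latin, Japanese, Cyrillic, and an emoji outside the BMP.
SAMPLES = ["café", "日本語", "Вопрос", "\U0001F600"]


def xml_roundtrip(text: str, encoding: str) -> str:
    """Serialize text into an XML document in `encoding`, then parse it back."""
    elem = ET.Element("value")
    elem.text = text
    # For non-UTF-8 encodings ElementTree emits an XML declaration and
    # writes characters the encoding cannot represent as numeric references.
    data = ET.tostring(elem, encoding=encoding)
    return ET.fromstring(data).text


for enc in ("utf-8", "iso-8859-1"):
    for sample in SAMPLES:
        assert xml_roundtrip(sample, enc) == sample, (enc, sample)
```

Pointing the same samples at your own serializer and parser, instead of ElementTree on both ends, is where the failures described above tend to show up.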


Note that "bake*" in Japanese also means "monster/ghost". I am not sure if it's intended or not, but I can def see this being a magnificent pun in the language, since that alternative translation would be "character monster".

* Note: I'm not sure if this is actually an official alternative meaning in Japanese, an intended pun, or none of the above; these are just my notes thinking this could be a magnificent pun. It uses a different kanji, so it would only be possible if this was regularly written in katakana or only spoken.


It's the same word: 化け(る) bake(ru) means "change, transform, alter, corrupt". So a monster is ''o-bake'', "something which has been changed [into a monster]".

That said, no, most Japanese would not associate ''mojibake'' with "character monsters", it's just "altered characters".


Ah nice, thanks for the clarification! Does that "altered" have the connotation of "corrupt" here? Or could it be altered in any generic way?


Bakemono means something more like "changeling", e.g., a tanuki who is assuming human form. The bake in mojibake has more to do with this concept of changing than with ghosts or monsters.


Just yesterday I watched a talk from NDC Copenhagen by Dylan Beattie about this exact topic. The story which stood out the most was this https://www.youtube.com/watch?v=gd5uJ7Nlvvo&t=22m09s

The whole talk was an interesting watch tho.


Is there software for detection of Mojibake or fixing it where it occurs? (other than ftfy mentioned in another comment)


2022: I still can't have the 'é' in my last name on most of my accounts (bank, taxes).


This literally hit me last week. What a ride!

UTF-8 makes life so much simpler.


As someone from the Balkans, seeing a "C:đ>" prompt in DOS was quite normal for us :)


For Japanese it was C:¥>.


Reminiscent of baud barf


As someone who works on a sensor platform that connects to a seriously arbitrary list of weird things out in the real world to pull data from them, some of which are so poorly documented the only way to work out what baud they are over UART is by just brute forcing it til it looks right: yeah it's very similar. And I need to go fight with this flow-meter now (which is Modbus, but with some of the weirdest data type encodings across registers I've ever seen), wish me luck :')
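The register-level version of "try interpretations until it looks right" can be sketched as follows. This assumes, purely for illustration, an IEEE 754 float32 spread over two 16-bit Modbus registers; the comment does not say what the flow meter actually uses, and `decode_float32` and `candidate_decodings` are hypothetical names.

```python
import struct


def decode_float32(first: int, second: int, *,
                   word_swap: bool = False, byte_swap: bool = False) -> float:
    """Decode two 16-bit registers (in address order) as one float32."""
    words = (second, first) if word_swap else (first, second)
    # Modbus sends each register big-endian; byte_swap models devices
    # that store the two bytes of each register the other way round.
    fmt = "<H" if byte_swap else ">H"
    raw = b"".join(struct.pack(fmt, w) for w in words)
    return struct.unpack(">f", raw)[0]


def candidate_decodings(first: int, second: int) -> dict:
    """All four word/byte order combinations, keyed by (word_swap, byte_swap)."""
    return {(ws, bs): decode_float32(first, second, word_swap=ws, byte_swap=bs)
            for ws in (False, True) for bs in (False, True)}


# 1.0 is 0x3F800000, so in the plain layout the registers are 0x3F80, 0x0000.
print(decode_float32(0x3F80, 0x0000))  # 1.0
```

Comparing the four candidates against a value you can read off the device's own display is usually enough to pick the right layout.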


Godspeed, intrepid hero. Thank you for your service.


the screenshot on the right should include the section with the screenshot of itself



