utf8 vs utf8mb4 bit me in one of the most frustrating bugs I've ever had to debug.
At Scribophile members can write critiques for people's writing, with comments inserted inline. The underlying software dates back to the PHP5 days when MySQL only had the utf8 option. Everything was working fine for years, when all of a sudden users started complaining that from time to time, they would submit a critique and it would be mysteriously cut off at random places.
The problem was very intermittent, didn't happen very often, and there was no seeming rhyme or reason to it. But when it did happen, it was catastrophic because members would lose hours of work, seemingly at random!
All kinds of testing scaffolding and logging was put in place to try to find the problem with no luck. Then, after quite some time, we realized what the problem was.
At that time, emoji keyboards were brand new; most phones/tablets had limited support, and people didn't yet use phones and tablets for everything like they do now. But they were out there. Some users who were using these new emoji keyboards were inserting emoji smiley faces as they were writing critiques. In Scribophile's web interface, everything looked fine; but when the user submitted the critique, MySQL tried to insert a multibyte Unicode character into a regular utf8 field, and SILENTLY threw that character and all of the data after it away!!
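To make it concrete, here's a minimal sketch of that failure mode (table and column names are made up): with the old non-strict sql_mode, MySQL keeps everything up to the first 4-byte character, throws the rest away, and only raises a warning.

    -- column declared with the old 3-byte "utf8" (utf8mb3) character set
    CREATE TABLE critiques (body TEXT CHARACTER SET utf8);
    SET SESSION sql_mode = '';   -- non-strict, the default back then
    INSERT INTO critiques VALUES ('Great pacing 😀 but the ending drags');
    -- Query OK, 1 warning: Incorrect string value: '\xF0\x9F\x98\x80 bu...' for column 'body'
    SELECT body FROM critiques;  -- returns only 'Great pacing '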
Boy were we upset at MySQL about that one. But at least we figured it out!
I had the pleasure of finding that same behaviour in an application around 2015, when someone commented on a food order: "Extra cheese please :folded hands: but no shrimp, I'm allergic."
But one thing I haven't found out is why some emojis did go through. The basic ones, like a simple :smile:, seemed to make it, from what I could tell. Could it be that some only need 3 bytes, and as more and more emojis got released, the new ones went into 4 bytes?
Edit: HN stripped the emojis from my comment... I would put a rolling eyes emoji here if I could.
Exactly that - three byte emojis work fine in utf8 but 4 byte ones need utf8mb4 - and the four byte ones are the new ones that support the "color variations" and other similar things.
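You can check the split directly in MySQL if you're curious; a rough sketch, assuming a utf8mb4 connection (LENGTH() counts bytes, CHAR_LENGTH() counts characters):

    SELECT LENGTH('☺'), CHAR_LENGTH('☺');    -- 3 bytes, 1 character (U+263A, in the BMP)
    SELECT LENGTH('😀'), CHAR_LENGTH('😀');   -- 4 bytes, 1 character (U+1F600, outside the BMP)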
Apple adding those probably caused more upgrades to utf8mb4 than any amount of pleading from languages that actually needed them ever would have.
Sounds like you’re looking at how many scalar values are in an extended grapheme cluster (e.g. U+1F635 U+200D U+1F4AB dizzy face + zero width joiner + dizzy symbol, that’s one “character” made up of three scalar values).
The thing in question here is UTF-8 code units (bytes), and how many UTF-8 needs to encode a single scalar value. UTF-8 needs four bytes for anything above U+FFFF, and almost all emoji are above that.
I've been struggling with fixing some edge cases with emojis in a latin1-encoded DB these past couple of days, and what you mention here is the most important thing I found yesterday (coincidence).
Unicode strings are made up of Unicode code points, but some code points combine with others to make graphemes (single visual characters). Some emojis, like flags (U+1F1EA U+1F1E6), are made of two code points.
Correction: Unicode strings are made up of scalar values, not code points. The difference is that accursed mess that UTF-16 foisted upon us, surrogates; U+D800–U+DFFF are code points, but are not valid scalar values and cannot occur in Unicode strings (even if things like JavaScript are stupid in allowing potentially-malformed UTF-16, which is incompatible with Unicode, and Python similarly stupid in having strings that are not Unicode but are instead sequences of code points).
To give a slightly more complete answer as to why, utf-8 was designed to be both backwards compatible with ASCII and self-synchronizing.
So you can't use half the values of a byte when encoding code points outside of ASCII, and the remaining 128 values need to be split between starting bytes and continuation bytes. At best you're going to fit about 6 bits into each byte, and unicode spans the equivalent of just over 20 bits, so you need 4 bytes max.
utf-8 opts to keep encoding very simple, and in 2-4 bytes it can store 11, 16, and 21 bits respectively.
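For reference, these are the standard UTF-8 byte templates, which is where those payload figures come from (the 5- and 6-byte forms were in the original spec but are invalid today):

    1 byte:  0xxxxxxx                                                7 bits
    2 bytes: 110xxxxx 10xxxxxx                                      11 bits
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx                             16 bits
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                    21 bits
    5 bytes: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx           26 bits (no longer valid)
    6 bytes: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  31 bits (no longer valid)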
If you discarded ASCII compatibility, you could create a 1-3 byte self-synchronizing unicode encoding. But at that point maybe you should be compressing your text instead.
Or you could weaken the compatibility and create "utf-1 but better" to fit into 3 bytes.
> Actually it does seem feasible to make an encoding that's backwards compatible with all printable ASCII characters, and uses 3 bytes for all other code points.
Sorry, you can’t squeeze Unicode in that way, even unsynchronised: you fall just over 2½ bits short. No, reserving the 95 printable ASCII characters scuttles your hopes, meaning you get less than 1½ bits of that byte to use for the rest of Unicode, when you needed a touch over 4 bits.
To represent all the rest: Unicode scalar values minus printable ASCII is U+0000–U+10FFFF minus U+D800–U+DFFF minus U+0020–U+007E: 1,111,969 possibilities, log₂(1111969) ≈ 20.085 bits of entropy. (Removing the 2048 surrogates and 95 printable ASCII characters barely made a dent, under 0.003 bits, but I had to remove it for informational correctness.)
95 printable ASCII characters, log₂(95) ≈ 6.57 bits. Subtract that reservation from the three bytes available and you’ve got about 17.43 bits left for the rest of Unicode, but you needed over 20.
[Edit: I observe you’ve removed what I was quoting before I posted my comment. Still fun mathematics.]
> If you discarded ASCII compatibility, you could create a 1-3 byte self-synchronizing unicode encoding.
Not quite. If you discarded ASCII compatibility, then you’d have almost four bits to spare, which is enough for a self-synchronising fixed-width 3-byte encoding (reserving the first bit of each byte to signal if it’s a start byte) or for an unsynchronised variable-width encoding of some kind; but to make it even 2-or-3 or 1-or-3 (rather than 1–3) bytes wide variable-width, you’d need four bits (3 + log₂ 2) for self-synchronisation, and you’re about 0.085 bits short. (1–3 bytes variable-width would require 3 + log₂ 3 ≈ 4.58 bits.)
> Sorry, you can’t squeeze Unicode in that way, even unsynchronised: you fall just over 2½ bits short. No, reserving the 95 printable ASCII characters scuttles your hopes, meaning you get less than 1½ bits of that byte to use for the rest of Unicode, when you needed a touch over 4 bits.
So it depends on exactly what we mean by compatible, and that's why I removed it.
But in particular, the encoding I had in mind when I wrote that was one that allows most ASCII characters as the middle byte of a triplet. That way whether you truncate from beginning or end you'll never have a rogue byte successfully decode.
So something like: Lead bytes from C0-FF, middle bytes from 00-BF, final bytes from 00-1F or 7F-BF.
If you also exclude null, tab, carriage return, and line feed, that gives you 64 x 188 x 93 = 1,118,976, which is just enough for every code point.
But arguably that's not compatible enough because you couldn't search for single ASCII characters with a dumb byte-wise algorithm.
> Not quite. If you discarded ASCII compatibility, then you’d have almost four bits to spare, which is enough for a self-synchronising fixed-width 3-byte encoding (reserving the first bit of each byte to signal if it’s a start byte) or for an unsynchronised variable-width encoding of some kind; but to make it even 2-or-3 or 1-or-3 (rather than 1–3) bytes wide variable-width, you’d need four bits (3 + log₂ 2) for self-synchronisation, and you’re about 0.085 bits short. (1–3 bytes variable-width would require 3 + log₂ 3 ≈ 4.58 bits.)
I don't know where you got that equation but it's not the right one for the situation.
Here's a really simple encoding just to disprove it: Leading bytes encode the plane (17 options) plus two more bits into decimal values 0 through 67. Continuation bytes encode 7 bits into decimal values 128 through 255. 17 planes + 2 bits + 7 bits + 7 bits, perfect fit. The remaining values, 68-127, can be used for single byte (and two byte) encodings.
But also the definition I was using for self-synchronizing doesn't seem to match the one on wikipedia. My intent was a code where you can guarantee sync after a specific small number of bytes (unlike UTF-1). In particular, if you let standalone bytes and continuation bytes overlap each other, you can still guarantee sync within 3 bytes. This gives you lots of space to play with, and you could make a very useful encoding. You could have >150 single byte codepoints and thousands of double byte codepoints and fit everything else into three bytes.
My (3 + log₂ 2) came from reckoning that you needed one whole bit as a signal in each byte (3), and that the byte-width thing (log₂ 2) was extra to that. As you correctly point out, this was incorrect: the first byte doesn’t need one full bit of entropy to signal that it’s the first byte.
I use “—⁂—” as my horizontal rule in plain text, via my Compose key (Compose+h+r).
HN removes things that are officially marked as emoji, normalises whitespace to space (which annoys me because I sometimes want to type things like narrow no-break space in equations which it makes wider and loses the no-breakyness of), and that’s all I’ve noticed. Control characters probably get some treatment or other too.
If 21 bits seems like an odd place to stop, it’s because the Unicode consortium said they wouldn’t allocate any codepoints beyond U+10FFFF (the end of plane 16), because that’s as far as UTF-16 could go using a four-byte surrogate pair (10+10 bits, offset by U+10000).
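Worked out: a surrogate pair carries 10 + 10 = 20 bits, i.e. 2²⁰ = 1,048,576 supplementary code points starting at U+10000, so the last addressable one is 0x10000 + 0xFFFFF = 0x10FFFF.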
Not that UTF-8 really stops at 21. As originally formulated it had 5- and 6-byte encodings for 26 and 31 bits, but the full 31-bit range of code points was later decided to be unnecessary, and those encodings were marked as invalid.
What I am curious about is: why couldn't they fix it?
At one point the change was made from 6 to 3 bytes; I expect this made the on-disk structure incompatible with prior versions. Why were they able to make that change but unable to make a later change, say from 3 to 4 bytes?
I am not as familiar with MySQL, but note that Postgres major versions are not disk-compatible with each other (you have to dump and restore). I expect this is the exact reason: you want a system where it is possible to make changes.
See also: the drama around Linux ABI changes vs OpenBSD ABI changes.
To meme off of Raul Julia, OpenBSD to Linux: "For you, the day your ABI changed was the most important day of your life. But for us, it was Tuesday."
VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case. But if these columns were silently/automatically converted to utf8mb4 upon upgrading MySQL, they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.
IIRC, there were additional problems in older versions of MySQL, situations where sort buffers were sized to a fixed length equal to the value's worst-case size or something like that. So sorting a large number of utf8mb4 values would use a lot more memory than utf8mb3 values (again, iirc, I might be wrong on this).
So the safer and more backwards-compatible approach was to introduce utf8mb4 as a new separate charset, and allow users to choose. MySQL 8 is now transitioning towards deprecating utf8mb3, and will finally make the utf8 alias point to utf8mb4 sometime in the near future.
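For anyone still carrying old columns around, the usual migration is a per-table conversion; a sketch, with a placeholder table name and collation (check your index key lengths first, and expect the table to be rebuilt):

    ALTER TABLE critiques
      CONVERT TO CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;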
That said, there are still a bunch of unpleasant uses of utf8mb3 internally in things like information_schema. I develop schema management tooling and recently lost a week to writing work-arounds for some of the more obscure ones, see https://github.com/skeema/skeema/commit/bf38edb :)
> they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.
Specifically, indexes are limited to 767 bytes (some say 768?), which means 191 utf8mb4 characters. So your otherwise-innocuous VARCHAR(255) primary key breaks. I’ve seen this multiple times, e.g. django-celery probably still has this: https://github.com/celery/django-celery/issues/259.
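It reproduces in one statement; a sketch with a hypothetical table, on the old 767-byte default:

    CREATE TABLE tasks (
      task_id VARCHAR(255) CHARACTER SET utf8mb4,
      PRIMARY KEY (task_id)
    );
    -- ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
    -- 255 chars x 4 bytes = 1020 > 767, whereas 255 x 3 = 765 squeaked in under utf8mb3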
fwiw the limit is effectively 3072 bytes nowadays, default since MySQL 5.7.
You could actually bump it up to 3072 in 5.5 and 5.6 too, but it required changing several settings (innodb_large_prefix, innodb_file_format, innodb_file_per_table, and the table's ROW_FORMAT).
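From memory, roughly this (all of it is the default, or gone entirely, in 5.7+; the table name is a placeholder):

    SET GLOBAL innodb_file_format = 'Barracuda';
    SET GLOBAL innodb_file_per_table = ON;
    SET GLOBAL innodb_large_prefix = ON;
    ALTER TABLE tasks ROW_FORMAT = DYNAMIC;   -- or COMPRESSED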
Cool, didn’t know that; thanks for the correction. I haven’t touched MySQL since 2016, and I suppose what I worked with then wasn’t 5.7, or was configured differently for historical reasons, or something.
What's the problem with allowing VARCHAR(N) store a max of 3N bytes worst-case, but still allow the 4-byte characters to be present in the string? Sure, it means that you can only store 0.75 * N emojis in a VARCHAR(N), but that seems less bad than the alternative of silently corrupting data.
4-byte characters are simply not valid in the utf8mb3 character set. The column type defines what values may be stored in the column, and the character set is inherently part of the column type for textual data.
So your question is similar to asking, what's the problem with allowing storage of negative numbers in an unsigned int column. The answer is those bytes inherently have a different meaning, or may be completely invalid, depending on the column type.
> that seems less bad than the alternative of silently corrupting data
MySQL has defaulted to a strict sql_mode out-of-the-box since MySQL 5.7, released over 6 years ago. With strict sql_mode, no silent corruption of data occurs. If you attempt to insert a 4-byte character into a utf8mb3 column, your insert fails with an error.
Everything older than 5.7 has hit end-of-life, so there is literally no supported version of MySQL which doesn't default to enabling strict mode.
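Concretely, with strict mode the kind of statement from the story upthread fails loudly instead of truncating (table and column names are hypothetical):

    SET SESSION sql_mode = 'STRICT_TRANS_TABLES';
    INSERT INTO critiques (body) VALUES ('Great pacing 😀 but the ending drags');
    -- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80 bu...' for column 'body' at row 1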
The only problem is that database-as-a-service providers use nonstandard defaults, so e.g. AWS RDS doesn't come with strict mode enabled, for whatever completely insane reason. But that's technically an Amazon problem, not a MySQL problem.
Yes, you're correct that utf8mb3 is strictly a subset of utf8mb4, and the binary value of any valid utf8mb3 character is identical to its utf8mb4 representation. But I'd argue that's irrelevant.
I meant my statement generically: the meaning of the bytes stored in a column value depend entirely on the column type, as does the range of values that may be stored or converted.
If a column's type is, say, VARCHAR(100) CHARACTER SET utf8mb3, this means the column can store up to 100 utf8 characters and each character has a maximum length of 3 bytes. It is simply not legal to store 4-byte characters in this data type, just as it isn't legal to store 101 characters in it.
The proposal in this subthread seems to be, instead of introducing utf8mb4, why didn't MySQL allow "utf8" (utf8mb3) columns to store 4-byte characters, just without affecting the maximum number of bytes that can be stored in the column? So, a VARCHAR(100) CHARACTER SET utf8 might only be able to store max 75 characters if they're all emoji. Additionally, it would mean that the range of values allowed in a column would depend on the DB server version, and would change upon upgrading. This just seems completely counterintuitive and wrong to me, relative to how all other column type definitions work. The modern progression of MySQL has been to remove these obvious-WTF problems, not introduce more of them :)
The proposal was described as "seems less bad than the alternative of silently corrupting data", but that isn't the alternative (silent data corruption doesn't occur with strict mode which is enabled by default), so this comparison doesn't make sense.
Everything that decodes that utf-8 has to support 21-bit codepoints. If they used a uint16_t anywhere in the processing chain then it's restricted to the BMP.
You are correct, it is a disk-format issue, and MySQL officially supports in-place upgrade between versions.
It's actually quite hard to fix bugs in charset/collations, because any changes to the sort order implicitly could affect the on-disk format (indexes are sorted).
The 2002 diff got me interested in the context of why that change was made, which led me to this 2006 bug which I think might shed light on why '3' was chosen:
"The UTF-8 encoding is a variable byte length encoding. The ASCII subrange of characters is one byte long, alphabetic writings 2 bytes long, Far Eastern writings 3 bytes long per character (and some rare historical writings are mapped using 4 bytes)."
UTF-8 is still something I'm learning the details about, but it looks like back in 2002 they simply didn't see the need for the supplementary plane characters, i.e., playing cards, arrows, or ancient Greek numbers. It might be short-sighted, but we shouldn't judge from so far in the future, recalling that emoji weren't even in Unicode until 2010.
If it were called just "string" or something, with a quiet explanation somewhere that it's actually only the BMP and not all of Unicode, that wouldn't be so surprising; but they deliberately called it UTF-8 even though it just isn't.
Imagine if you bought a USB mouse and then you discovered, oh, they didn't bother implementing USB, this is just a PS/2 mouse with the USB connector. Works fine so long as you use the supplied PS/2 "adaptor" but it isn't actually USB even though it says it is on the box. That mouse is defective. MySQL is defective as supplied. Which won't come as a surprise to many experienced MySQL developers.
I actually rather like MySQL, but it's like bash or Excel, it is a good tool for quick estimates or prototyping, and a poor tool for actual engineering. For the cost of all the work needed to get reliable high quality results with MySQL you could have used a different database where the baseline effort to get it working was higher but what you get is more solid, and "utf8mb4" is an example of that extra work in MySQL.
While searching yesterday, I recall the MySQL documentation did mention this quirk.
As I gather, this change happened prior to or at the change to the 4-byte max for UTF-8; was supplementary plane support required at that time? 3-byte sequences are still valid UTF-8, just not the full range, but I really do not know.
Not to defend them, just saying things were new and fluid back then, though it should have been fixed properly a lot sooner.
In modern MySQL the documentation does - if you look - tell you that utf8mb4 exists to actually do UTF-8, and if you think to check, it mentions that their utf8 is not in fact UTF-8 but only "utf8mb3".
But fifteen years ago most of that documentation doesn't exist, and utf8mb4 doesn't exist, and even if you read the MySQL documentation carefully and realise "utf8" isn't UTF-8 and only encodes the Basic Multilingual Plane, you are stuck anyway: MySQL doesn't have a sensible way forward for you.
I would argue that by 2002 when this bug (let's call it what it is) was introduced, the writing was on the wall, UCS-2 is dead, which means the "it's just easier" justification for choosing 16-bit is also dead, so UTF-8 is the natural choice for software that isn't meanwhile committed to a 16-bit encoding anyway (as Java and Windows are). That makes shipping UTF-8 a good idea, but it makes choosing utf8mb3 a really weird idea that already seems certain to cause pain.
Could be. And it's weird, especially when supporting all 4 bytes would (seemingly) have been easy. For fun I did dig around though, and found that even in the earliest documentation they clearly state only supporting 3 bytes, though not why. The 2005 docs say this, which only justifies not supporting 5/6-byte sequences, but not four. So confusing:
"RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL UTF8 support does not include four-byte sequences. (An older standard for UTF8 encoding is given by RFC 2279, which describes UTF8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)" ...Wha?
I personally think anything critical that was worth storing in a new character format still should have been properly tested and the documentation consulted; that is on the end user. Anything not critical where you wouldn't expect to care... still should have been tested, but yeah, that sucks.
What we can judge harshly is that it accepted the 4-byte sequences and failed silently, corrupting data.
A sane system would reject the 4-byte sequences so you could try again without the emojis.
But if users got errors, they might investigate, and it would slow down the rapid growth MySQL was aiming for. So they just decided not to throw errors.
I agree; although the change was prior to 4-byte UTF-8 (2002 vs 2003), the same issue would have manifested with 6-byte sequences. Very curious.
It would be really interesting to see a postmortem by MySQL if ever possible, because the more I think about it, 3 is such an odd choice (pun intended), and people usually do most things for a reason haha. It's either lost to time, or a brainfart off-by-one.
UTF-8 used to support up to 6 bytes to encode a unicode scalar value. Only later was the unicode range restricted such that a max of 4 bytes was required to encode any scalar value using UTF-8.
It looks like when UTF-8 was originally added to MySQL it still supported up to 6 bytes per "character", so the decision to restrict it was actually made by both MySQL and RFC 3629 - unfortunately they chose different limits :(
What happened is Unicode committed forever to a range zero to U+10FFFF and RFC 3629 follows that. You don't need 5 or 6 byte UTF-8 to express this range, and so they're forbidden.
So RFC 3629 isn't choosing arbitrarily, the choice in MySQL isn't entirely arbitrary either, it's just wrong (and should not have been named "utf8").
You could use the 3-byte UTF-8-like encoding MySQL has to do CESU-8 but that's not UTF-8. If MySQL wanted to do CESU-8 they could have called the type cesu8, giving away what's actually going on and necessitating a lot of extra work by MySQL. They did not choose to do that.
Right, but my point was that the "U+10FFFF" limit hadn't been decided yet at the point where MySQL first added the "UTF-8" encoding, and it was added later, so it may not have been clear to them that UTF-8 would ever be a "fixed target" so to speak.
In retrospect we can see that UTF-8 has been and will continue to be static, since it's become so universally adopted.
Speaking of expensive screwups, one of the most expensive schematic errors ever was the routing of the interrupts on the original IBM PC. It put the serial interrupt on the lowest-priority interrupt pin, instead of the highest. This spawned a trillion-dollar 'communications card' industry that was arguably not necessary. Certainly not necessary to the degree it grew.
Interrupts should be prioritized by latency. Instead, the designer prioritized them by, I don't know, their most favorite to least favorite? E.g. the disk transfer completion interrupt was very high, even though it can literally tolerate infinite latency (nothing irrecoverable is going to happen if it's delayed indefinitely).
If I had a time-travel machine that only allowed a 1 centimeter window into the past, I would open a window to that original schematic and swap the disk and comm interrupt lines. That would have changed silicon valley history.
> Regarding the different collation algorithms: The unicode variants should be slower than the general variant due to the more complex algorithms, but this was years ago when computers were much weaker than now. With modern hardware, the speed gain should be hardly notable.
...
> Furthermore, we should prefer accuracy over speed. There is no need to use an old and quirky collation like utf8mb4_general_ci with modern hardware.
This is kind of oversimplified and depends a lot on your use case. Performance can absolutely matter even with modern hardware. There is no one-size-fits-all collation that is best in all scenarios, if there was, we really wouldn't need the option to choose.
If you have a thousand rows in your table, sure, use whatever. If you have a billion rows in your table, things are different and you may not be able to afford leaving performance on the table.
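At least the choice is cheap to express: collation can be set per column (or forced per comparison with COLLATE), so one hot, machine-generated column doesn't have to match the rest of the schema. A rough sketch with made-up names:

    CREATE TABLE events (
      -- fast legacy collation for a huge, machine-generated lookup column
      tag   VARCHAR(64)  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
      -- accurate Unicode collation where humans actually sort and search
      title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
    );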
I admit it may be a simple view, but I still believe it is the best choice for most of us.
However, if you are in a position where you have to handle such an extensive dataset, you probably know about this detail and can decide if this is a suitable option or not.
> One last concern is the size stored on the disk because UTF-8 supports four bytes per character. But since UTF-8 is a variable-width character encoding, it only uses as many bytes as needed. For example, the first 128 characters like ASCII only use one byte
I'm not sure this is true if you use CHAR instead of VARCHAR.
Moreover, if I'm not misremembering, indexes still store the full fixed-size 3- or 4-byte representation, and index size might matter.
However, I'll still use mb4 to avoid the fun times others have already described in the comments.
> For example utf8mb4_general_ci does not know how to sort s and the German character ß (“sharp S”) in contrast to utf8mb4_unicode_520_ci which sorts ß just fine
If I use utf8mb4_general_ci now, is it guaranteed to produce the same results on a given dataset each time, and to keep producing those same results under future versions of MariaDB?
Or is using it on data that contains characters for which it does not define a sort order like using undefined behavior in C, and it is free to produce different results from run to run and from version to version?
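On the quoted ß point, the documented difference is that the general collation folds ß to a single s while the Unicode collations (utf8mb4_unicode_ci and newer ones like the quoted _520_ci) expand it to ss. A quick way to see it, assuming a utf8mb4 connection:

    SELECT 'ß' = 's'  COLLATE utf8mb4_general_ci AS general_folds_to_s,    -- 1
           'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_expands_to_ss; -- 1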
I understand why this would affect collation and indices, but why can't I even store the four byte characters in an utf8mb3 table? The SQL parser and printer seems to deal with the characters fine, because it is independent from the table. I would assume it would just stuff the UTF-8 bytes in there as they are. Is it trying to validate them? Or is it using UTF-32 (or a truncated version) internally and saving everything as 3-4 bytes?
Does Postgres support case-insensitive collation on standard varchars? Last time I used it that was not an option, so it made everything a lot more complicated.
Postgres also did not support query hints, so tweaking performance was more complicated.
Both MySQL and Postgres performed much worse with complex queries than MSSQL, even after tweaking.