utf8 vs utf8mb4 bit me in one of the most frustrating bugs I've ever had to debug.
At Scribophile members can write critiques for people's writing, with comments inserted inline. The underlying software dates back to the PHP5 days when MySQL only had the utf8 option. Everything was working fine for years, when all of a sudden users started complaining that from time to time, they would submit a critique and it would be mysteriously cut off at random places.
The problem was very intermittent, didn't happen very often, and there was no seeming rhyme or reason to it. But when it did happen, it was catastrophic because members would lose hours of work, seemingly at random!
All kinds of testing scaffolding and logging was put in place to try to find the problem with no luck. Then, after quite some time, we realized what the problem was.
At that time, emoji keyboards were brand new; most phones/tablets had limited support, and people didn't yet use phones and tablets for everything like they do now. But they were out there. Some users who were using these new emoji keyboards were inserting emoji smiley faces as they were writing critiques. In Scribophile's web interface, everything looked fine; but when the user submitted the critique, MySQL tried to insert a multibyte Unicode character into a regular utf8 field, and SILENTLY threw that character and all of the data after it away!!
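To make it concrete, here's a minimal sketch of that failure mode (table and column names are made up): with the old non-strict sql_mode, MySQL keeps everything up to the first 4-byte character, throws the rest away, and only raises a warning.

    -- column declared with the old 3-byte "utf8" (utf8mb3) character set
    CREATE TABLE critiques (body TEXT CHARACTER SET utf8);
    SET SESSION sql_mode = '';   -- non-strict, the default back then
    INSERT INTO critiques VALUES ('Great pacing 😀 but the ending drags');
    -- Query OK, 1 warning: Incorrect string value: '\xF0\x9F\x98\x80 bu...' for column 'body'
    SELECT body FROM critiques;  -- returns only 'Great pacing '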
Boy were we upset at MySQL about that one. But at least we figured it out!
I had the pleasure of finding that same behaviour in an application around 2015, when someone commented on a food order: "Extra cheese please :folded hands: but no shrimp, I'm allergic."
But one thing I haven't found out is why some emojis did go through. The basic ones, like a simple :smile:, seemed to make it, from what I could tell. Could it be that some only need 3 bytes, and as more and more emojis got released, the new ones went into 4 bytes?
Edit: HN stripped the emojis from my comment... I would put a rolling eyes emoji here if I could.
Exactly that - three byte emojis work fine in utf8 but 4 byte ones need utf8mb4 - and the four byte ones are the new ones that support the "color variations" and other similar things.
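You can check the split directly in MySQL if you're curious; a rough sketch, assuming a utf8mb4 connection (LENGTH() counts bytes, CHAR_LENGTH() counts characters):

    SELECT LENGTH('☺'), CHAR_LENGTH('☺');    -- 3 bytes, 1 character (U+263A, in the BMP)
    SELECT LENGTH('😀'), CHAR_LENGTH('😀');   -- 4 bytes, 1 character (U+1F600, outside the BMP)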
Apple adding those probably caused more upgrades to utf8mb4 than any amount of pleading from languages that actually needed them ever would have.
Sounds like you’re looking at how many scalar values are in an extended grapheme cluster (e.g. U+1F635 U+200D U+1F4AB dizzy face + zero width joiner + dizzy symbol, that’s one “character” made up of three scalar values).
The thing in question here is UTF-8 code units (bytes), and how many UTF-8 needs to encode a single scalar value. UTF-8 needs four bytes for anything above U+FFFF, and almost all emoji are above that.
I've been struggling with fixing some edge cases with emojis in a latin1-encoded DB these past couple of days, and what you mention here is the most important thing I found yesterday (coincidence).
Unicode strings are made up of Unicode code points, but some code points combine with others to make graphemes (single visual characters). Some emojis, like flags (U+1F1EA U+1F1E6), are made of two code points.
Correction: Unicode strings are made up of scalar values, not code points. The difference is that accursed mess that UTF-16 foisted upon us, surrogates; U+D800–U+DFFF are code points, but are not valid scalar values and cannot occur in Unicode strings (even if things like JavaScript are stupid in allowing potentially-malformed UTF-16, which is incompatible with Unicode, and Python similarly stupid in having strings that are not Unicode but are instead sequences of code points).
To give a slightly more complete answer as to why, utf-8 was designed to be both backwards compatible with ASCII and self-synchronizing.
So you can't use half the values of a byte when encoding code points outside of ASCII, and the remaining 128 values need to be split between starting bytes and continuation bytes. At best you're going to fit about 6 bits into each byte, and unicode spans the equivalent of just over 20 bits, so you need 4 bytes max.
utf-8 opts to keep encoding very simple, and in 2-4 bytes it can store 11, 16, and 21 bits respectively.
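For reference, these are the standard UTF-8 byte templates, which is where those payload figures come from (the 5- and 6-byte forms were in the original spec but are invalid today):

    1 byte:  0xxxxxxx                                                7 bits
    2 bytes: 110xxxxx 10xxxxxx                                      11 bits
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx                             16 bits
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                    21 bits
    5 bytes: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx           26 bits (no longer valid)
    6 bytes: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  31 bits (no longer valid)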
If you discarded ASCII compatibility, you could create a 1-3 byte self-synchronizing unicode encoding. But at that point maybe you should be compressing your text instead.
Or you could weaken the compatibility and create "utf-1 but better" to fit into 3 bytes.
> Actually it does seem feasible to make an encoding that's backwards compatible with all printable ASCII characters, and uses 3 bytes for all other code points.
Sorry, you can’t squeeze Unicode in that way, even unsynchronised: you fall just over 2½ bits short. No, reserving the 95 printable ASCII characters scuttles your hopes, meaning you get less than 1½ bits of that byte to use for the rest of Unicode, when you needed a touch over 4 bits.
To represent all the rest: Unicode scalar values minus printable ASCII is U+0000–U+10FFFF minus U+D800–U+DFFF minus U+0020–U+007E: 1,111,969 possibilities, log₂(1111969) ≈ 20.085 bits of entropy. (Removing the 2048 surrogates and 95 printable ASCII characters barely made a dent, under 0.003 bits, but I had to remove it for informational correctness.)
95 printable ASCII characters, log₂(95) ≈ 6.57 bits. Subtract that reservation from the three bytes available and you’ve got about 17.43 bits left for the rest of Unicode, but you needed over 20.
[Edit: I observe you’ve removed what I was quoting before I posted my comment. Still fun mathematics.]
> If you discarded ASCII compatibility, you could create a 1-3 byte self-synchronizing unicode encoding.
Not quite. If you discarded ASCII compatibility, then you’d have almost four bits to spare, which is enough for a self-synchronising fixed-width 3-byte encoding (reserving the first bit of each byte to signal if it’s a start byte) or for an unsynchronised variable-width encoding of some kind; but to make it even 2-or-3 or 1-or-3 (rather than 1–3) bytes wide variable-width, you’d need four bits (3 + log₂ 2) for self-synchronisation, and you’re about 0.085 bits short. (1–3 bytes variable-width would require 3 + log₂ 3 ≈ 4.58 bits.)
> Sorry, you can’t squeeze Unicode in that way, even unsynchronised: you fall just over 2½ bits short. No, reserving the 95 printable ASCII characters scuttles your hopes, meaning you get less than 1½ bits of that byte to use for the rest of Unicode, when you needed a touch over 4 bits.
So it depends on exactly what we mean by compatible, and that's why I removed it.
But in particular, the encoding I had in mind when I wrote that was one that allows most ASCII characters as the middle byte of a triplet. That way whether you truncate from beginning or end you'll never have a rogue byte successfully decode.
So something like: Lead bytes from C0-FF, middle bytes from 00-BF, final bytes from 00-1F or 7F-BF.
If you also exclude null, tab, carriage return, and line feed, that gives you 64 x 188 x 93 = 1,118,976, which is just enough for every code point.
But arguably that's not compatible enough because you couldn't search for single ASCII characters with a dumb byte-wise algorithm.
> Not quite. If you discarded ASCII compatibility, then you’d have almost four bits to spare, which is enough for a self-synchronising fixed-width 3-byte encoding (reserving the first bit of each byte to signal if it’s a start byte) or for an unsynchronised variable-width encoding of some kind; but to make it even 2-or-3 or 1-or-3 (rather than 1–3) bytes wide variable-width, you’d need four bits (3 + log₂ 2) for self-synchronisation, and you’re about 0.085 bits short. (1–3 bytes variable-width would require 3 + log₂ 3 ≈ 4.58 bits.)
I don't know where you got that equation but it's not the right one for the situation.
Here's a really simple encoding just to disprove it: Leading bytes encode the plane (17 options) plus two more bits into decimal values 0 through 67. Continuation bytes encode 7 bits into decimal values 128 through 255. 17 planes + 2 bits + 7 bits + 7 bits, perfect fit. The remaining values, 68-127, can be used for single byte (and two byte) encodings.
But also the definition I was using for self-synchronizing doesn't seem to match the one on wikipedia. My intent was a code where you can guarantee sync after a specific small number of bytes (unlike UTF-1). In particular, if you let standalone bytes and continuation bytes overlap each other, you can still guarantee sync within 3 bytes. This gives you lots of space to play with, and you could make a very useful encoding. You could have >150 single byte codepoints and thousands of double byte codepoints and fit everything else into three bytes.
My (3 + log₂ 2) came from reckoning that you needed one whole bit as a signal in each byte (3), and that the byte-width thing (log₂ 2) was extra to that. As you correctly point out, this was incorrect: the first byte doesn’t need one full bit of entropy to signal that it’s the first byte.
I use “—⁂—” as my horizontal rule in plain text, via my Compose key (Compose+h+r).
HN removes things that are officially marked as emoji, normalises whitespace to space (which annoys me because I sometimes want to type things like narrow no-break space in equations which it makes wider and loses the no-breakyness of), and that’s all I’ve noticed. Control characters probably get some treatment or other too.
If 21 bits seems like an odd place to stop, it’s because the Unicode consortium said they wouldn’t allocate any codepoints beyond U+10FFFF (the end of plane 16), because that’s as far as UTF-16 could go using a four-byte surrogate pair (10+10 bits, offset by U+10000).
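Worked out: a surrogate pair carries 10 + 10 = 20 bits, i.e. 2²⁰ = 1,048,576 supplementary code points starting at U+10000, so the last addressable one is 0x10000 + 0xFFFFF = 0x10FFFF.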
Not that UTF-8 really stops at 21. As originally formulated it had 5- and 6-byte encodings for 26 and 31 bits, but the full 31-bit range of code points was later decided to be unnecessary, and those encodings were marked as invalid.
What I am curious about is: why couldn't they fix it?
At one point the change was made from 6 to 3 bytes; I expect this made the on-disk structure incompatible with prior versions. Why were they able to make that change but unable to make a later change, say from 3 to 4 bytes?
I am not as familiar with MySQL, but note that Postgres major versions are not disk-compatible with each other (you have to dump and restore). I expect this is the exact reason: you want a system where it is possible to make changes.
See also: the drama around Linux ABI changes vs OpenBSD ABI changes.
To meme off of Raul Julia, OpenBSD to Linux: "For you, the day your ABI changed was the most important day of your life. But for us, it was Tuesday."
VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case. But if these columns were silently/automatically converted to utf8mb4 upon upgrading MySQL, they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.
IIRC, there were additional problems in older versions of MySQL, situations where sort buffers were sized to a fixed length equal to the value's worst-case size or something like that. So sorting a large number of utf8mb4 values would use a lot more memory than utf8mb3 values (again, iirc, I might be wrong on this).
So the safer and more backwards-compatible approach was to introduce utf8mb4 as a new separate charset, and allow users to choose. MySQL 8 is now transitioning towards deprecating utf8mb3, and will finally make the utf8 alias point to utf8mb4 sometime in the near future.
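For anyone still carrying old columns around, the usual migration is a per-table conversion; a sketch, with a placeholder table name and collation (check your index key lengths first, and expect the table to be rebuilt):

    ALTER TABLE critiques
      CONVERT TO CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;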
That said, there are still a bunch of unpleasant uses of utf8mb3 internally in things like information_schema. I develop schema management tooling and recently lost a week to writing work-arounds for some of the more obscure ones, see https://github.com/skeema/skeema/commit/bf38edb :)
> they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.
Specifically, indexes are limited to 767 bytes (some say 768?), which means 191 utf8mb4 characters. So your otherwise-innocuous VARCHAR(255) primary key breaks. I’ve seen this multiple times, e.g. django-celery probably still has this: https://github.com/celery/django-celery/issues/259.
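It reproduces in one statement; a sketch with a hypothetical table, on the old 767-byte default:

    CREATE TABLE tasks (
      task_id VARCHAR(255) CHARACTER SET utf8mb4,
      PRIMARY KEY (task_id)
    );
    -- ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
    -- 255 chars x 4 bytes = 1020 > 767, whereas 255 x 3 = 765 squeaked in under utf8mb3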
fwiw the limit is effectively 3072 bytes nowadays, default since MySQL 5.7.
You could actually bump it up to 3072 in 5.5 and 5.6 too, but it required changing several settings (innodb_large_prefix, innodb_file_format, innodb_file_per_table, and the table's ROW_FORMAT).
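From memory, roughly this (all of it is the default, or gone entirely, in 5.7+; the table name is a placeholder):

    SET GLOBAL innodb_file_format = 'Barracuda';
    SET GLOBAL innodb_file_per_table = ON;
    SET GLOBAL innodb_large_prefix = ON;
    ALTER TABLE tasks ROW_FORMAT = DYNAMIC;   -- or COMPRESSED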
Cool, didn’t know that; thanks for the correction. I haven’t touched MySQL since 2016, and I suppose what I worked with then wasn’t 5.7, or was configured differently for historical reasons, or something.
What's the problem with allowing VARCHAR(N) store a max of 3N bytes worst-case, but still allow the 4-byte characters to be present in the string? Sure, it means that you can only store 0.75 * N emojis in a VARCHAR(N), but that seems less bad than the alternative of silently corrupting data.
4-byte characters are simply not valid in the utf8mb3 character set. The column type defines what values may be stored in the column, and the character set is inherently part of the column type for textual data.
So your question is similar to asking, what's the problem with allowing storage of negative numbers in an unsigned int column. The answer is those bytes inherently have a different meaning, or may be completely invalid, depending on the column type.
> that seems less bad than the alternative of silently corrupting data
MySQL has defaulted to a strict sql_mode out-of-the-box since MySQL 5.7, released over 6 years ago. With strict sql_mode, no silent corruption of data occurs. If you attempt to insert a 4-byte character into a utf8mb3 column, your insert fails with an error.
Everything older than 5.7 has hit end-of-life, so there is literally no supported version of MySQL which doesn't default to enabling strict mode.
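Concretely, with strict mode the kind of statement from the story upthread fails loudly instead of truncating (table and column names are hypothetical):

    SET SESSION sql_mode = 'STRICT_TRANS_TABLES';
    INSERT INTO critiques (body) VALUES ('Great pacing 😀 but the ending drags');
    -- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80 bu...' for column 'body' at row 1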
The only problem is that database-as-a-service providers use nonstandard defaults, so e.g. AWS RDS doesn't come with strict mode enabled, for whatever completely insane reason. But that's technically an Amazon problem, not a MySQL problem.
Yes, you're correct that utf8mb3 is strictly a subset of utf8mb4, and the binary value of any valid utf8mb3 character is identical to its utf8mb4 representation. But I'd argue that's irrelevant.
I meant my statement generically: the meaning of the bytes stored in a column value depend entirely on the column type, as does the range of values that may be stored or converted.
If a column's type is, say, VARCHAR(100) CHARACTER SET utf8mb3, this means the column can store up to 100 utf8 characters and each character has a maximum length of 3 bytes. It is simply not legal to store 4-byte characters in this data type, just as it isn't legal to store 101 characters in it.
The proposal in this subthread seems to be, instead of introducing utf8mb4, why didn't MySQL allow "utf8" (utf8mb3) columns to store 4-byte characters, just without affecting the maximum number of bytes that can be stored in the column? So, a VARCHAR(100) CHARACTER SET utf8 might only be able to store max 75 characters if they're all emoji. Additionally, it would mean that the range of values allowed in a column would depend on the DB server version, and would change upon upgrading. This just seems completely counterintuitive and wrong to me, relative to how all other column type definitions work. The modern progression of MySQL has been to remove these obvious-WTF problems, not introduce more of them :)
The proposal was described as "seems less bad than the alternative of silently corrupting data", but that isn't the alternative (silent data corruption doesn't occur with strict mode which is enabled by default), so this comparison doesn't make sense.
Everything that decodes that utf-8 has to support 21-bit codepoints. If they used a uint16_t anywhere in the processing chain then it's restricted to the BMP.
You are correct, it is a disk-format issue, and MySQL officially supports in-place upgrade between versions.
It's actually quite hard to fix bugs in charset/collations, because any changes to the sort order implicitly could affect the on-disk format (indexes are sorted).
The 2002 diff got me interested in the context of why that change was made, which led me to this 2006 bug which I think might shed light on why '3' was chosen:
"The UTF-8 encoding is a variable byte length encoding. The ASCII subrange of characters is one byte long, alphabetic writings 2 bytes long, Far Eastern writings 3 bytes long per character (and some rare historical writings are mapped using 4 bytes)."
UTF-8 is still something I'm learning the details about, but it looks like back in 2002 they simply didn't see the need for the supplementary plane characters, i.e., playing cards, arrows, or ancient Greek numbers. It might be short-sighted, but we shouldn't judge from so far in the future, recalling that emoji weren't even in Unicode until 2010.
If it were called just "string" or something, with a quiet explanation somewhere that it's actually only the BMP and not all of Unicode, that wouldn't be so surprising; but they deliberately called it UTF-8 even though it just isn't.
Imagine if you bought a USB mouse and then you discovered, oh, they didn't bother implementing USB, this is just a PS/2 mouse with the USB connector. Works fine so long as you use the supplied PS/2 "adaptor" but it isn't actually USB even though it says it is on the box. That mouse is defective. MySQL is defective as supplied. Which won't come as a surprise to many experienced MySQL developers.
I actually rather like MySQL, but it's like bash or Excel, it is a good tool for quick estimates or prototyping, and a poor tool for actual engineering. For the cost of all the work needed to get reliable high quality results with MySQL you could have used a different database where the baseline effort to get it working was higher but what you get is more solid, and "utf8mb4" is an example of that extra work in MySQL.
While searching yesterday, I recall the MySQL documentation did mention this quirk.
As I gather, this change happened prior to or at the change to the 4-byte max for UTF-8; was supplementary plane support required at that time? 3-byte sequences are still valid UTF-8, just not the full range, but I really do not know.
Not to defend them, just saying things were new and fluid back then, though it should have been fixed properly a lot sooner.
In modern MySQL the documentation does - if you look - tell you that utf8mb4 exists to actually do UTF-8, and if you think to check, it mentions that their utf8 is not in fact UTF-8 but only "utf8mb3".
But fifteen years ago most of that documentation doesn't exist, and utf8mb4 doesn't exist, and even if you read the MySQL documentation carefully and realise "utf8" isn't UTF-8 and only encodes the Basic Multilingual Plane, you are stuck anyway: MySQL doesn't have a sensible way forward for you.
I would argue that by 2002 when this bug (let's call it what it is) was introduced, the writing was on the wall, UCS-2 is dead, which means the "it's just easier" justification for choosing 16-bit is also dead, so UTF-8 is the natural choice for software that isn't meanwhile committed to a 16-bit encoding anyway (as Java and Windows are). That makes shipping UTF-8 a good idea, but it makes choosing utf8mb3 a really weird idea that already seems certain to cause pain.
Could be. And it's weird, especially when supporting all 4 bytes would (seemingly) have been easy. For fun I did dig around though, and found that even in the earliest documentation they clearly state only supporting 3 bytes, though not why. The 2005 docs say this, which only justifies not supporting 5/6-byte sequences, but not four. So confusing:
"RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL UTF8 support does not include four-byte sequences. (An older standard for UTF8 encoding is given by RFC 2279, which describes UTF8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)" ...Wha?
I personally think anything critical that was worth storing in a new character format still should have been properly tested and the documentation consulted; that is on the end user. Anything not critical where you wouldn't expect to care... still should have been tested, but yeah, that sucks.
What we can judge harshly is that it accepted the 4-byte sequences and failed silently, corrupting data.
A sane system would reject the 4-byte sequences so you could try again without the emojis.
But if users got errors, they might investigate, and it would slow down the rapid growth MySQL was aiming for. So they just decided not to throw errors.
I agree; although the change was prior to 4-byte UTF-8 (2002 vs 2003), the same issue would have manifested with 6-byte sequences. Very curious.
It would be really interesting to see a postmortem by MySQL if ever possible, because the more I think about it, 3 is such an odd choice (pun intended), and people usually do most things for a reason haha. It's either lost to time, or a brainfart off-by-one.
UTF-8 used to support up to 6 bytes to encode a unicode scalar value. Only later was the unicode range restricted such that a max of 4 bytes was required to encode any scalar value using UTF-8.
It looks like when UTF-8 was originally added to MySQL it still supported up to 6 bytes per "character", so the decision to restrict it was actually made by both MySQL and RFC 3629 - unfortunately they chose different limits :(
What happened is Unicode committed forever to a range zero to U+10FFFF and RFC 3629 follows that. You don't need 5 or 6 byte UTF-8 to express this range, and so they're forbidden.
So RFC 3629 isn't choosing arbitrarily, the choice in MySQL isn't entirely arbitrary either, it's just wrong (and should not have been named "utf8").
You could use the 3-byte UTF-8-like encoding MySQL has to do CESU-8 but that's not UTF-8. If MySQL wanted to do CESU-8 they could have called the type cesu8, giving away what's actually going on and necessitating a lot of extra work by MySQL. They did not choose to do that.
Right, but my point was that the "U+10FFFF" limit hadn't been decided yet at the point where MySQL first added the "UTF-8" encoding, and it was added later, so it may not have been clear to them that UTF-8 would ever be a "fixed target" so to speak.
In retrospect we can see that UTF-8 has been and will continue to be static, since it's become so universally adopted.
Speaking of expensive screwups, one of the most expensive schematic errors ever was the routing of the interrupts on the original IBM PC. It put the serial interrupt on the lowest-priority interrupt pin, instead of the highest. This spawned a trillion-dollar 'communications card' industry that was arguably not necessary. Certainly not necessary to the degree it grew.
Interrupts should be prioritized by latency. Instead, the designer prioritized them by, I don't know, their most favorite to least favorite? E.g. the disk transfer completion interrupt was very high, even though it can literally tolerate infinite latency (nothing irrecoverable is going to happen if it's delayed indefinitely).
If I had a time-travel machine that only allowed a 1 centimeter window into the past, I would open a window to that original schematic and swap the disk and comm interrupt lines. That would have changed silicon valley history.
> Regarding the different collation algorithms: The unicode variants should be slower than the general variant due to the more complex algorithms, but this was years ago when computers were much weaker than now. With modern hardware, the speed gain should be hardly notable.
...
> Furthermore, we should prefer accuracy over speed. There is no need to use an old and quirky collation like utf8mb4_general_ci with modern hardware.
This is kind of oversimplified and depends a lot on your use case. Performance can absolutely matter even with modern hardware. There is no one-size-fits-all collation that is best in all scenarios, if there was, we really wouldn't need the option to choose.
If you have a thousand rows in your table, sure, use whatever. If you have a billion rows in your table, things are different and you may not be able to afford leaving performance on the table.
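At least the choice is cheap to express: collation can be set per column (or forced per comparison with COLLATE), so one hot, machine-generated column doesn't have to match the rest of the schema. A rough sketch with made-up names:

    CREATE TABLE events (
      -- fast legacy collation for a huge, machine-generated lookup column
      tag   VARCHAR(64)  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
      -- accurate Unicode collation where humans actually sort and search
      title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
    );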
I admit it may be a simple view, but I still believe it is the best choice for most of us.
However, if you are in a position where you have to handle such an extensive dataset, you probably know about this detail and can decide if this is a suitable option or not.
> One last concern is the size stored on the disk because UTF-8 supports four bytes per character. But since UTF-8 is a variable-width character encoding, it only uses as many bytes as needed. For example, the first 128 characters like ASCII only use one byte
I'm not sure this is true if you use CHAR instead of VARCHAR.
Moreover, if I'm not misremembering, indexes still store the full fixed-size 3- or 4-byte representation, and index size might matter.
However, I'll still use mb4 to avoid the fun times others have already described in the comments.
> For example utf8mb4_general_ci does not know how to sort s and the German character ß (“sharp S”) in contrast to utf8mb4_unicode_520_ci which sorts ß just fine
If I use utf8mb4_general_ci now, is it guaranteed to produce the same results on a given dataset each time, and to keep producing those same results under future versions of MariaDB?
Or is using it on data that contains characters for which it does not define a sort order like using undefined behavior in C, and it is free to produce different results from run to run and from version to version?
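On the quoted ß point, the documented difference is that the general collation folds ß to a single s while the Unicode collations (utf8mb4_unicode_ci and newer ones like the quoted _520_ci) expand it to ss. A quick way to see it, assuming a utf8mb4 connection:

    SELECT 'ß' = 's'  COLLATE utf8mb4_general_ci AS general_folds_to_s,    -- 1
           'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_expands_to_ss; -- 1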
I understand why this would affect collation and indices, but why can't I even store the four byte characters in an utf8mb3 table? The SQL parser and printer seems to deal with the characters fine, because it is independent from the table. I would assume it would just stuff the UTF-8 bytes in there as they are. Is it trying to validate them? Or is it using UTF-32 (or a truncated version) internally and saving everything as 3-4 bytes?
Does Postgres support case-insensitive collation on standard varchars? Last time I used it that was not an option, so it made everything a lot more complicated.
Postgres also did not support query hints, so tweaking performance was more complicated.
Both MySQL and Postgres performed much worse with complex queries than MSSQL, even after tweaking.