Why does HTML think “chucknorris” is a color?

dzdt · on April 15, 2017

Fun quiz. Go to a page like http://codepen.io/tholman/pen/EwlKd. Try the following colors. Without looking it up, which ones are real named colors and which are artifacts of the bizarre scheme?

"fire" "campfire" "firebrick" "firefly"

"cornflower" "cornflowerblue" "cornflowerred"

"seablue" "skyblue" "seagreen" "seafoam"

Grue3 · on April 15, 2017

Never heard of this before. The described algorithm for parsing color codes seems absolutely bizarre. What could've possibly been a rationale for it?

scott00 · on April 15, 2017

Maybe we'll get some old hands to tell the real story, but here's my guess: early web stuff all tended to be very permissive of its inputs, the idea being that showing a user some weirdly formatted, half-right page was better than showing them "html parse error" and nothing else. I suspect the guy who wrote the color parser wanted to do something other than fail on invalid color strings and just coded up the first thing he thought of to handle the invalid input cases.

yuhong · on April 15, 2017

The original algorithm dates back to something like Netscape 1.1 I think. I think they added color names in Netscape 2.

ubernostrum · on April 15, 2017

HTML was originally formalized on top of SGML, and in a way which was permissive (SGML allowed this, because it was supposed to be a flexible generic way to specify markup languages). This meant, for example, that you could omit closing tags for many elements, and something that processed HTML was supposed to know the rules for when and where to infer the existence of a closing tag. There were also whole elements which could simply be omitted -- if no tags indicated where they were or what they contained, they too could be inferred.

The permissiveness had a drawback, though, which was that many people didn't learn correctly what could and couldn't be omitted, or what was and wasn't required. And so badly-formed HTML dominated the Web. Browsers had to bend over backwards to make sense of this and display something, since at the time there was heated competition in the browser space: if Netscape just bailed out with a parsing error and IE didn't, people would use IE because they'd blame the failure on Netscape instead of on the HTML (same thing for IE -- it couldn't bail out with an error because Netscape might look better in comparison by rendering the page).

XHTML -- HTML reformulated on XML instead of SGML -- tried to bring XML's strict well-formedness handling to HTML, but turned out to be an utter failure. See Evan Goer's classic "XHTML 100" for an example of how even experts routinely could not do XHTML properly and had to be saved by browsers which never really applied the strictness correctly:

http://goer.org/Journal/2003/04/the_xhtml_100.html

There was an attempt at XHTML2, but it collapsed under its own ambition.

Enter HTML5, which began as a response to the perceived stagnation of W3C in iterating the HTML standard (since much of their effort was being expended off in markup-astronaut space on XHTML2). HTML5 introduced a bunch of useful things people had been asking for for a while, but also took the approach of codifying how to parse HTML. And not just valid HTML, but anything claiming to be HTML. If you run the HTML5 algorithms, you can get a parse tree out of damn near any kind of junk, and that process is now standardized. In pragmatic fashion, it mostly codified the hacks and workarounds browsers had already come up with.

And that's how we get the HTML5 legacy color parsing algorithm, which can turn almost any junk you stick in a declaration into an RGB color value. Most of what it's doing is trying to filter out obvious cases (matches for named colors and certain hex specifications) early, then figure out a way to turn random junk consistently into a hexadecimal number.

If you're interested, I maintain a Python library for working with HTML/CSS color values, and an implementation of the legacy color parsing algorithm with directives from the spec interspersed as comments so you can step through it and see what's going on:

https://github.com/ubernostrum/webcolors/blob/master/webcolo...

Its documentation also covers the history of how colors are specified in HTML/CSS:

http://webcolors.readthedocs.io/en/1.7/colors.html

But a simple description of the way it works is:

1. Bail out if the value is not something the algorithm can work with. Non-Unicode inputs can't be parsed into a color, and an empty string can't be parsed into a color.

2. Strip leading/trailing whitespace, and look to see if the input matches a CSS named color, or the three-hexadecimal-digit shortcut format of CSS2, shortcutting to parsing via those rules if possible.

3. Run the junk normalizer. This is where it looks complicated, but mostly it's just finding anything that can't possibly be part of a hexadecimal number and replacing it with one or more zeroes, then doing some padding and truncation to get the end result to come out as a hexadecimal string specifying a 24-bit integer. Then that 24-bit integer is the color value.
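
To make steps 1 and 2 concrete, here is a minimal Python sketch of the early exits. It is not the webcolors code: `try_shortcuts`, `NAMED_COLORS` and `SHORT_HEX` are hypothetical names, the color table is a tiny illustrative subset of the real named-color list, and the whitespace set is my reading of what the spec means by ASCII whitespace.

```python
import re

# Tiny illustrative subset of the CSS named colors (assumption: the real
# table has well over a hundred entries).
NAMED_COLORS = {
    "red": (255, 0, 0),
    "gray": (128, 128, 128),
    "grey": (128, 128, 128),
    "skyblue": (135, 206, 235),
    "firebrick": (178, 34, 34),
}

# '#' followed by exactly three hexadecimal digits, e.g. '#f80'.
SHORT_HEX = re.compile(r"#[0-9a-fA-F]{3}")


def try_shortcuts(value):
    """Return an (r, g, b) tuple if an early exit applies, else None."""
    # Step 1: bail out on inputs the algorithm cannot work with.
    if not isinstance(value, str) or value == "":
        raise ValueError("cannot parse a color from this input")

    # Step 2: strip surrounding whitespace, then try the two shortcuts.
    value = value.strip(" \t\n\f\r")
    named = NAMED_COLORS.get(value.lower())
    if named is not None:
        return named
    if SHORT_HEX.fullmatch(value):
        # '#abc' stands for '#aabbcc'; doubling a hex digit multiplies it by 17.
        return tuple(int(ch, 16) * 17 for ch in value[1:])

    # No shortcut applies: the caller falls through to the junk normalizer.
    return None
```

With this sketch, `try_shortcuts("  GREY ")` and `try_shortcuts("#f80")` both resolve immediately, while `try_shortcuts("chucknorris")` returns `None` and would go on to step 3.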

ubernostrum · on April 16, 2017

Just for sake of completeness, a walk through the parsing step-by-step shows nothing actually changes until step 10, when the algorithm does its second character-replacement pass, replacing characters that aren't hexadecimal digits (they get replaced with zeroes; the first pass did this to any characters outside of Unicode's Basic Multilingual Plane).

At that point 'chucknorris' becomes 'c00c0000000'.

In step 11, the value is padded with zeroes until its length is a non-zero multiple of three. At that point it becomes 'c00c00000000'.

In step 12, the value is split into three equal-length sub-values which will become the red, green and blue components of the color value. They are: (c00c, 0000, 0000).

Steps 13 and 14 then attempt to truncate those values, first by reducing them to eight characters in length and then removing leading zeroes (but only if all three sub-values have a leading zero). These make no changes for the sub-values we've arrived at.

Step 15 performs the final truncation: if the sub-values are still of length greater than two, all but the first two characters are removed. At this point the sub-values are: (c0, 00, 00).

Finally, steps 16-20 convert these from hexadecimal to decimal, and return the value: (192, 0, 0).
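
The walkthrough above can be condensed into a short Python sketch of the junk normalizer. This is a simplified illustration, not the webcolors implementation: `normalize_junk` is a hypothetical name, the step numbers in the comments follow the numbering used in this comment, and the 128-character cap and leading-'#' removal are included from my reading of the spec rather than from anything stated here.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def normalize_junk(value):
    """Turn an arbitrary string into an (r, g, b) tuple of ints."""
    # First replacement pass: characters outside the Basic Multilingual
    # Plane become '00' (assumption: two zeroes, per my reading of the spec).
    value = "".join("00" if ord(ch) > 0xFFFF else ch for ch in value)

    # Assumption from the spec, not from the walkthrough: cap the length
    # at 128 characters and drop one leading '#'.
    value = value[:128]
    if value.startswith("#"):
        value = value[1:]

    # Step 10: anything that is not a hexadecimal digit becomes '0'.
    value = "".join(ch if ch in HEX_DIGITS else "0" for ch in value)

    # Step 11: pad with zeroes to a non-zero multiple of three.
    while len(value) == 0 or len(value) % 3 != 0:
        value += "0"

    # Step 12: split into three equal-length components.
    n = len(value) // 3
    parts = [value[:n], value[n:2 * n], value[2 * n:]]

    # Step 13: keep only the last eight characters of each component.
    if n > 8:
        parts = [p[-8:] for p in parts]
        n = 8

    # Step 14: drop leading zeroes, but only while all three have one.
    while n > 2 and all(p[0] == "0" for p in parts):
        parts = [p[1:] for p in parts]
        n -= 1

    # Step 15: keep only the first two characters of each component.
    if n > 2:
        parts = [p[:2] for p in parts]

    # Steps 16-20: read each component as a hexadecimal number.
    return tuple(int(p, 16) for p in parts)
```

Calling `normalize_junk("chucknorris")` follows the same path as the walkthrough: 'c00c0000000', then 'c00c00000000', then (c00c, 0000, 0000), then (c0, 00, 00), giving (192, 0, 0).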

yuhong · on April 16, 2017

No browsers ever actually parsed HTML using SGML. It is still possible to edit current HTML with SGML editors with some limitations.

ubernostrum · on April 16, 2017

Whether anybody parsed it that way is irrelevant; the point is that a lot of the inherent sloppiness came from the SGML roots.

re · on April 15, 2017

Back when HTML only had a handful of basic color names ("red", "green", "gray", etc.), I recall there being a browser that, if you tried to use the spelling "grey" would show the color green instead, since that was what it matched based on the first three letters of the word. The question reminded me of that, but I can't find any sources for it now--I guess it's a 20-year-old problem at this point.

ubernostrum · on April 16, 2017

As of CSS3 there's no workaround needed; CSS3 adopted SVG's named-color palette, which defines both "gray" and "grey" and has both of them become rgb(128, 128, 128).

The color-parsing algorithm which turns "chucknorris" into a color also will return rgb(128, 128, 128) for either spelling, since it specifies that any value which is a case-insensitive match for a CSS3 named color should just short-circuit resolve to that color.

mathias · on April 16, 2017

Wondering what a given legacy HTML color value (as seen in `bgcolor`, `text`, `link`, `vlink`, and `alink` attribute values) looks like? This tool (which I made for a presentation years ago) shows you: https://mothereff.in/bgcolor#mathiasbynens

aaron695 · on April 16, 2017

Chuck Norris supports creationism in schools and is not big on the gays mixing in society.

It's fine to love his movies, politics don't affect art in that regard, but if you think he's cool and worthy of a meme, I guess fair enough, as long as you're aware of his politics and consider it to not affect memes should you disagree.