Why the #AskObama tweet was garbled on screen (hanselman.com)
322 points by brianwillis on July 7, 2011 | 76 comments



"This is SUCH a classic sloppy programmer mistake that I'm disappointed"

Oh, come off it. This happens everywhere on the web, on probably something like 25% of websites. And it's NOT always the consuming program's fault: very often somebody upstream (e.g. the hosting company, the person who wrote the HTML, the source of an RSS feed being inserted into the page) forgot to encode something the way somebody else expected, and you, as the poor guy at the end of the chain, get a document with multiple encodings improperly embedded in it. Inevitably you have to make some bad decisions, and not all corner cases get handled.

Somebody once reverse-engineered the state chart for how Internet Explorer handles documents with conflicting encoding declarations and I kid you not, it must have had >20 branches spanning a good few pages. Officially, the correct order of precedence is (http://www.w3.org/International/questions/qa-html-encoding-d...):

1. HTTP Content-Type header

2. byte-order mark (BOM)

3. XML declaration

4. meta element

5. link charset attribute

but that's not how every browser does it, because the W3C sort of declared that after things on the Real Internet (TM) had already gotten out of hand. I hate to resuscitate Joel posts but Unicode is not easy to implement right.
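
To make the precedence list concrete, here's a minimal Python sketch of it (purely illustrative; "headers" and "body" stand in for an HTTP response's headers and raw bytes, and real browsers, as noted, do something much hairier):

  import re

  def sniff_encoding(headers, body):
      # 1. HTTP Content-Type header
      m = re.search(r'charset=([\w-]+)', headers.get('Content-Type', ''))
      if m:
          return m.group(1)
      # 2. Byte-order mark (BOM)
      if body.startswith('\xef\xbb\xbf'):
          return 'utf-8'
      if body.startswith('\xff\xfe') or body.startswith('\xfe\xff'):
          return 'utf-16'
      # 3. XML declaration / 4. meta element, scanned in the first bytes
      m = re.search(r'encoding="([\w-]+)"|charset=["\']?([\w-]+)',
                    body[:1024])
      if m:
          return m.group(1) or m.group(2)
      # 5. link charset attribute would go here; otherwise fall back
      # to a default (ISO-8859-1 per HTTP/1.1)
      return 'iso-8859-1'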


Twitter said that the message in question is UTF-8. The message recipient decoded the message with something that is not UTF-8.

That's a sloppy mistake and should without doubt have been caught in testing.


The company responsible seems to have responded in the comments:

"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag." [emphasis mine]
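
The fix really is that small. A minimal sketch of "sending the right header" with Python's built-in wsgiref (illustrative only, obviously not their stack):

  from wsgiref.simple_server import make_server

  def app(environ, start_response):
      # Declaring the charset here keeps the browser from falling
      # back to ISO-8859-1 when it hits the UTF-8 bytes below.
      start_response('200 OK',
                     [('Content-Type', 'text/html; charset=utf-8')])
      return ['<p>Here\xe2\x80\x99s a question for #AskObama</p>']

  make_server('', 8000, app).serve_forever()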


Was email involved? We've been getting so many support questions regarding encodings of HTML emails (we're a programmatic email service: http://mailgun.net) that we actually decided to become the first MTA to enforce UTF-8 transcoding on all traffic: randomness in -> clean UTF-8 out, and everyone is suddenly quiet & happy.

Something like this could be done as an nginx/Apache module that detects the encoding of the data and transcodes the HTML output into UTF-8 - could be useful for some cases.
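
A rough sketch of what such a filter's core could look like, using the chardet library to guess the source encoding (detection is heuristic, so this can guess wrong; names here are illustrative):

  import chardet

  def to_utf8(raw):
      # chardet.detect returns e.g. {'encoding': 'windows-1252',
      # 'confidence': 0.87}; fall back to UTF-8 if it gives up.
      guess = chardet.detect(raw)
      text = raw.decode(guess['encoding'] or 'utf-8', 'replace')
      return text.encode('utf-8')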


Including <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> is like the first thing you do when starting front-end development.

Every time you don't validate...God kills a kitten. Please, think of the kittens.

This has been a Public Service Announcement. Please code responsibly: http://validator.w3.org/


Just for the record, the correct spelling is now:

  <meta charset="utf-8">


How do you (or the OP) know the implementation details of the consumption path for display? The author assumes it was the direct JSON feed from Twitter. That's overly simplistic and betrays his inability to understand how these sorts of things are put together. This is a public, real-time event; more likely than not, tweets were filtered through one or two other components to separate the interesting material from the noise. You have no idea what those intermediary components were, who set them up, or whether they were all decoding/re-encoding correctly. Perhaps they were going through other Twitter users' clients, which could have contributed to the bug.

Edit: yes, there were definitely people curating the questions, so it could not have been as simple as pulling one JSON feed. http://www.theatlantic.com/politics/archive/2011/07/how-obam... https://twitter.com/#!/townhall/july-6-curators


> You have no idea what those intermediary components were, who set them up, and if they were all decoding/reencoding correctly.

...That's kind of the point. There was someone being sloppy there.


Yup, it can also be much more serious than some garbled text. Google "canonicalization vulnerabilities" and see stuff like this: http://www.amarjit.info/2009/09/canonicalization-vulnerabili...


Actually, the correct order is defined in the HTML5 spec now: http://dev.w3.org/html5/spec/parsing.html#determining-the-ch...


Really, it's not that hard. 99% of the job is done for you. Just use some UTF-8 encoded text to test your app before letting the president of the United States use it.

There is no excuse for not testing your app for basic i18n brokenness. It's 2011, not 1998.


Maybe I'm just a terrible programmer, but I think the author may be overemphasizing the seriousness of the bug. To me, this is one of those throwaway issues that you keep in the back of your mind. Unless I'm coding something that is extremely data-centric, where this is critical, it's not like it's on one of my "top 20 must-run tests" or anything. It's always one of those issues I ignore or assume is correct until I find out it isn't. When I find out, it's simply a matter of tossing in a code page translation at either the input or output end and I'm done with it.

Or maybe I've just been fortunate enough to be in an environment where an occasional goof of this caliber doesn't have any serious consequences.


> Or maybe I've just been fortunate enough to be in an environment where an occasional goof of this caliber doesn't have any serious consequences.

Primarily, this means you don't have to support internationalization - which is hardly a bad thing, especially if you work on a startup, where worldwide distribution should be the last thing on your mind. When your product is rendered in over 80 scripts, including right-to-left languages, you can't afford to figure encoding will sort itself later.


And Hanselman works for Microsoft, where i18n is a big deal. So yes, for someone who's been at MS for a while i18n related issues become second nature. But if you typically are only targeting the United States, it's more understandable to not have these things on the brain.


On the other hand, Outlook still has a ridiculous bug where it uses the wrong encoding when presenting HTML email - that is, it uses the encoding of the email's text-body when presenting the html-body, even if the html-body specifies a different encoding.

So, if the two bodies have differing encodings (charsets), then the HTML body will look wrong. Unless you force Outlook to always use UTF-8 for encoding emails (which is a setting, but not the default) then you'll end up sending emails that will look garbled to your recipient.

This "differing charset" scenario actually happens pretty frequently, because of the following scenario:

a) You write an email (or reply to an existing email - actually it happens most with replies).

b) Outlook's text editor decides to insert a non-breaking space (codepoint U+00A0). Perhaps it generates HTML with &nbsp; but before transmission this eventually turns into the single character U+00A0, which UTF-8 encodes as the two bytes 0xC2 0xA0.

c) When generating the text-body, Outlook decides to just use a plain old space, so the text body is plain ASCII.

d) Outlook, in its cleverness, then says "ooh, I can 'conserve' encoding-ness and use plain old iso-8859-1 for the text body, but I need to use UTF-8 for the HTML body because of that non-ascii character"

e) Outlook generates this email:

  Content-Type: multipart/alternative; boundary="0016e64dbd929784310488b2b082"

  This is a multi-part message in MIME format.

  --0016e64dbd929784310488b2b082
  Content-Transfer-Encoding: 7bit
  Content-Type: text/plain; charset="ISO-8859-1"

  yo yo

  --0016e64dbd929784310488b2b082
  Content-Type: text/html; charset="UTF-8"
  Content-Transfer-Encoding: quoted-printable

  <html> <body>
  yo=C2=A0yo
  </body> </html>

  --0016e64dbd929784310488b2b082--

When you view the above email in Outlook, you see "yoÂ yo" instead of "yo yo".
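
For the curious, the stray character falls straight out of the byte math; decoding U+00A0's two UTF-8 bytes under a single-byte charset yields two characters:

  $ python
  >>> '\xc2\xa0'.decode('iso-8859-1')
  u'\xc2\xa0'
  >>> # that's 'Â' (U+00C2) followed by a non-breaking space (U+00A0)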


You should file this at connect.microsoft.com. MS devs and PMs really do read and triage bug reports coming from there. The more details you put in the bug report, the better.


You're not a terrible programmer, just an uninformed and/or slightly Anglocentric one.

Just last week, my co-worker had to waste two days debugging a two-year-old Sphinx setup (the person who implemented it no longer works with us) because a Japanese user of our blogging service wasn't seeing the post he was looking for in our search feature. The problem was that the conduit feeding the Sphinx indexer was handling Unicode incorrectly (to be specific, it was deleting certain bytes wholesale because this guy believed them never to be valid, and this broke multi-byte sequences horribly). Those two days would have been much better spent working on his current project.
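
That failure mode is trivial to reproduce: drop a single byte from the middle of a multi-byte sequence and the rest of the string becomes undecodable (a sketch, not our actual conduit code):

  >>> data = u'\u65e5\u672c'.encode('utf-8')    # the two kanji in "Japan"
  >>> data
  '\xe6\x97\xa5\xe6\x9c\xac'
  >>> data.replace('\x97', '').decode('utf-8')  # "clean up" one byte
  Traceback (most recent call last):
    ...
  UnicodeDecodeError: ...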

Additionally, not handling Unicode correctly can leave you open to certain types of security holes!

The biggest of these is probably that Unicode means _a string is not an array of bytes_, so naïve allocators for languages with byte-array strings (read: C and its brethren) are susceptible to buffer overruns when handed multi-byte Unicode sequences while expecting ASCII.

Here's another one: the Unicode character space contains many, many glyphs that look almost (or in some cases exactly) like ASCII characters. RFC 3492 defines Punycode, the encoding that lets internationalized domain names (IDN) represent non-ASCII code points using ASCII characters, and the most common implementations of this transparently go between the two. This means that you could register "bаnkofamerica.com" (actually xn--bnkofamerica-x9j.com) and put a phishing site there, and people would happily click on the identical-looking URL and give you their bank account. This was pointed out several years ago and most modern browsers have mechanisms in place to defend against it, but your custom application might not unless you're careful to check what you're doing.
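
You can watch the mechanics with Python's built-in punycode codec (the 'а' below is Cyrillic U+0430, not a Latin 'a'):

  >>> u'b\u0430nkofamerica'.encode('punycode')
  'bnkofamerica-x9j'
  >>> # i.e. the xn--bnkofamerica-x9j label mentioned above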

There are plenty of blog posts and articles out there designed to tell you how to be safe when dealing with Unicode (and you should assume that you will be). I highly suggest you go read one.


Wrong. By neglecting to properly process encoding, you shut your application off from everyone except the English-speaking population of the world (and even so, with bugs like this one).

Such mistakes were excusable in the 90s. They're the sign of an amateur programmer today.


You're a fine programmer, but you're terrible at UX. I can put up with this sort of thing in a command-line application, but in any other context it's at least as distracting as non-standard GUI widgets or broken color management. Although these are cosmetic rather than functional issues, they're terribly distracting ones.


I wouldn't take it seriously. The article is written very tongue in cheek. If you follow Hanselman you'll see he's a funny guy.


'So do political nerds get to moan because the author referred to Boehner as "the Senator"?'

https://twitter.com/douglas/status/89080894018686976


Touchée.


Are you female and awesome, or was that just a typo?


I am neither female nor an awesome typist. Actually, I failed to remember the alt-key combination for the é, and just pasted it in from charmap instead. I wonder why my cellphone has more advanced text input facilities than my desktop. I know there are 3rd-party utilities for bringing up special characters if you hold down a key and so on, but you'd think this sort of thing would have been standardized by now.


I can confirm that anigbrowl is not a female.


Ah, just a typo then.


Hey I work for the company responsible for the visualization behind the president and the content on http://askobama.twitter.com

Let me take this very excellent opportunity to say that we are looking to hire a full time "front end" developer. You'll get to work on badass projects like the Obama Town Hall. Ideally, you'd be located in Austin. Find me on Twitter @efalcao to learn more.


FWIW, this was an intense project to pull off. Thousands of tweets per minute from Twitter, 8000 requests per second on http://askobama.twitter.com (where the same tweet was also delivered by us and rendered correctly).

We're not lazy or sloppy... It basically boiled down to one server sending down the right header while the production one didn't.

Unicode issues are sorta in the class of "gotcha" issues. They happen, you go "oh shit" and fix them right away. Our "oh shit" moment just happened to come at the most intense possible moment...in front of the president, with so many watching.

Wanted to reiterate once again: We're Hiring! @efalcao on twitter. Early stage startup looking for exceptional talent.


I think you did a great job, way to represent the Austin Tech scene!


Thanks so much!


tl;dr:

  $ python
  >>> print u"\u2019".encode("utf-8").decode("Windows-1252")
  â€™


Also interesting, the Unicode Nazi:

  http://pypi.python.org/pypi/unicode-nazi


Did they see the encoding mistake before they showed it?

Because I wonder how difficult it would be to create a string that says something innocuous in UTF-8 (e.g., "When will you bring the troops home #AskObama") but in ASCII would read as something totally different, but legible (e.g., "the secret priests would take great Cthulhu from his tomb to revive His subjects and resume his rule of earth...")


I imagine quite difficult, as each triplet of characters in the hidden message would have to be the Windows-1252 reading of a single Unicode letter's UTF-8 bytes, and those triplets would have to actually form words. You'd be restricted to maybe 30 triplets.


If you are talking about ASCII (7-bit ASCII, not extended ASCII), then it can't happen - the encodings are identical for 0-127.
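
A quick check that all 128 values decode identically:

  >>> all(chr(i).decode('ascii') == chr(i).decode('utf-8')
  ...     for i in range(128))
  True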


Errors like this are what my coworkers and I jokingly refer to as US-UTF8 (no offense meant). In a country that's dominated by ASCII, "supporting" UTF-8 means emitting the same data as usual but declaring it as UTF-8.

Sure there might be some misunderstandings with special punctuation characters as evidenced by the article, but such issues generally get low priority.

In countries where the language isn't representable in ASCII, we can't use US-UTF8, but have to resort to "real UTF-8", which means dealing with legacy systems that don't do UTF-8 (which is what happened in the article we're currently commenting on), dealing with browsers that lie about encodings, and dealing with the fact that a string's length isn't its byte length any more, even if it doesn't contain "fancy" punctuation characters.
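
The length mismatch bites with even a single accented character:

  >>> s = u'na\xefve'            # "naïve"
  >>> len(s)
  5
  >>> len(s.encode('utf-8'))
  6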

All that makes me wish I could do US-UTF8 too :-)


Proof that even completely ordinary string data used in the most USA-centric domain imaginable STILL needs proper encoding.


I wouldn't call the right single quotation mark "completely ordinary string data." As far as I know, the only way it would ever get into a tweet is through some "smart correcting" client, or purposeful manual entry.


A healthy amount of the writing of the world still begins life in a word processor.

Any string born of this heritage is likely to have single and double (curved) quotation marks. OpenOffice, Word, Pages – they all do it and are expected to.

For that reason, I consider them to be reasonably present in ordinary string data.


What a cool shot. The prez with a common bug on his screen. And a bug I can fix! Still, even though this is an easy fix, he's going to need to open up a ticket.


My only regret is that I have but one upvote to give to this comment.


I find it infuriating that this sort of thing is still a problem. I'm constantly seeing mangled apostrophes in places like Google reader too.


I would personally blame Wordpress for substituting a common apostrophe ' with the left ‘ and right ’ quotes respectively. Same with quotation marks “ and ” instead of the traditional ".

There is an option to not use "Smart Quotes", but it seems to be enabled by default.


Wordpress is for writing, and writing should be properly typeset, and properly-typeset text has proper quotes.

BTW, a note on the term “smart quotes”: that originated when word processors became “smart” about transforming the easy-to-type (but incorrect) ' and " to their proper equivalents automatically. The quotes themselves aren’t smart…they’re just quotes.

Typography nerd out.


For typography, yes.

For many blogs in the hacker community, source code snippets inside <code> tags can also be given "smart quotes", which completely breaks any strings that may be present.

You forget that Wordpress is used by many people outside of the writing community. When writing, if the writer cares about having their words properly typeset, the author can do so themselves. Wordpress tries to be smart about it and convert them, but many people do not care about such features. The developers, however, do.

Also, may I remind everyone, downvotes on HN are not for disagreement, they are for factually incorrect statements.


IMO it’s not bad that Wordpress auto-educates quotes, it’s that it does it even within code blocks. Which is inexcusable, for precisely the reason you mention.

Markdown + SmartyPants are a better solution IMO. (And you can install WP plugins that do this and that disable Wordpress’ default quote educator.)


Part of the problem is that UTF-8 makes things really, really simple, and bulletproof, and then people have to go and create problems again.

Listen. Any time you use an encoding other than UTF-8, you are creating incompatibilities. If your stated intention is to facilitate communication, you are failing. You are a bad person. Stop doing it. The only possible excuse for using a non-UTF-8 encoding is to frustrate communication.

(It's too fucking bad HTTP mandates that the default charset is ISO-8859-1.)


Zed Shaw, can you make a post called "Programmers Need To Learn Unicode Or I Will Kill Them All"?


Joel Spolsky wrote something that comes close, sans the threat:

http://www.joelonsoftware.com/articles/Unicode.html


What modern software stack uses Extended ASCII as its default encoding? The last time I dealt with this problem, it was in 2005 or 2006 and I was working with PHP.


My guess would be that it's a flash app. Flash is (or at least used to be) horrible at internationalization issues like this.


The article's comments say the problem was the (implicit) character encoding of the HTML page:

"The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1."


What bugs me is not the mis-encoding (though that's a fail), but that people "struggled to understand" it... surely everyone's seen apostrophes turn into these special characters on enough web pages over the past ten years to have recognized what's happening when it does.


Blame Hollywood. Non-tech folks have been conditioned to panic whenever anything out of the ordinary appears on screen, because that's how Hollywood visualizes computer viruses and alien space invaders.


An error like this should not detract from the value Mass Relevance delivers. Clearly an event like this or similar events like the Oscars in which they take part are better, more engaging because of their involvement.


That’s a lot of detail, but a very good explanation of what happened.


Thank you for the fascinating article! I've seen this bug in other places, and I never knew what it was (and usually brushed it off instead of digging deeper to find it).


Well, it could have been worse. They could have shown \'


Why does it say 3 hours ago under the tweet? Wasn't this in real time?


>Wasn't this in real time?

No. Questions were culled from Tweets with the #AskObama hashtag starting on June 30. Some of the questions did come in close to real-time. I think the most recent ones were 5-10 minutes old when presented to the president.


So how did they choose the questions then? Based on retweets, or the Twitter team just picked the ones they liked?


They had a group of moderators who were selected to find the best questions. I believe most of the moderators were journalists or bloggers.


He had to do a twitter townhall because El Jefe couldn't get a G+ invite :-|


tl;dr -- mojibake


So does HN support utf8 ¢ðrrꢆl¥?


˙sǝʎ 'os ʞuıɥʇ I


(ಠ_ಠ)


You have to give it to Microsoft. They use Word even for 140-character documents now! (I understand spell-checking is a reason, but browsers have spell-checking now.)


Downvoted, understandably perhaps, but nearly every time I see this sort of bug it's because someone has copied/pasted something from MS Word instead of using whatever native client input method was provided (usually a textarea or input field). Yes, blame the rendering script for not encoding properly, but I suspect this was Boehner's team copy/pasting from some internal MS Word doc. Cheap shot - they also likely copy/paste screenshots into Word and mail those around too, instead of just mailing the actual graphic file. :)


Cheap shot - they also likely copy/paste screenshots into Word and mail those around too, instead of just mailing the actual graphic file. :)

When I was in consulting doing software integration work, few things infuriated me more than client "bug reports" arriving in the form of an email containing a 15 megabyte MS Word doc with a bunch of un-annotated screenshots.

I really hope Google someday opens up the awesome bug report/screencap feature in Google+, that lets you highlight part of the screen and redact sensitive parts.


I almost wouldn't mind, except the screenshots are inevitably shrunk down so as to be unreadable.

In OS X, the cmd-shift-4 (and 3) keystrokes, which screenshot straight to a file, are near life-transforming. Snap, drag the file into an email, and it's done. I'm sure there are utilities in Windows which do this, but having it built in is great - no apps to start or install.


Be glad if you've not worked on a project where problem reports can take the form of a blurry, badly contrasted cell phone camera snapshot of the kernel oops that's partially scrolled off the user's screen. :)


press "Print Screen" and paste into new email? Works well in desktop clients like outlook, less so for web-clients like gmail...


NO ONE DOES THAT ON WINDOWS OUTSIDE OF TECH GEEKS. EVERYONE pastes in to MS Word, then emails that document.

That said, from a geek standpoint, I still prefer having the raw image file that I snapped in a folder someplace so I can refer to it later without having to go through sent emails, but that's just a personal preference.


Actually the "send feedback" link in Google+ does exactly that: asking you to highlight parts of the page where the error occurs and then obscure any sensitive info.


Here's my guess: this political tweet was one of many thought up by speechwriters and entered into a Word document. It was passed around a few times to make sure it hit the right talking points, perhaps read to Boehner over the phone, and eventually someone copypasted it to the internet.



