Never expect old code to handle new protocol elements. If it's not exercised regularly, it's not going to work.
Very few developers (outside of Google) are going to write randomized fuzzers to test for compatibility with theoretical future extensions. They're going to test that it works with existing stuff and then ship it.
So, the fuzzing has got to become part of the public ecosystem.
This isn't really the situation. The old code does not have to handle any new elements. It just has to compare (1,3) to (1,2) and send (1,2) the same as before. It has to look in an explicit list of optional items, and skip over all the optional items it does not know, and include only the ones it does know. This is all trivial and many implementations get it right. You don't need fuzzing for this part. Any remotely sane code would work just as well for (1,4) or (2,0).
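A minimal sketch of the comparison described above, assuming versions are carried as a (major, minor) pair. The struct and function names are made up for illustration and are not taken from any real TLS stack.

```c
#include <stdint.h>

/* Hypothetical (major, minor) protocol version pair. */
struct proto_version { uint8_t major, minor; };

/* Highest version this (old) implementation speaks: (1,2). */
static const struct proto_version MY_MAX = { 1, 2 };

/* Returns negative, zero, or positive, like memcmp. */
static int version_cmp(struct proto_version a, struct proto_version b)
{
    if (a.major != b.major)
        return (int)a.major - (int)b.major;
    return (int)a.minor - (int)b.minor;
}

/* Reply with the lower of the peer's offer and our own maximum. */
struct proto_version negotiate(struct proto_version offered)
{
    return version_cmp(offered, MY_MAX) > 0 ? MY_MAX : offered;
}
```

Nothing here cares whether the offer is (1,3), (1,4) or (2,0): anything above (1,2) gets answered with (1,2).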
There were probably some problems with odd dumb custom-coded gadgets. But mostly, the problem is with "value-add extra-security" boxes in the middle that don't _do_ the protocol at all. They are not a TLS client or a TLS server. They are a product which promises to spy on the traffic and "stop the hackers". They really can't do anything useful. They just see that the handshake packet had (1,3) instead of the familiar (1,2) and drop it. For good measure, they also block packets between those endpoints for 24 hours (so fallback retries don't work). They also didn't add (1,3) to their recognition system a couple of years ago when TLS 1.3 was in the works, because anyone buying or selling these things is just not very on top of things. So all the middleboxes which are popular enough to matter need to be tricked with obfuscation, so they're _really_ not doing anything.
Can't tell whether ploxiln knew this and was just using it as an example, but the TLS version indicator is dead anyway. The TLS 1.3 specification (passed last call but yet to be published) says to always fill in this field as though you're TLS 1.2 and renames it "legacy_version". The real version number is hidden inside a new extension, which they aim to prevent from rusting in place.
This was done because the version code ploxiln describes is so thoroughly rusted shut in middleboxes that TLS 1.3 would be undeployable in practice without such a change.
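Roughly what the server side of that new extension looks like, as a sketch. The wire values 0x0303 (TLS 1.2) and 0x0304 (TLS 1.3) are the real ones; the function name and the "pick the highest one we know" policy are my assumptions, not a quote from the spec.

```c
#include <stddef.h>
#include <stdint.h>

#define TLS1_2 0x0303
#define TLS1_3 0x0304

/* Pick the highest version we know from the client's offered list.
 * Entries we do not recognise are skipped, never treated as an error.
 * Returns 0 when nothing in the list is usable. */
uint16_t select_version(const uint16_t *offered, size_t n)
{
    uint16_t best = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t v = offered[i];
        if ((v == TLS1_2 || v == TLS1_3) && v > best)
            best = v;
    }
    return best;
}
```

The important part is the skip: a list containing some value the server has never heard of should still negotiate fine.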
Also, "fallback retries" aren't a thing. The reason is downgrade protection. If I have a new protocol (TLS 1.3) with better security, but I'm happy to retry with an older protocol (say TLS 1.2) if that doesn't work, obviously bad guys on the path will just ensure my TLS 1.3 connections all fail so that they can attack the weaker TLS 1.2.
TLS 1.3 guards against an attacker who tries to downgrade a TLS 1.3 connection. If both sides know TLS 1.3 and yet somehow the packets say TLS 1.2 on them when they arrive from the client, the Hello "random" value sent from the server will have the message "DOWNGRD" scribbled across part of it. The client sees this and aborts, because somebody is tampering with the connection. If the middlebox tries overwriting the bytes that have "DOWNGRD" written in them, then the random data doesn't match up and the connection fails.
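A sketch of the client-side check. The sentinel layout is the one that ended up in RFC 8446 section 4.1.3: the last 8 bytes of the 32-byte server random are "DOWNGRD" followed by 0x01 (TLS 1.2 negotiated) or 0x00 (TLS 1.1 or below). The function name is invented, and a real client would only run this when the negotiated version is lower than the best one it offered.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SERVER_RANDOM_LEN 32

static const uint8_t DOWNGRADE_PREFIX[7] = { 'D', 'O', 'W', 'N', 'G', 'R', 'D' };

/* Returns true when the tail of ServerHello.random carries the
 * downgrade sentinel, meaning a TLS 1.3 client should abort. */
bool downgrade_detected(const uint8_t server_random[SERVER_RANDOM_LEN])
{
    const uint8_t *tail = server_random + SERVER_RANDOM_LEN - 8;

    if (memcmp(tail, DOWNGRADE_PREFIX, sizeof DOWNGRADE_PREFIX) != 0)
        return false;
    return tail[7] == 0x01 || tail[7] == 0x00;
}
```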
What I don't get is why Google really bothers with middleboxes. If a business deploys them and then gets shut off from Google services, the admins won't take long to find and disable that middlebox. And if they don't, they go bankrupt. Whatever happens, the Internet wins.
Raymond Chen wrote some articles about how this impacted Windows 95. It doesn't matter that program X completely ignored the documentation in Windows 3.x, and so it "makes sense" technically that the program crashes in Win95. The user experience is that Windows 95 broke Program X, and they just bought Windows 95, so they will demand a refund and moan to all their friends.
There is a limited tolerance for Chrome versions that don't work, because the lesson customers get is "Don't run Chrome" not "My middlebox is garbage". The tolerance is increased for security problems, so Google is more willing to lose say 0.1% of users because they enabled TLS 1.3 (the remaining 99.9% of users get improved security) than to lose 0.1% of users because they added a cool 3D logo and it crashes on a specific model of video card or version of Windows due to a driver bug. But losing 10% of your users is a disaster, and that was the ballpark for TLS 1.3 in earlier drafts (before it was taught to sidestep more middleboxes).
And how about putting an informative tooltip on Google's pages?
They could use JavaScript to query a test server, which responds with the correct headers. If the answer doesn't get through, print a message.
Hopefully someone will notice and bring it up with the IT or ISP, especially if the message says to do so, and that Google/internet could stop working in the future.
Actually, get more popular sites to do it, like Facebook, Twitter, Apple and Microsoft (heck, they could do it as a part of the operating system): if a lot of websites say it, users will tend to think that something is wrong on their end.
This was surely already brought up as a solution, so I wonder what the catch was, if any?
Since this would be browser-independent, the browser wouldn't get blamed, and if it's only a mild inconvenience, it shouldn't bother people that much (they have been using the web despite more invasive cookie notices). I would expect it to get a critical number of non-conformant devices disabled quickly as a result.
I was more thinking about querying something like testtls.google.com, that would analyze the results server-side (this might require an extra round-trip), and return them to be displayed.
Presumably these middleboxes have some critical density.
If Google deploys 1.3 and 50% of corporate users can no longer reach Google's services, that would be seen as Google's mistake. Moreover, that would hurt Google quite a lot.
The issue is that these boxes are already deployed, and it wasn't noticed just how shitty they were until we tried to deploy 1.3. The system has essentially rusted shut.
I wish Google, Apple and Microsoft just worked as an alliance here. They are all working on TLS and if together they deployed TLS 1.3 no middlebox on earth would stand a chance.
Sometimes I wonder the same. Maybe they think it's hard to change something in an enterprise? Or they are afraid of losing the enterprise customer (because Microsoft works here!). Or just don't want to use their influence to evolve the protocol?
As you say, "fallback retries" are bad. But that is what browsers did, some years ago, when TLS 1.2 was less common ... and they added the TLS_FALLBACK_SCSV "not-a-cipher-suite" to try to detect attacker-caused downgrades in a dumb-server-compatible way ...
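For reference, the server half of that trick is tiny. A sketch, assuming the cipher suite list has already been parsed into 16-bit values: 0x5600 is the real TLS_FALLBACK_SCSV code point from RFC 7507, and the function name is made up. When it returns true the server is supposed to answer with the inappropriate_fallback alert instead of continuing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_FALLBACK_SCSV 0x5600 /* signalling value, not a real cipher */

/* True when the client says "this is a fallback retry" even though the
 * server could have spoken a higher version than the one offered. */
bool is_inappropriate_fallback(const uint16_t *suites, size_t n,
                               uint16_t client_version,
                               uint16_t server_max_version)
{
    for (size_t i = 0; i < n; i++) {
        if (suites[i] == TLS_FALLBACK_SCSV)
            return server_max_version > client_version;
    }
    return false;
}
```

A server that has never heard of the SCSV just sees an unknown cipher suite and skips it, which is what makes it dumb-server-compatible.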
Summary: a video driver cheated by having its implementation of the "do you support this DirectX feature" API always return true no matter what feature was asked for (and it didn't support everything, obviously). This made things crash and led to Microsoft (who didn't write the driver) getting complaints.
The solution, since DirectX features used GUIDs as their identifiers, was to take the MAC address of a new network card, use it to generate one GUID, then smash the card. Since they knew that MAC would never generate another GUID, they put in a check that would ask video drivers "do you support the <GUID from smashed network card> feature?" And if the driver claimed to support that feature, DirectX would know not to trust the driver's claims of feature support.
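The shape of that check, as a sketch only. The struct, the callback type and both GUID values are placeholders I made up; this is not the DirectX API and not the real canary GUID.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct guid { uint8_t bytes[16]; };

/* Hypothetical driver entry point: "do you support this feature?" */
typedef bool (*supports_fn)(const struct guid *feature);

/* Made-up stand-ins: one real feature, and the canary nobody implements. */
static const struct guid FEATURE_REAL = {{ 0x01, 0x02, 0x03, 0x04 }};
static const struct guid FEATURE_CANARY = {{ 0xde, 0xad, 0xbe, 0xef }};

/* A driver is only believed if it denies supporting the canary. */
bool driver_is_trustworthy(supports_fn driver_supports)
{
    return !driver_supports(&FEATURE_CANARY);
}

/* Two example drivers for illustration. */
bool honest_driver(const struct guid *feature)
{
    return memcmp(feature, &FEATURE_REAL, sizeof *feature) == 0;
}

bool cheating_driver(const struct guid *feature)
{
    (void)feature;
    return true; /* claims to support everything */
}
```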
> because anyone buying or selling these things is just not very on top of things
Sometimes these things are required by law and/or industry standards. IIRC banking sector companies are required by law to record the metadata of every employee communication... which only works with said middleboxes.
You say that as if there were no alternative. They can just install the necessary monitoring on the end-points or buy middleboxes that do proper TLS MITM by completely repackaging the transport instead of trying to mess with the headers.
I guess this is what's meant by "antifragile" -- not merely durable in the sense of "resistant to stresses", but actively improved by exposure to stress.
They tried this. This effort is explicitly because current implementations failed to correctly ignore new features. Since the previous approach was shown not to work in practice they are trying this approach which is intended to make incorrect implementations more obvious.
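For the curious, this approach was later written up as GREASE (RFC 8701), which reserves sixteen 16-bit values of the form 0x?A?A (0x0A0A through 0xFAFA) for senders to sprinkle into their lists. A sketch of generating and recognising them; the function names are mine, and I am assuming that RFC is the eventual form of what is being discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the 16 reserved values 0x0A0A, 0x1A1A, ..., 0xFAFA. */
bool is_grease(uint16_t v)
{
    return (v & 0x0F0F) == 0x0A0A && (v >> 8) == (v & 0xFF);
}

/* Map any seed byte (e.g. from a random source) to one reserved value. */
uint16_t grease_value(uint8_t seed)
{
    uint16_t b = (uint16_t)((seed & 0xF0) | 0x0A);
    return (uint16_t)((b << 8) | b);
}
```

A correct receiver never needs is_grease at all: it treats these like any other value it doesn't know and moves on. Only broken receivers notice them, which is the point.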
Some middleboxes treat new protocol elements as a sign of a hack attempt, and drop the connection. It's not a coding error, they're doing it on purpose. Which is why the current push to have as much as possible in the encrypted portion of protocols: so these crazy middleboxes can't prevent protocol evolution.
Who's gonna notice your mishandling of not-yet-used-in-the-wild parts of the protocol?
If the answer is no one, then it doesn't matter for your profits, and not caring about this case when coding increases your profits.
Google is now trying to make the answer "Chrome users".
They come from the cost savings made by a pointy-hair telling that smart aleck code monkey that bothered actually reading the spec that his concerns aren't going to delay the ship date or go into the budget for developer time -- but you know maybe in six months (every six months, it'll be six months out) we can totally revisit that.
Then it ends up in production in too many customer sites with IT that avoid updating stuff (because the updates frequently break things, which is a pain) that it's too late to fix it now. The concerned forward-thinking developer moves on to (slightly) greener pastures to rinse and repeat and the project gets handed to an off-shore team who don't comprehend the difference between current behavior and specified behavior of a product - yet another bold cost-saving measure for pointy-hair to use to argue for his bonus.
Not spending time thinking about handling those, not debugging anything related to them, etc.
The code might look almost the same, but there's always a cost in getting there. And the code in these middleboxes quite often smells.
Anybody that doesn't want to introduce crashing/fatal bugs that disrupt productivity? Skipping checks[1] and making assumptions about input[2] is an irresponsible disregard for basic security.
This is about basic programmer competence, not a time-consuming feature that might impact your development costs relative to your competitor. You are not going to make more profit by leaving out the "default:" case from your switch/case statements that skips parsing for unrecognized elements.
You can trust in basic programmer competence when there is a certification the programmer stands to lose if he displays incompetence, as there is for other engineers and also doctors and lawyers and many more.
Until then, you have to make the financial incentives in the short and long term such that they lead to desirable behavior, e.g., producing non-barfing middleware in this case.
Yes, which is why I really like the idea of proactive enforcement with random expected-to-be-ignored tags/parameters. I'm arguing that leaving out the last part of this
    for (item = params->head; item; item = item->next) {
        switch (item->type) {
        case KNOWN_PARAM_TYPE_FOO:
            // do normal stuff
            break;
        /* ... etc ... */
        case IGNORE_KNOWN_PARAM_TYPE_BAR:
            // fallthrough - BAR explicitly uses default handling
        default:
            continue; // skip unknown parameters
        }
    }
is evidence of incompetence, not a strategy that will "make more profit than your competitor".
Also, as the BAR constant suggests, you probably already have code that skips unrelated fields. While the difference in programmer time is almost always trivially small, sometimes it might be zero.
"Basic programmer competence" is not something you can consistently expect from people in the industry. Be it a bad day, general carelessness, or business pressures - there are many reasons to cut corners.
If the standard says a message can be up to 4096 bytes long, but for 15 years the message in practice is always under 250 bytes, there are going to be some implementations that just use a fixed size buffer smaller than 4096 bytes.
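The boring fix, sketched: size the buffer from the limit the standard states rather than from what you've seen in the wild, and reject anything larger. The 4096 figure is the one from the example above; the struct and function names are invented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPEC_MAX_MSG_LEN 4096 /* limit taken from the standard, not from observed traffic */

struct msg_buf {
    uint8_t data[SPEC_MAX_MSG_LEN];
    size_t len;
};

/* Returns 0 on success, -1 if the message exceeds the spec maximum.
 * On failure the buffer is left untouched. */
int store_message(struct msg_buf *dst, const uint8_t *src, size_t len)
{
    if (len > SPEC_MAX_MSG_LEN)
        return -1;
    memcpy(dst->data, src, len);
    dst->len = len;
    return 0;
}
```

The 250-byte messages everyone sends today and the 4096-byte one that shows up in year 16 go through the same path.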