Practical HTTP Header Smuggling: Sneaking Past Reverse Proxies to Attack AWS (intruder.io)
158 points by MalacodaV on Nov 11, 2021 | 26 comments



I think the real problem here is that people are writing web servers that don't enforce spec.

As soon as a space in a header name is found, a 400 Bad Request needs to be thrown. "Content-Length abcd: 0" is invalid and should never be accepted.
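
A minimal sketch of that rule, assuming the RFC 7230 `token` grammar for field names. `check_header_line` is a hypothetical helper for illustration, not any real server's API:

```python
import re

# RFC 7230: field-name = token = 1*tchar (no spaces, no colons).
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def check_header_line(line: str) -> int:
    """Return 200 if the field name is a valid token, else 400 (hypothetical helper)."""
    name, sep, _value = line.partition(":")
    if not sep or not TOKEN.fullmatch(name):
        return 400
    return 200

print(check_header_line("Content-Length: 0"))       # 200
print(check_header_line("Content-Length abcd: 0"))  # 400
```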


The real problem is text-based protocols, which are naturally quite flexible (and rather inefficient to parse). If HTTP headers were simply a single-byte identifier[1], then e.g. 03 is Content-Length and there's no way to interpret that as anything else. I've made similar comments about such before: https://news.ycombinator.com/item?id=23582056

[1] A single byte is sufficient --- there have been far less than 255 headers defined since the beginnings of HTTP; maybe custom ones can be defined in an additional space, but a byte is actually already more than sufficient to convey the same information that would take dozens of bytes in the current text-based protocol.


> [1] A single byte is sufficient --- there have been far less than 255 headers defined since the beginnings of HTTP; maybe custom ones can be defined in an additional space, but a byte is actually already more than sufficient to convey the same information that would take dozens of bytes in the current text-based protocol.

You're probably only talking about ones defined in something like an RFC. I'm pretty sure there are far more than 255 different HTTP headers in use just by my employer's in-house stuff.

> more than sufficient to convey the same information that would take dozens of bytes in the current text-based protocol

It's 2021, who cares about dozens of bytes? I'd wager that's far less than 1% of the size of most HTTP exchanges, and it means generic tooling can actually show you something useful for all that custom stuff it's guaranteed not to know the specifics of.


I don't see how you can get to more than 255, unless someone is using them in an incredibly stupid way. What really matters is the standard ones that everyone agrees on. The vendor-specific ones can be in their own space (e.g. 8xh and above), and if that's not enough, define an extension mechanism that adds a byte and you get 65535 different ones, which is definitely more than enough.

> It's 2021, who cares about dozens of bytes?

That's the sort of attitude that got us Electron and all the other bloated web crap out there. A little bit adds up quickly, especially at the scale of the Internet.


> I don't see how you can get to more than 255, unless someone is using them in an incredibly stupid way. What really matters is the standard ones that everyone agrees on. The vendor-specific ones can be in their own space (e.g. 8xh and above), and if that's not enough, define an extension mechanism that adds a byte and you get 65535 different ones, which is definitely more than enough.

It's kind of ridiculous to argue for a one-byte address space in 2021, then kludging on an extension mechanism to handle the obvious fact that's too small. And even after that, you're still leaving everyone with ints instead of names. So we'll all get to ask ourselves "what header is 3849, again?" way more than we ever should.

That's ignoring the fact that this proposal is totally DOA unless you can find a time machine to go back to 1989 and hit Tim Berners-Lee on the head with a pipe while he was writing the HTTP spec.

>> It's 2021, who cares about dozens of bytes?

> That's the sort of attitude that got us Electron and all the other bloated web crap out there. A little bit adds up quickly, especially at the scale of the Internet.

There's about a light year between worrying about a dozen bytes and something like Electron.


> It's kind of ridiculous to argue for a one-byte address space in 2021, then kludging on an extension mechanism to handle the obvious fact that's too small.

I'm saying that it's not too small. You still haven't mentioned anything about your use-case of needing several hundred(!?!?) different unique headers. 30 years of HTTP and so far there's been less than 100 defined.

> So we'll all get to ask ourselves "what header is 3849, again?" way more than we ever should.

Note that even those who have only a very vague idea of what HTTP is, know what a 404 is; and probably 403 too.

> There's about a light year between worrying about a dozen bytes and something like Electron.

Multiply that dozen bytes by however much traffic goes through the entire Internet... and it suddenly doesn't look small anymore.


> I'm saying that it's not too small. You still haven't mentioned anything about your use-case of needing several hundred(!?!?) different unique headers. 30 years of HTTP and so far there's been less than 100 defined.

That's not actually my use case. My objection is that you're basically advocating for running a code obfuscator on HTTP requests, which would make them far more painful to work with for very, very little gain.

>> So we'll all get to ask ourselves "what header is 3849, again?" way more than we ever should.

> Note that even those who have only a very vague idea of what HTTP is, know what a 404 is; and probably 403 too.

You're missing the point, numeric codes work there because almost no one defines new HTTP status codes. IIRC, there's maybe two dozen defined and most programmers could probably name only 4 or 5 off the top of their heads. People define new headers all the time.

Ints as identifiers have practically no human meaning at all. Very, very few will know what header 3000 is without looking it up, and because it's an int, when there's a naming collision the meanings will almost certainly be wildly different. It'll be a mess.

> Multiply that dozen bytes by however much traffic goes through the entire Internet... and it suddenly doesn't look small anymore.

Think of it this way: you're wasting bytes in your markup by using long names. Most JSON objects should never need more than 26 variables. How about we restrict variables to single-character strings? It'll save some bytes, and think about how much JSON is zipping around the internet. Ditto with our programs. Let's install a linter on your machine that will fail your builds if you use more than a 1-char variable name.

Also, it's well into the 21st century. Don't you think we should go back to two-digit years for most dates, because bytes?


This is what HTTP/2 does with HPACK compression to compress headers and its "static table" [1] for common headers.

There are 61 headers already defined in this table.

[1] https://httpwg.org/specs/rfc7541.html#static.table.definitio...
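
As a rough illustration of how that table works: in HPACK's "indexed header field" representation, one byte with the high bit set carries a 7-bit table index. The sketch below copies only four of the 61 static entries from RFC 7541 Appendix A. `decode_indexed` is a made-up helper name, and the dynamic table and multi-byte indices are deliberately left out:

```python
# Partial copy of the RFC 7541 static table (Appendix A); 4 of the 61 entries.
STATIC_TABLE = {
    2: (":method", "GET"),
    7: (":scheme", "https"),
    8: (":status", "200"),
    28: ("content-length", ""),  # name-only entry; the value is normally sent as a literal
}

def decode_indexed(byte: int):
    """Decode a one-byte indexed header field (hypothetical helper, static table only)."""
    if not byte & 0x80:
        raise ValueError("high bit not set: not an indexed header field")
    index = byte & 0x7F
    if index == 0 or index == 0x7F:
        raise ValueError("index 0 is invalid; multi-byte indices are not handled here")
    if index not in STATIC_TABLE:
        raise LookupError(f"index {index} is not in this partial table")
    return STATIC_TABLE[index]

print(decode_indexed(0x82))  # (':method', 'GET')
```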


That'd be nice - no uppercase/lowercase mixups (or the need to support Referer and Referrer).


so you're just describing what amounts to packets, which is basically like udp or tcp.

The reason http is good is due to the text nature (along with whatever drawbacks associated with it being text).


> A single byte is sufficient

Just use variable length encodings like LEB 128
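
For anyone unfamiliar, a minimal sketch of unsigned LEB128: each byte carries 7 bits of the value, and the high bit says whether another byte follows. The function names are my own, and mapping header IDs onto these integers is purely hypothetical:

```python
def leb128_encode(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if n < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def leb128_decode(data: bytes):
    """Decode one unsigned LEB128 value; return (value, bytes_consumed)."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return result, i + 1
    raise ValueError("truncated LEB128 value")

print(leb128_encode(3).hex())     # 03
print(leb128_encode(3849).hex())  # 891e
```

IDs below 128 still cost a single byte, and larger ones grow one byte at a time.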


There is something more than that. HTTP Content smuggling has already been identified as a significant issue and the largest cloud providers and Reverse Proxy server software should have already fixed these issues.

I started a GitHub repo to run integration tests for popular combinations of reverse proxy to popular language web servers to identify these gaps in expectations (how duplicates, capitalization, white space, etc affect HTTP headers in different servers)


Like so many web technologies, HTTP headers are a big, complicated, ad-hoc mess (some headers are specified in a way that’s not standard-compliant), so it’s to be expected that things like this happen.


More interesting is why "Content-Length abcd:" is treated the same as "Content-Length:" at all. Did someone over-optimize the header lookup? If so, perhaps other kinds of extensions like "Content-Length-abcd" are possible too, not only ones with a space?


More likely, they're stopping on the first space OR colon to parse the header name since "Content-Length : 0" is valid.
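
To make that guess concrete, here is a sketch of two parsers disagreeing on the same line. Both functions are hypothetical and only illustrate the suspected behavior; this is not AWS's actual code:

```python
import re

def lenient_name(line: str) -> str:
    """Hypothetical lenient parser: the name ends at the first space OR colon."""
    return re.split(r"[ :]", line, maxsplit=1)[0].lower()

def colon_only_name(line: str) -> str:
    """Parser that treats everything before the first colon as the name."""
    return line.split(":", 1)[0].lower()

line = "Content-Length abcd: 0"
print(lenient_name(line))     # content-length
print(colon_only_name(line))  # content-length abcd
```

If the front end behaves like the second function and the back end like the first, only the back end sees a Content-Length header.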

Personally, if I were writing a HTTP request parser while being lazy about enforcing spec, I'd split ONLY on the colon, then just strip the white space on either side of both the header name and value. In Python:

    # Split only on the first colon, then normalize both sides.
    header, value = line.split(':', maxsplit=1)
    header = header.strip().lower()
    value = value.strip()

After that, `header` should ALWAYS be checked via equality, and never `.startswith(...)`.


Note that parsing is likely more complicated than your code because you have assumed that your “line” has already been identified before parsing the line. AFAIK there is an escape sequence for the header delimiter (\r\n).

Also, your code doesn’t fix the issue where a header name with a white space is accepted (which may violate expectations, depending on the server).

Your pseudo code also doesn’t handle edge cases where 2 headers which normalize to the same stripped text collide. One HTTP smuggling vector is the front server keeping a different header value than the back server when 2 header names collide.
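
A minimal sketch of handling both points, assuming lines have already been split on \r\n. `collect_headers` and the `SINGLE_VALUED` set are illustrative assumptions, not an exhaustive or standard list:

```python
# Assumption: headers that must not repeat; illustrative, not exhaustive.
SINGLE_VALUED = {"content-length", "host"}

def collect_headers(lines):
    """Collect header values per lowercased name, refusing ambiguous input."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        # Reject a missing colon, an empty name, or any whitespace in the name.
        if not sep or not name or any(c in " \t" for c in name):
            raise ValueError(f"malformed header line: {line!r}")
        key = name.lower()
        # Reject the collision instead of silently keeping the first or last value.
        if key in SINGLE_VALUED and key in headers:
            raise ValueError(f"duplicate {key} header")
        headers.setdefault(key, []).append(value.strip())
    return headers

print(collect_headers(["Host: example.com", "Accept: a", "accept: b"]))
```

Rejecting the request outright means the front and back servers never get the chance to pick different winners.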


> since "Content-Length : 0" is valid.

According to which spec? RFC 7230 allows optional whitespace (OWS) after the colon, but not before it:

   header-field   = field-name ":" OWS field-value OWS


You definitely shouldn't strip the left side of the header line, as leading whitespace is the syntax for splitting a header over multiple lines, at least in email. I'm not sure whether this applies to HTTP, but some parsers may do it anyway, and some don't.

Just shows how easy it is to be wrong by being lazy with http parsing.
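
For HTTP specifically, RFC 7230 calls this line folding `obs-fold` and deprecates it: a server receiving it in a request must either reject the message with a 400 or replace the fold with spaces. A minimal sketch of detecting it so the request can be rejected, assuming the header block is already split into lines (`has_leading_whitespace` is a hypothetical helper):

```python
def has_leading_whitespace(header_lines) -> bool:
    """True if any header line starts with SP or HTAB (a folded or malformed line)."""
    return any(line.startswith((" ", "\t")) for line in header_lines)

print(has_leading_whitespace(["Host: example.com", "X-Long: part one", " part two"]))  # True
print(has_leading_whitespace(["Host: example.com", "Content-Length: 0"]))             # False
```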


I'm guessing they just check what each line starts with. Then they probably split the line on the : to get the value. That would produce the results seen.

It shows just how careful you have to be when writing code that is Internet-facing, and especially on the scale of AWS where you have half the world's hackers trying to find exploits.

I'm not even looking for exploits and I find them every day. For instance, I wanted to read some magazines the other day but they were behind a paywall. Just to see what was behind the wall I checked for a sitemap file. 35MB sitemap.xml contains direct links to the full downloads of every item with no auth needed.


> It shows just how careful you have to be when writing code that is Internet-facing

All code. “Internet facing” is not the only relevant qualification.

Any code where user-generated input is parsed should be carefully written, tested, and documented. Edge cases should be identified and described in specs. Non-compliant software should be identified and shamed (or preferably PRed).

I know that AWS has already patched some HTTP Smuggling attacks maybe 3 years ago, but I don’t remember if it was the same AWS feature (the previous one might have been CloudFront) and the parsing error might have been a little different.


Probably backwards compatibility with some ancient webserver from the dawn of HTTP, which became frozen into the protocol forever. Everyone who proposes a fix runs into some grey hair who is worried about xkcd-spacebar'ing someone out there, even though it's probably no longer relevant. Standards bodies being what they are, it is difficult to accept any risk.


This was mentioned in go's net/http by this CL: https://go-review.googlesource.com/c/go/+/17980/ it's an interesting point that the spec allows this.


Note that the spec is stricter for field-name than for field-value. Field names are ASCII, while field values are latin1 (or mime encoded but no one cares about mime encoding).

And yes I have seen bytes in both names and values in the wild (where bytes in names are invalid but need to be handled gracefully, while bytes in values are effectively valid latin1 if only for legacy reasons)

Looking at the bug you linked to, looks like this almost bit them too. Here's the final field-value behavior they landed on: https://go-review.googlesource.com/c/go/+/18375/


With HTTP/2 you can theoretically even transmit bytes with all binary values inside them in both names and values - since the values are length-delimited.

Based on that, some implementations seem to restrict allowed values to the rules that you describe, while others don't.


Yeah, and really the moral of the story shouldn't be (just) to get good at parsing, but to assume that any two parsers may disagree on how to parse a piece of data as part of your security model.


Does your average person who just barely learned to use nginx to hide a port need to worry about this?



