I've been working on implementing a solid HTTP library in Rust, currently at http://github.com/chris-morgan/rust-http. Servo has been using Joyent's HTTP parser and this is a problem that I had observed with it. Yes, it is pretty badly implemented, but it's not a mess because of the style of code—that's for performance. It's only a mess because it's inconsistent and incorrect.
Reading the HTTP method in a high-performance way does lead to superficially ugly code. That's why code generation is good. Ragel has been mentioned and I intend to seriously consider using it, but for the moment my own HTTP method reading code is generated with a short script.
This generates the high performance artwork shown at http://sprunge.us/HdTH, which supports extension methods correctly. (Rust's algebraic data types are marvellous for many things in implementing such a spec.)
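To give a flavour, here is a minimal sketch of the shape of such generated code in Rust; the names are illustrative assumptions, not rust-http's actual definitions. The real output linked above is more elaborate, but the shape is the same: dispatch on the method bytes, fall back to an Extension variant for any other token.

    // Minimal sketch (illustrative names, not rust-http's actual API).
    // An enum with an Extension variant is where algebraic data types
    // shine: every token is a valid method per the spec.
    pub enum Method {
        Get,
        Put,
        Post,
        Patch,
        Extension(String), // any other token
    }

    fn read_method(bytes: &[u8]) -> Method {
        match bytes {
            b"GET" => Method::Get,
            b"PUT" => Method::Put,
            b"POST" => Method::Post,
            b"PATCH" => Method::Patch,
            other => Method::Extension(String::from_utf8_lossy(other).into_owned()),
        }
    }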
Yes, indeed. Discussing what I was doing with rust-http in #rust is what got erickt to update that to the current state of Rust development.
I would not be using the Mongrel spec directly, for licensing reasons: I want the entire thing to be MIT + AL2. I don't know whether it would end up the most efficient anyway; I'll see when I get around to experimenting with that.
The most interesting part (to me) is that the server will handle e.g. "PUN" as though it actually said "PUT". I wonder if this could be used as an attack vector?
Sounds a lot like a "confused deputy" situation: imagine that your L7 firewall has a rule to reject any PUT request, but it sees PUN and thus allows the request to pass through to node.js, which then treats it as though it were actually PUT.
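As a toy sketch of that failure mode (hypothetical code on both sides, not node's or any firewall's actual logic): the backend commits to PUT after two characters, while the firewall compares the exact token.

    // Hypothetical: a backend that commits to PUT after seeing "PU",
    // and a firewall that compares the full method string.
    fn buggy_backend_method(method: &[u8]) -> &'static str {
        if method.starts_with(b"PU") { "PUT" } else { "OTHER" }
    }

    fn firewall_blocks(method: &[u8]) -> bool {
        method == b"PUT" // exact match, so "PUN" is not blocked
    }

    fn main() {
        let method: &[u8] = b"PUN";
        assert!(!firewall_blocks(method)); // the firewall waves it through
        assert_eq!(buggy_backend_method(method), "PUT"); // the backend runs it as a PUT
    }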
It is now 0.10.x, and before that it was 0.9; probably when it reaches 0.99 it will go to 0.100, and so on. I highly recommend that anyone avoid using Node in production until its developers grow up, take responsibility for a stable API, and stop changing their minds every two months.
Do you have any links that point to "the developers changing their minds every 2 months"? How have you been affected? As a node user for 2 years, my view is that the API has been fairly stable of late.
The mailing list just had a long post by Isaacs about how 0.10 is the last release before 1.0 and that 1.0 should be backward compatible with anything you write for 0.10.
The level of bikeshedded micro-optimization going on in that file is hilarious. The whole thing could be swapped out for a Ragel parser and nobody would notice a thing.
> To make a long story short: very old versions of http_parser were based on a Ragel grammar but the handwritten version outperformed it by a significant margin, as in over 10% fewer branches taken.
> Ragel is great but it can't take shortcuts like a human programmer can and once you start jumping around in your grammar, things become complicated fast (and impossible to debug.)
This is not optimization or bikeshedding. Optimization is not supposed to break things, it's supposed to make things more efficient. I don't see any examples of bikeshedding either (this is just code; where's the dispute?). This is a broken implementation of a protocol.
LINK was added after HTTP/1.0 and removed before HTTP/1.1. http://tools.ietf.org/html/draft-ietf-httpbis-method-registr... is being cited, but that is still a draft, and the registry it refers to does not yet exist. I believe it is thus fair to say that LINK is not a standard method?
To be certain, whether or not LINK is supported is not really the issue. The file is mishandling HTTP methods in general for the sake of "optimization". Had the parser been written correctly in the first place, it ought to have been trivial for someone to add support for new extension methods like LINK, but since the parser is broken, it becomes significantly more difficult.
Oh, to be certain. The HTTP spec does have extension methods; any token is valid as a method name. (Implementing the spec in Rust has taught me a lot about HTTP.) I guess I should avoid the quibble of the particular example and focus on the bigger picture :-)
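For the record, the grammar really is that permissive: a method is just a token. A sketch of that check, using RFC 2616's separator set:

    // Sketch of RFC 2616's "Method = token" rule: one or more ASCII
    // characters, excluding controls (0-31, 127) and separators.
    fn is_separator(c: u8) -> bool {
        matches!(c, b'(' | b')' | b'<' | b'>' | b'@' | b',' | b';' | b':'
            | b'\\' | b'"' | b'/' | b'[' | b']' | b'?' | b'='
            | b'{' | b'}' | b' ' | b'\t')
    }

    fn is_valid_method(method: &[u8]) -> bool {
        !method.is_empty()
            && method.iter().all(|&c| (32..127).contains(&c) && !is_separator(c))
    }

By this rule, PUN, LINK and MADEUPVERB are all well-formed methods; a server is free to answer 405 or 501, but not to silently rewrite them.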
- I value software that keeps to the spec, because it's the spec that I (as a dev or non-dev) refer to. You never hear "NodeJS has an HTTPish module", nor do you read documentation of that module's concepts and behaviour; those are defined in the spec, and the __fill_in_with_any_language__ HTTP module just implements those definitions.
- Optimizations, simplifications, and corrections should be made in the spec whenever you find them at implementation time.
But to this day there has not been ONE HTTP server that grasps and handles the HTTP specs in their entirety. So I find it hilarious to read about optimizations when neither of us has the whole picture.
That said, I don't think Node.js is to blame here (although they do have weird views of standards: https://github.com/joyent/node/issues/4850) but HTTP itself, because the spec's abstraction levels have been far out of the implementations' reach. HTTP's concepts are gorgeous, but they are worth nil if implementing them is "hard" and never done properly.
There are plenty of good ways to optimize this code. They didn't pick any of them. What I find more surprising than anything is that they didn't just optimize GET and handle the rest generically.
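Something along these lines, say; a hedged sketch of that idea, not a patch for node:

    // Fast-path the overwhelmingly common case; stay generic (and
    // spec-correct) for every other token.
    fn parse_method(buf: &[u8]) -> Option<String> {
        // Fast path: "GET " checked as one short prefix comparison.
        if buf.starts_with(b"GET ") {
            return Some("GET".to_string());
        }
        // Generic path: take whatever token precedes the first space.
        let end = buf.iter().position(|&c| c == b' ')?;
        Some(String::from_utf8_lossy(&buf[..end]).into_owned())
    }

The fast path costs one comparison for the case that dominates real traffic, and nothing unusual ever gets misread as something else.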
It should probably be noted that:
a) you don't have to use node's built-in HTTP server (yeah, I know, nearly everyone will); you are more than free to write your own or use one from its module repository (npm)
b) the entire HTTP module is currently being overhauled in the 0.11.x development branch, and the changes should appear in the stable 0.12.x releases.
Out of interest, has anyone seen what other web servers' support for these more 'esoteric' verbs is like?
For clarification (I assume you know this but the wording was a bit ambiguous) there are extra handlers that you can use to get some nice behaviour, but you don't have to use any of them. It doesn't add the 'magic' unless you explicitly turn it on (i.e. choose the rest handlers).
Nice catch; reading the nodejs code is really discouraging. But the post points to the relevant specs, so I guess this should be fixed sooner rather than later. The initial response looks good: https://github.com/joyent/node/issues/6078
Not a big deal. Even the REST people have long been just sending normal GET and POST and including some indicator that they really want it treated as some other verb.
It's disappointing to see your kind of response here.
It starts when somebody makes a legitimate observation that a very specific set of functionality has been shown to have serious flaws in practice, and then suggests that perhaps the best course of action is to remove this very specific, and broken, functionality.
Then somebody else comes along, and responds like you did with a smart-ass comment taking it to an overly-broad, unreasonable and stupid extreme. These kinds of comments are useless.
Sometimes the best way to fix broken functionality is to remove it. That in no way means it's the only possible solution, however. Nor does it mean that it needs to be applied without bound.
>It starts when somebody makes a legitimate observation that a very specific set of functionality has been shown to have serious flaws in practice, and then suggests that perhaps the best course of action is to remove this very specific, and broken, functionality.
I missed that. I saw a link that says that we shouldn't have verbs because no one implemented them, LOL SPACEJUMP.
Do you really think that it's a "legitimate observation" that because node.js has a shitty http parser, the parts of http that it parsed badly should be thrown out?
I read two smart-ass comments here, and a civility troll.
The nodejs source credits [0] as inspiration. It seems that nginx actually does something like that: first, it switches on the length of the verb, then it tries to compare the string (using the ngx_str6cmp() and similar macros) to the ones it knows how to handle.
When the length of the verb is 4, it checks whether the second character is 'O' before trying the full string comparisons, probably because there are four possibilities with it (POST, COPY, MOVE, LOCK) vs. only one without (HEAD).
It starts at line 180.
That being said, in nginx's case it's a preliminary check, just to avoid doing some full string comparisons in a very specific case. The nodejs parser, on the other hand, directly does 'parser->method = HTTP_CONNECT' after checking one character. It does seem to check the other characters afterwards in some cases, but that assignment is quite misleading.
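For illustration, the nginx strategy described above looks roughly like this in Rust (a sketch only; the real thing is C macros in ngx_http_parse.c):

    // Switch on length first, use a cheap discriminating check, then
    // confirm with a full comparison: a partial match never decides.
    fn parse_method(m: &[u8]) -> Option<&'static str> {
        match m.len() {
            3 => match m {
                b"GET" => Some("GET"),
                b"PUT" => Some("PUT"),
                _ => None,
            },
            4 if m[1] == b'O' => match m {
                b"POST" => Some("POST"),
                b"COPY" => Some("COPY"),
                b"MOVE" => Some("MOVE"),
                b"LOCK" => Some("LOCK"),
                _ => None,
            },
            4 => if m == b"HEAD" { Some("HEAD") } else { None },
            _ => None, // longer methods elided from this sketch
        }
    }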
I'm not an expert, but I don't think you can do pure bit checks on a modern CPU. It likely does single-character comparisons in a single hardware instruction anyway.
Actually, if you look at the nginx source I linked, the comparisons of the full strings are hidden behind macros because they do the opposite: they group characters four at a time and do 32-bit int comparisons as much as possible.
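For illustration, here is the same trick in Rust (the idea behind those macros, not nginx's literal code); the 8-byte u64 version is identical in spirit:

    // Pack "GET " (method plus the trailing space) into a u32 and
    // compare integers instead of looping over characters.
    fn is_get(buf: &[u8]) -> bool {
        buf.len() >= 4
            && u32::from_ne_bytes([buf[0], buf[1], buf[2], buf[3]])
                == u32::from_ne_bytes(*b"GET ")
    }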
An immediate value stored in an instruction would double performance over an L1 data cache reference... but this is said just to mock the people in this thread who want 'almost' machine-level logic to match robust user-land permutations.
I've seen this kind of thing before. Someone falls in love with their optimization ("we can parse GET with a single int switch!") and when it is pointed out that it doesn't handle the spec correctly, and that handling the spec correctly would make this as slow as the naive way, it still ends up in the code because "it's good enough."
Kind of beside the point, really - this kind of micro-optimization doesn't actually improve performance to the extent you want, especially when you consider the downsides.
You can improve HTTP throughput far more dramatically with higher-level optimizations, like moving request parsing into the kernel or otherwise doing things that address larger bottlenecks involved in getting HTTP requests from the network card. The Windows version of Node performs a lot better if you have it use Microsoft's kernel mode HTTP interfaces than if you use the crazy micro-optimized HTTP support in Node that is responsible for this bug.
Ultimately, the prologue of an HTTP request ('GET / HTTP/1.1' or whatever) is a tiny string of bytes, and parsing it, even in the slowest language on the planet, really isn't that expensive. Optimizing that tiny bit of work isn't going to produce enormous gains. It's mostly just an ill-advised attempt at being clever.
Having written a high-performance (1 Gbps+ on a 2nd-gen i7, loopback) SIP packet capture/parser, I'll just say that such optimization of a text-based protocol is not always a waste of time or an attempt at being clever. Doing things like comparing strings against 4- or 8-byte integers can show remarkable improvement, especially for the common methods.
HTTP (and SIP) have idiotically complex grammars for no reason other than the authors getting all clever on us because it's a "text based" format. I'd be shocked to find that HTTP is properly implemented in most cases (fully handling comments and line folding, for example).
And then, even if you do implement it properly, you have to worry about other software interpreting it differently. So Proxy A determines a request is allowed and appends a header, but Server B reads the request in a different way and violates policy.
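Header line folding is a concrete example: RFC 2616 lets a header continue onto the next line if that line starts with a space or tab, so a request like this (illustrative) reads differently depending on whether the parser folds:

    GET / HTTP/1.1
    Host: example.com
    X-Policy: allow
     -for-internal-clients-only

A folding parser sees one header, "X-Policy: allow -for-internal-clients-only"; a non-folding one sees "X-Policy: allow" plus a junk line, and the proxy and the origin server can end up disagreeing about what was requested.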
No, it won't kill the process; it will simply terminate the connection. From the C code, it looks like it was intended to return a 405, but somewhere in the process it screws up and terminates early.
How is it clear? It's quite possible the author measured a naive implementation and found bottlenecks there to optimize. If method parsing was a bottleneck, I could see the overhead of string comparison being quite significant. Single character equality is one instruction vs. a dozen or two (including a loop) for string equality.
For string equality on most of these methods, it's not a dozen.
If your goal is really optimization, you'd do it as a series of vector operations (GCC will let you do this in a platform-independent way for almost all of these methods), using the _mm_cmpeq_* intrinsics, which compare 16 bytes at a time with SSE2, or 32 at a time with AVX2.
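For illustration, a sketch of that on x86_64 in Rust, assuming the request buffer and the pattern are both padded to 16 bytes (illustrative, not production code):

    // Compare 16 bytes at once with SSE2 and accept only if the lanes
    // covering the method (len <= 16) all match.
    #[cfg(target_arch = "x86_64")]
    fn method_matches(buf: &[u8; 16], pattern: &[u8; 16], len: usize) -> bool {
        use std::arch::x86_64::*;
        unsafe {
            let a = _mm_loadu_si128(buf.as_ptr().cast());
            let b = _mm_loadu_si128(pattern.as_ptr().cast());
            let eq = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) as u32;
            let want = (1u32 << len) - 1; // bitmask of lanes that must match
            eq & want == want
        }
    }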
I've looked at standard library implementations of strncmp before, and there are definitely some crazy tricks done to make it fast. Anyway, totally agree that I'd want to see some benchmarks to prove that it is a bottleneck. My point is that I wouldn't claim that it isn't without seeing benchmarks either.
Another thing to realize is that the relevant parts of the string can be compared without any cache-miss difference between the single-char and multi-char variants, so the difference would be negligible even if the comparisons were not parallelized at all.