Tell HN: Please Stop Breaking HTTP Clients

nir · on July 15, 2009

Voted up, as a good example of why it's better to build on top of existing software rather than reinvent the wheel (or in this case, Apache).

blasdel · on July 15, 2009

Apache is probably the worst example you could have possibly picked, it's such a dog.

davidw · on July 15, 2009

Eh? Apache is a fine web server which has done a great job over the last 10+ years for all kinds of people and businesses.

kaens · on July 15, 2009

Could you expand on why you don't like Apache? I'll grant you that it can be confusing to get used to configuring, especially if you're not really familiar with how the web works, but as far as I can tell it's a pretty damn good server.

What do you prefer over apache?

blasdel · on July 15, 2009

My fundamental beef with Apache is that people wrap way too much functionality into it, to the point where their entire end-to-end web stack is two sets of processes: Apache and a database server.

On top of that its configuration is awful, the modules are all awful compared to their domain-specific alternatives, the architecture is awful (especially wrt concurrency), the development process is molasses, and the Apache Software Foundation has all but abandoned httpd for IBM-focused all-Java astronaut architecture with as much bureaucracy as they can possibly fit into a public process.

I hate Apache because I'm intimately familiar with how the web works, and the ASF is responsible for so much web-hostile WS-* garbage.

For a while (~4 years ago) I thought lighttpd was worthwhile as a total replacement, and did a project that hacked its WebDAV implementation for userspace filesystems (pre-MacFUSE), but fundamentally its architecture is still Apache just with sensible concurrency.

I've come to be really fond of the reverse-proxy model, and really like Nginx running in front of independent app processes using whatever HTTP abstraction is native to the language (WSGI, Rack, Servlets, etc.) along with a nice native high-level spec-focused HTTP server (twisted.web, mongrel, ???). The last web application I wrote from scratch was on Google AppEngine, and I really like their version of WSGI.

davidw · on July 15, 2009

> My fundamental beef with Apache is that people wrap way too much functionality into it, to the point where their entire end-to-end web stack is two sets of processes: Apache and a database server.

Fine, that's a technical point, and fair, but that doesn't make it a 'dog'. It works well for many things, and not so well for some others.

Most of the rest of what you write is vitriol and hyperbole ... "awful", "molasses", "hate", "garbage", and so on and so forth. Your "liberal with what you accept and conservative with what you output" (which I voted up, it's a great quote) goes for interaction with other people, as well.

As a member of the ASF, I'd also like to point out that the foundation goes where people want it to. Sure, there are lots of Java projects, but there are some reasons for that. Java tends to be used by people (and especially companies) who also like to be sure of the provenance of their software, and Apache projects can be very sure of that. That process does introduce a bit of bureaucracy, but it's not all that bad, really. Overall, yes, the ASF has gone a little bit 'enterprisey', but that's no reason to get bitter about it. If you don't like it, don't use it. (The FSF has similar kinds of bureaucracy too, in order to ensure that they have the copyright for their software.)

There are also numerous non-Java projects, like CouchDB, Perl, Python and Tcl modules, and 'Harmony' which is Java, but an implementation of the language itself, which requires some fairly interesting hacking, and of course the web server itself. Most of these projects are worked on by different people.

The concurrency model (models, actually), like many things, is of course a tradeoff, with no absolute 'right' answer (although I think Yaws does a pretty good job).

BTW, I do the Apache Tcl stuff, or did... I don't have much time for it these days.

davidw · on July 15, 2009

Furthermore:

> I hate Apache because I'm intimately familiar with how the web works, and the ASF is responsible for so much web-hostile WS-* garbage.

Presumably, there are a few people at Apache that know a thing or two about how the web works, too. Even "intimately". Like, say, Roy Fielding, who wrote part of the RFC this whole article is about.

tome · on July 15, 2009

I'm not familiar with what you're talking about. Could you clarify something for me? Why would you need Nginx and twisted.web? Wouldn't you just call your WSGI application from Nginx?

jaddison · on July 15, 2009

Nginx's architecture doesn't really mesh well with WSGI's interface protocol... I believe it has to do with blocking the serving Nginx process whilst the WSGI request is being processed.

Nginx prefers to "pass off" the request to a web-app server that does the heavy lifting, while Nginx itself handles the simple, static serving as fast as possible.
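A minimal sketch of that hand-off, using only Python's standard library. The WSGI app runs in its own process behind a small HTTP server (`wsgiref.simple_server` here, standing in for something like twisted.web), and Nginx would be configured separately to reverse-proxy to it. The names `app` and `serve`, the address 127.0.0.1 and port 8000 are all assumptions for illustration, not anything from this thread.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """A tiny WSGI application: always answers 200 with a fixed body."""
    body = b"hello from the app process\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the app in its own process, speaking plain HTTP.

    A front-end proxy such as Nginx would forward requests to host:port
    (hypothetical values) instead of calling the WSGI app in-process.
    """
    with make_server(host, port, app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` blocks and serves requests until interrupted. `wsgiref` is a reference server, so a real deployment would presumably swap in something sturdier behind the proxy.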

tome · on July 15, 2009

Thanks!

blasdel · on July 16, 2009

Because it's preferable to always use real HTTP between independent processes at every layer boundary

tome · on July 18, 2009

I don't understand that reasoning. If you've got to go to WSGI at some point anyway, why not straight away?

davidw · on July 15, 2009

"select" based web servers and their modern (poll/kevent/whatever) brethren, for serving purely static content, are faster and smaller. But that's always been the case.

ErrantX · on July 15, 2009

I find lighttpd definitely superior :)

(not disagreeing that Apache is a good choice too - just answering your question :D)

visitor4rmindia · on July 15, 2009

Maybe this would help?

Section 19.3

The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR.
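For what it's worth, a tolerant parser along those lines might look roughly like this in Python. It is only a sketch: `split_header_lines` and `parse_headers` are made-up names, the input is assumed to be the header block after the status line, and folded continuation lines are not handled.

```python
def split_header_lines(raw):
    """Split a raw header block (bytes) into lines.

    LF is treated as the line terminator and a single CR directly
    before it is dropped, so both CRLF and bare LF are accepted.
    """
    lines = []
    for line in raw.split(b"\n"):
        if line.endswith(b"\r"):
            line = line[:-1]  # ignore the CR that preceded the LF
        lines.append(line)
    return lines


def parse_headers(raw):
    """Parse header fields into a dict keyed by lower-cased field name."""
    headers = {}
    for line in split_header_lines(raw):
        if not line:
            break  # an empty line ends the header block
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # no colon: skipped here; a strict parser might reject
        key = name.decode("latin-1").strip().lower()
        headers[key] = value.decode("latin-1").strip()
    return headers
```

With this, `b"Content-Length: 5\r\n\r\n"` and `b"Content-Length: 5\n\n"` parse to the same dict.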

blasdel · on July 15, 2009

That guidance is for how best to "be liberal in what you accept"

news.yc's silly problem is in the area of "be conservative in what you emit"
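The thread doesn't show the exact bytes news.yc sends, so this is just a generic Python sketch of the "conservative in what you emit" half: every line of the response head ends in exactly one CRLF, and a blank line closes the head. `build_response_head` is a hypothetical helper, and the HTTP/1.0 version string is an assumption.

```python
def build_response_head(status, headers):
    """Serialize a status line and header fields into bytes.

    status  -- e.g. "200 OK"
    headers -- list of (name, value) string pairs
    """
    lines = ["HTTP/1.0 " + status]
    for name, value in headers:
        # Refuse embedded line breaks rather than emit a malformed head.
        if any(ch in name or ch in value for ch in ("\r", "\n")):
            raise ValueError("line break inside header field")
        lines.append(name + ": " + value)
    # One CRLF after each line, plus the empty line that ends the head.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")
```

Rejecting stray CR or LF inside a field is the design choice here: it is stricter than most servers need to be, but it makes it hard to emit a line ending a client won't expect.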

tdavis · on July 15, 2009

A recommendation to applications is hardly as noteworthy as a written spec. Most clients (rightly) go by the spec, as you can see in this thread (PHP and Ruby have been mentioned; Python's urllib.urlopen also fails to properly parse the CR out of headers). So, no, that doesn't help the fact that the current production server violates the spec and breaks most mainstream HTTP clients in the process, and does so with an oversight that is trivial to fix (which should never have made it into production and indeed already has been fixed in the repo).

aristus · on July 15, 2009

Voted up. But a lot of homebrew webservers don't adhere and you'll just have to deal. In this case "God" is W3C, and not every decision they made was good, eg the "Referer" [sic] header. :)

[deleted grumble about writing production HTTP clients in 2009]

tdavis · on July 15, 2009

Neither, I was reading through a Twisted ticket and it was mentioned that this server was breaking the new HTTP client (and why). I then confirmed this ridiculousness myself, and posted it.

I wrote one in 2008, though. Okay, most of one. But in my defense, no comparable [python] client existed at the time and I needed the functionality (synchronous requests were out of the question; a couple threaded clients were available but not robust or fast enough). As a general programmer rule, I tend to avoid writing code wherever possible :)

paulgb · on July 15, 2009

I wonder if anyone has tried to calculate how many bytes of data transfer have been saved from w3c's spelling of the word "referer".

aristus · on July 15, 2009

Fewer than that wasted by redundant CRLFs.

davidw · on July 15, 2009

I wonder if this is the reason I get a proxy error when I try and browse HN from my mobile phone.

coderrr · on July 15, 2009

I had to make changes to ruby's em-http-request for this

http://github.com/coderrr/em-http-request/commit/8e6444fe472...

pufuwozu · on July 15, 2009

I had the same problem a while ago and ended up writing a patch to make PHP's HTTP library more flexible:

http://pecl.php.net/bugs/bug.php?id=15223

ivank · on July 15, 2009

This was (probably) fixed in Anarki a while ago:

http://github.com/nex3/arc/commit/6cb43b3a5977950a61bfd6ce5a...

herf · on July 15, 2009

The rather good Fiddler debugging proxy (http://www.fiddler2.com/fiddler2/) always flags HN for this. In a popup, no less. So it drives me nuts too, and if it could be fixed, I could click on fewer popups.

stefano · on July 15, 2009

It broke my http client too (written in Arc, btw) some time ago: http://arclanguage.org/item?id=8283

jacquesm · on July 15, 2009

So, you're crawling HN and you expect the target of your crawl to fix something so you have an easier time of it :) ?

tdavis · on July 15, 2009

I'm not doing anything with HN. I expect the server to emit the proper line breaks because that's what it should do, especially if it's going to be made freely available and used by others.

There's no reason that such a simple bug should have made it into or remained in the production source for so long. It's laziness for laziness' sake.

kaens · on July 15, 2009

I would think that HN wouldn't really care about someone crawling them as long as they were respectful about it.

Also, as far as writing crawlers/scrapers goes, a header-handling error is one of the least annoying things you can run into.