Tell HN: Please Stop Breaking HTTP Clients
124 points by tdavis on July 15, 2009 | 30 comments
For whatever ungodly reason, the news.yc web server doesn't terminate header lines properly when sending a response, which happens to break any HTTP client that actually follows the RFC. Please see RFC 2616 section 3.7.1 and use \r\n as God intended.

   This flexibility regarding
   line breaks applies only to text media in the entity-body; a bare CR
   or LF MUST NOT be substituted for CRLF within any of the HTTP control
   structures (such as header fields and multipart boundaries).
Thank you!
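
For anyone emitting responses by hand, here is a minimal sketch of what the RFC asks for: every header line ends in CRLF, and a lone CRLF separates the headers from the body. The `build_response` helper and the sample headers are made up for illustration; this is not the news.yc server's code.

```python
CRLF = "\r\n"


def build_response(status_line, headers, body):
    """Serialize an HTTP/1.x response, ending every header line with CRLF.

    Hypothetical helper for illustration only.
    """
    lines = [status_line]
    lines += [f"{name}: {value}" for name, value in headers]
    # An empty line (a lone CRLF) separates the header block from the body.
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("ascii") + body


body = b"hello\n"
raw = build_response(
    "HTTP/1.0 200 OK",
    [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    body,
)
```

The body keeps whatever line endings it already has; only the control structures (status line and headers) need CRLF.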



Voted up, as a good example why it's better to build on top of existing software rather than reinvent the wheel (or in this case, Apache)


Apache is probably the worst example you could have possibly picked, it's such a dog.


Eh? Apache is a fine web server which has done a great job over the last 10+ years for all kinds of people and businesses.


Could you expand on why you don't like Apache? I'll grant you that it can be confusing to get used to configuring, especially if you're not really familiar with how the web works, but as far as I can tell it's a pretty damn good server.

What do you prefer over Apache?


My fundamental beef with Apache is that people wrap way too much functionality into it, to the point where their entire end-to-end web stack is two sets of processes: Apache and a database server.

On top of that, its configuration is awful, the modules are all awful compared to their domain-specific alternatives, the architecture is awful (especially wrt concurrency), the development process is molasses, and the Apache Software Foundation has all but abandoned httpd for IBM-focused all-Java astronaut architecture with as much bureaucracy as they can possibly fit into a public process.

I hate Apache because I'm intimately familiar with how the web works, and the ASF is responsible for so much web-hostile WS-* garbage.

For a while (~4 years ago) I thought lighttpd was worthwhile as a total replacement, and did a project that hacked its WebDAV implementation for userspace filesystems (pre-MacFUSE), but fundamentally its architecture is still Apache just with sensible concurrency.

I've come to be really fond of the reverse-proxy model, and really like Nginx running in front of independent app processes using whatever HTTP abstraction is native to the language (WSGI, Rack, Servlets, etc.) along with a nice native high-level spec-focused HTTP server (twisted.web, mongrel, ???). The last web application I wrote from scratch was on Google AppEngine, and I really like their version of WSGI.


> My fundamental beef with Apache is that people wrap way too much functionality into it, to the point where their entire end-to-end web stack is two sets of processes: Apache and a database server.

Fine, that's a technical point, and fair, but that doesn't make it a 'dog'. It works well for many things, and not so well for some others.

Most of the rest of what you write is vitriol and hyperbole ... "awful", "molasses", "hate", "garbage", and so on and so forth. Your "liberal with what you accept and conservative with what you output" (which I voted up, it's a great quote) goes for interaction with other people, as well.

As a member of the ASF, I'd also like to point out that the foundation goes where people want it to. Sure, there are lots of Java projects, but there are some reasons for that. Java tends to be used by people (and especially companies) who also like to be sure of the provenance of their software, and Apache projects can be very sure of that. That process does introduce a bit of bureaucracy, but it's not all that bad, really. Overall, yes, the ASF has gone a little bit 'enterprisey', but that's no reason to get bitter about it. If you don't like it, don't use it. (The FSF has similar kinds of bureaucracy too, in order to ensure that they have the copyright for their software.)

There are also numerous non-Java projects, like CouchDB, Perl, Python and Tcl modules, and 'Harmony' which is Java, but an implementation of the language itself, which requires some fairly interesting hacking, and of course the web server itself. Most of these projects are worked on by different people.

The concurrency model (models, actually), like many things, is of course a tradeoff, with no absolute 'right' answer (although I think Yaws does a pretty good job).

BTW, I do the Apache Tcl stuff, or did... I don't have much time for it these days.


Furthermore:

> I hate Apache because I'm intimately familiar with how the web works, and the ASF is responsible for so much web-hostile WS-* garbage.

Presumably, there are a few people at Apache that know a thing or two about how the web works, too. Even "intimately". Like, say, Roy Fielding, who wrote part of the RFC this whole article is about.


I'm not familiar with what you're talking about. Could you clarify something for me? Why would you need Nginx and twisted.web? Wouldn't you just call your WSGI application from Nginx?


Nginx's architecture doesn't really mesh well with WSGI's interface protocol... I believe it has to do with blocking the serving Nginx process whilst the WSGI request is being processed.

Nginx prefers to "pass off" the request to a web-app server to do the heavy lifting and take care of fast, easy serving in the fastest way possible.
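
A rough sketch of the app-server side of that setup, using only the standard library: a WSGI callable served over plain HTTP by its own process, which a front-end proxy such as Nginx would forward requests to. The `application` and `serve` names, the address, and the port are assumptions for illustration, and `wsgiref` stands in for whatever app server you would actually run.

```python
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """Minimal WSGI app; the HTTP server wrapping it handles the wire protocol."""
    body = b"Hello from the app process\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the app as its own HTTP process (host and port are assumed values)."""
    with make_server(host, port, application) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts the app process; the proxy in front then only has to speak HTTP to it and never needs to know about WSGI.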


Thanks!


Because it's preferable to always use real HTTP between independent processes at every layer boundary.


I don't understand that reasoning. If you've got to go to WSGI at some point anyway, why not straight away?


"select" based web servers and their modern (poll/kevent/whatever) brethren, for serving purely static content, are faster and smaller. But that's always been the case.


I find lighttpd definitely superior :)

(not disagreeing that Apache is a good choice too - just answering your question :D)


Maybe this would help?

Section 19.3

The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR.
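
A minimal sketch of that recommendation on the client side: split on LF and drop a preceding CR if there is one, so both CRLF and bare-LF headers parse the same way. The `parse_headers` function and the sample header blocks are hypothetical, and folded (continuation) header lines are not handled here.

```python
def parse_headers(raw):
    """Split a header block into (name, value) pairs, accepting CRLF or bare LF.

    Hypothetical helper; does not handle folded continuation lines.
    """
    headers = []
    for line in raw.split(b"\n"):
        line = line.rstrip(b"\r")  # ignore the CR that precedes the LF, if any
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(b":")
        headers.append((name.strip().decode("ascii"), value.strip().decode("ascii")))
    return headers


strict = b"Content-Type: text/html\r\nContent-Length: 42\r\n\r\n"
sloppy = b"Content-Type: text/html\nContent-Length: 42\n\n"
```

Both `parse_headers(strict)` and `parse_headers(sloppy)` yield the same two pairs, which is the tolerant behavior Section 19.3 suggests.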


That guidance is for how best to "be liberal in what you accept"

news.yc's silly problem is in the area of "be conservative in what you emit"


A recommendation to applications is hardly as noteworthy as a written spec. Most clients (rightly) go by the spec, as you can see in this thread (PHP and Ruby have been mentioned; Python's urllib.urlopen also fails to properly parse the CR out of headers). So, no, that doesn't help the fact that the current production server violates the spec and breaks most mainstream HTTP clients in the process, and does so with an oversight that is trivial to fix (which should never have made it into production and indeed already has been fixed in the repo).


Voted up. But a lot of homebrew webservers don't adhere and you'll just have to deal. In this case "God" is W3C, and not every decision they made was good, eg the "Referer" [sic] header. :)

[deleted grumble about writing production HTTP clients in 2009]


Neither, I was reading through a Twisted ticket and it was mentioned that this server was breaking the new HTTP client (and why). I then confirmed this ridiculousness myself, and posted it.

I wrote one in 2008, though. Okay, most of one. But in my defense, no comparable [python] client existed at the time and I needed the functionality (synchronous requests were out of the question; a couple threaded clients were available but not robust or fast enough). As a general programmer rule, I tend to avoid writing code wherever possible :)


I wonder if anyone has tried to calculate how many bytes of data transfer have been saved from w3c's spelling of the word "referer".


Fewer than that wasted by redundant CRLFs.


I wonder if this is the reason I get a proxy error when I try and browse HN from my mobile phone.


I had to make changes to ruby's em-http-request for this

http://github.com/coderrr/em-http-request/commit/8e6444fe472...


I had the same problem a while ago and ended up writing a patch to make PHP's HTTP library more flexible:

http://pecl.php.net/bugs/bug.php?id=15223


This was (probably) fixed in Anarki a while ago:

http://github.com/nex3/arc/commit/6cb43b3a5977950a61bfd6ce5a...


The rather good Fiddler debugging proxy (http://www.fiddler2.com/fiddler2/) always flags HN for this. In a popup, no less. So it drives me nuts too, and if it could be fixed, I could click on fewer popups.


It broke my http client too (written in Arc, btw) some time ago: http://arclanguage.org/item?id=8283


So, you're crawling HN and you expect the target of your crawl to fix something so you have an easier time of it :) ?


I'm not doing anything with HN. I expect the server to emit the proper line breaks because that's what it should do, especially if it's going to be made freely available and used by others.

There's no reason that such a simple bug should have made it into or remained in the production source for so long. It's laziness for laziness' sake.


I would think that HN wouldn't really care about someone crawling them as long as they were respectful about it.

Also, as far as writing crawlers/scrapers goes, a header-handling error is one of the least annoying things you can run into.



