I can't resist (this is the OP): you are missing the point. It's not just throughput, it's high-percentile latency. Latency is critical whether you have 1 billion users or 100, and it is difficult to bring the high percentiles of the latency distribution down into reasonable territory on Rails, since by default all operations are essentially serialized.
I guess the problem of high-percentile latency is not widely understood; I'm not sure I understand it myself. Can you explain in more detail? In particular, are you talking about requests that take a while to complete because they have some complex processing, or requests that take a long time to complete because they can't be processed until some other long-running request finishes? The bit about everything being serialized suggests that the main concern is the latter. Does this apply even when using multiple threads under the C Ruby implementation? Why does running multiple web server processes on the same machine not mitigate the problem?
BTW, I don't use Rails or Ruby, but I do use Python for web apps at work (currently CPython, GIL and all). I'm curious to find out if this problem of high-percentile latency applies to Python as well.
So, for any black-box service endpoint, the latency for any given request is obviously just the time it takes for that operation to complete. Ideally one measures both end-to-end latency from the client and server-side latency in order to understand the impact of the network and, for high-throughput applications, any kernel buffering that takes place.
All of that is obvious, I imagine. By "high-percentile latency", I'm referring to percentiles of a distribution of all latency measurements gathered from a given endpoint over some period of time. If you imagine that distribution as a frequency histogram, the horizontal axis ends up being buckets of latency ranges (e.g., 0-10ms, 10-20ms, 20-30ms, etc), and the bars themselves of course represent the number of samples in each such bucket. What we want to do is determine which bucket contains the 95th percentile (or 99th, or 99.9th) latency value.
You can see such a latency distribution on page 10 of this paper which I published while at Google:
http://research.google.com/pubs/pub36356.html
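To make the percentile idea concrete, here's a minimal sketch in Go (the bucket boundaries and counts are made up for illustration, not taken from the paper) of how you'd find which bucket contains the 95th-percentile latency given histogram counts like the ones described above:

    package main

    import "fmt"

    // bucket is an illustrative latency histogram bucket: the count of
    // samples whose latency fell in [lowMs, highMs).
    type bucket struct {
        lowMs, highMs int
        count         int
    }

    // percentileBucket walks the histogram in latency order until the
    // cumulative count crosses fraction p (e.g. 0.95) of the total, and
    // returns the bucket containing that percentile.
    func percentileBucket(hist []bucket, p float64) bucket {
        total := 0
        for _, b := range hist {
            total += b.count
        }
        threshold := p * float64(total)
        cum := 0
        for _, b := range hist {
            cum += b.count
            if float64(cum) >= threshold {
                return b
            }
        }
        return hist[len(hist)-1]
    }

    func main() {
        // Hypothetical distribution: most requests are fast, a long tail is not.
        hist := []bucket{
            {0, 10, 9000},
            {10, 20, 600},
            {20, 30, 250},
            {30, 50, 100},
            {50, 100, 40},
            {100, 500, 10},
        }
        b := percentileBucket(hist, 0.95)
        fmt.Printf("p95 falls in the %d-%dms bucket\n", b.lowMs, b.highMs)
    }

The same walk gives you p99 or p99.9 by changing p; in practice you'd interpolate within the bucket, but knowing the bucket is usually enough to tell whether your tail is in trouble.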
Anyway, it is a mouthful to explain latency percentiles, but in practice it ends up being an extremely useful measurement. Average latency is just not that important in interactive applications (webapps or otherwise): what you should be measuring is outlier latency. Every service you've ever heard of at Google has pagers set to track high-percentile latency over the trailing 1m or 5m or 10m (etc) for user-facing endpoints.
Coming back to Rails: latency is of course a concern through the entire stack. The reason Rails is so problematic (in my experience) is that people writing gems never seem to realize when they can and should be doing things in parallel, with the possible exception of carefully crafted SQL queries that get parallelized in the database. The Node.js community is a little better in that they don't block on all function calls by convention like folks do in Rails, but it's really all just a "cultural" thing. I don't know off the top of my head how things generally work in Django...
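Just to make the serialized-vs-parallel point concrete, here's a sketch in Go rather than Ruby (fetchUser and fetchFeed are made-up stand-ins for independent backend calls a handler might make): when the sub-calls don't depend on each other, fanning them out concurrently makes the request latency roughly the max of the sub-call latencies rather than their sum, which is exactly what pulls the tail in.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // fetchUser and fetchFeed stand in for independent backend calls
    // (database, RPC, cache, etc.) made while handling one request.
    func fetchUser(id int) string {
        time.Sleep(50 * time.Millisecond) // simulated backend latency
        return fmt.Sprintf("user-%d", id)
    }

    func fetchFeed(id int) []string {
        time.Sleep(70 * time.Millisecond) // simulated backend latency
        return []string{"post-1", "post-2"}
    }

    func handleRequest(id int) {
        var (
            wg   sync.WaitGroup
            user string
            feed []string
        )
        wg.Add(2)
        go func() { defer wg.Done(); user = fetchUser(id) }()
        go func() { defer wg.Done(); feed = fetchFeed(id) }()
        wg.Wait() // latency ~= max(50ms, 70ms), not 50ms + 70ms

        fmt.Println(user, feed)
    }

    func main() {
        start := time.Now()
        handleRequest(42)
        fmt.Println("request took", time.Since(start))
    }

The equivalent is possible in Ruby or Python, of course; the point is that the serial version is what you get by default unless the library and the culture around it push you the other way.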
One final thing: GC is a nightmare for high-percentile latency, and any dynamic language has to contend with it, especially if multiple requests are processed concurrently, which is of course necessary to get reasonable throughput.
In my experience, when using Django or one of the other WSGI-based Python web frameworks, the steps to complete a complex request are serialized just as much as in Rails. The single-threaded process-per-request model, based on the hope that requests will finish fast, is also quite common in Python land.
You mention that GC is a nightmare for high-percentile latency. Isn't this just as much of a problem for Go? Would you continue to develop back-end services in C++ if not for the fact that most developers these days aren't comfortable with C++ and manual memory management?
For my own project, the GC tradeoff with Go (or Java) is acceptable given the relative ease of development w.r.t. C++. Since there are better structures in place to explicitly control the layout of memory, you can do things with freepools, etc., that take pressure off the GC.
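As a minimal sketch of what I mean by a freepool, here's the Go idiom using sync.Pool (the buffer type and handler are made up for illustration): reusing scratch buffers across requests instead of allocating fresh ones keeps allocation churn, and hence GC pressure, down.

    package main

    import (
        "bytes"
        "fmt"
        "sync"
    )

    // bufPool hands out reusable scratch buffers so hot request paths
    // don't allocate (and later garbage-collect) a fresh buffer every time.
    var bufPool = sync.Pool{
        New: func() any { return new(bytes.Buffer) },
    }

    func handle(payload []byte) int {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()            // pooled buffers come back dirty; clear before use
        defer bufPool.Put(buf) // return the buffer for reuse when done

        // Pretend to do some work in the scratch buffer.
        buf.Write(payload)
        return buf.Len()
    }

    func main() {
        fmt.Println(handle([]byte("hello")))
    }

You can do something similar with an object pool in Java; the point is just that you can opt out of per-request allocation on the hot paths without giving up the GC everywhere else.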
For high-performance things like the systems I had to build at Google, I don't know how to make things work in the high percentiles without bringing explicit memory management into the picture. Although it makes me feel like a hax0r to talk about doing work like that, the reality is that it adds 50% to dev time, and I think Go/Clojure/Scala/Java are an acceptable compromise in the meantime.
It is possible to build things that minimize GC churn in python/ruby/etc, of course; I don't want to imply that I'm claiming otherwise. But the GC ends up being slower in practice for any number of reasons. I'm not sure if this is true in javascript anymore, actually... it'd be good to get measurements for that, I bet it's improved a lot since javascript VMs have received so much attention in recent years.
Final point: regardless of the language, splitting services out behind clean protobuf/thrift/etc APIs is advantageous for lots of obvious reasons, but one of them is that, when one realizes that sub-service X is the memory hog, one can reimplement that one service in C++ (or similar) without touching anything else. And I guess that's my fantasy for how things will play out for my own stuff. Ask me how it went in a couple of years :)
Just to be clear, do you mean that writing in C++ and doing manual memory management doubles dev time, or makes it 1.5 times as long as it would be in a garbage collected language?
Also, where does most of that extra dev time go? Being careful while coding to make sure you're managing memory right, or debugging when problems come up?
I don't think that doing manual memory management doubles dev time for experienced devs, no... I just mean that, if you're trying to eliminate GC hiccups by, say, writing a custom allocator in C++ (i.e., exactly what we had to do with this project I was on at Google), it just adds up.
I.e., it's not the manual memory management that's expensive per se, it's that manual memory management opens up optimization paths that, while worthwhile given an appropriately latency-sensitive system, take a long time to walk.