That response actually illustrates my point, oddly enough; the median has comforted you into organizationally disregarding the super interesting data you're getting from the percentile aggregations. Those blowouts are far more interesting than you're giving them credit for, I wager.
p99 blowout is generally an operational smell, even on low-RPS endpoints.
I think we might be talking past each other a little :)
The median has absolutely not "comforted" us, nor do we disregard the percentile data. In fact, we have a whole perf team basically dedicated to looking at 95th/99th and other slow transaction trace data, and fixing those issues.
In big projects serving tens of millions of users with a handful of devs, you've gotta pick your battles; you can't fix everything all the time. We do our best, just like anyone else. So obviously we agree on the importance of percentiles. Don't make assumptions :)
And all of this is rather irrelevant to the main point now, which is overall _language_ speed.
It's not, actually, and I'm not sure why you think I'm talking past you. Your claim was that Ruby is comparable performance-wise to Go in your application, and you cited median latency to make that case. I pointed out that interpreted languages often have a long tail, and that the long tail is the more interesting part and undermines the comparison. You replied that I am correct and that, indeed, you do have a long tail, but you wrote it off as "stuff that happens" regardless of language. I am replying that the "stuff that happens" is the interesting stuff that undermines the flawed comparison you attempted to make, and I disagree that all of it can be written off to language-independent concerns; if you investigate it further, you might find that some of it comes from the choice of language itself.
I have not wandered off topic into irrelevancy, nor talked past you. This is still addressing the point you attempted to make by citing median latency in defense of the performance characteristics of your chosen language, in order to assuage general hesitance to adopt Ruby over performance concerns. I think you'll find that your data does not support your claim if you dig into that p99.
(I share that experience from handling tens of millions of users myself, which is all I can admit to publicly, so that request re: assumptions goes both ways.)
To recap: claim submitted with flawed supporting data; thread addressing why the supporting data is flawed. No talking past is taking place.
I never claimed Ruby was as fast as Go. Our Go services run at about 5-15ms (excluding 95/99 etc.), so they're definitely faster (they do a lot less too!). I'm just saying Ruby (or other dynamic langs, I don't mind, I don't have a "chosen language") can make a decently fast product API, and that's all.
And of course we've investigated further... (this is anecdotal, I know), but of all the degenerate 99th cases we've investigated/solved, I can only think of 2 or 3 that were language-related, rather than logic, DB, N+1s, etc. If you're curious, those were:
- degenerate GC behaviour in a specific circumstance
- JSON serialization (specifically crossing the C/ruby boundary shiteloads of times to serialize an object)
Sure, these issues sucked, but we mitigated/sidestepped them for the most part (rough sketch of the JSON one below).
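For the curious, the fix looked roughly like this; illustrative names only, not our actual code, and it assumes `records` is a list of objects that respond to `to_h`:

```ruby
require 'json'

# The kind of pattern that crosses the C/Ruby boundary a lot: the JSON
# generator gets entered once per record, with Ruby-side string shuffling
# in between.
def serialize_piecewise(records)
  "[" + records.map { |r| r.to_h.to_json }.join(",") + "]"
end

# The kind of fix: build one plain Ruby structure and hand it to the
# serializer once, so the whole payload is generated in a single pass.
def serialize_in_one_pass(records)
  records.map(&:to_h).to_json
end
```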
It's all a trade-off when you're trying to decide what to use for your project; right tool for the job, etc. I'm not evangelizing for or against anything.
Curious: what would you propose instead to roughly compare the speed of a product API? (Keeping in mind my anec-data about nearly all of our 99ths being logic/DB issues.) Comparing 99ths (from my experience) would be more like comparing which codebase has more bugs, because that's how we treat degenerate cases.
Fine, you claimed the performance difference "matters less than [one thinks]," which is pretty much the same thing if you really step back and think about it. It's also wrong for a whole cornucopia of reasons in general, but I'm choosing to focus on the supporting data you used to make that claim.
> Our Go services run at about 5-15ms (excluding 95/99 etc.),
See, nobody can resist middle-ground aggregations to describe things, even in a thread about middle-ground aggregations! They are such a cognitive trap. You should say "our Go services run at about 5-15ms half the time," because that would be more accurate, if my assumption about where that aggregation comes from is right (and I'm guessing it's a gutstimate of the median). And, again, the 95th and 99th are super interesting, particularly when describing the performance characteristics of a latency-sensitive service, and it's a disservice to omit them.
I will absolutely say "about ___ms" and refer to my 99th, and let people think I'm telling them the average. To me, the 99th is my average. (Normally I'd ignore this as pedantry, by the way, but it's the subject of the thread...)
> - JSON serialization (specifically crossing the C/ruby boundary shiteloads of times to serialize an object)
All of the terrible code in the world that handles JSON is one of my favorite "make this app go faster" targets. Its ease and ambiguity are its downfall, because people write genuinely awful code to interact with it. Most of that code is in language standard libraries. I will stand by that remark no matter how much you challenge me on it.
I'm of the fun school of thought now where I treat JSON as an external hot potato: catch it at the edge and repack it into something sane like protobuf or Thrift internally. If your internal services are communicating in JSON, you are wasting a lot of cycles and bandwidth for pretty much no reason. You can switch to MsgPack and get an immediate win if IDLs aren't your thing, or Cap'n Proto and get an even bigger win if they are. If you like being on the fun train, protobuf3+gRPC is a pretty fun environment. This complaint even applies to Go: because Go shipped JSON in the standard library with clever reflection, now everybody wants to lazily expect JSON configuration files which map cleanly to their internal config struct (please, stop doing this and write configuration formats that don't completely suck; looking at you, CoreOS).
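In Ruby terms, the pattern looks something like this sketch (assumes the msgpack gem; `handle_public_request` and `send_over_internal_transport` are made-up placeholders, not any real API):

```ruby
require 'json'
require 'msgpack' # assumes the msgpack gem is available

# Edge: speak JSON, because that's what the outside world sends you.
def handle_public_request(raw_body)
  payload = JSON.parse(raw_body)
  reply   = call_internal_service(payload)
  JSON.generate(reply)
end

# Internal hop: same data, repacked as MessagePack instead of being
# re-encoded as JSON on every service-to-service call.
def call_internal_service(payload)
  wire  = MessagePack.pack(payload)
  bytes = send_over_internal_transport(wire) # placeholder, not a real API
  MessagePack.unpack(bytes)
end
```

Same shape works with protobuf or Thrift if you want an IDL in the mix; the point is just that JSON stops at the edge.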
Does serialization really matter, you ask? Profile your application and watch how much time it spends dealing with JSON. I've seen switching away from JSON remove the need for entire machines at scale. Whole machines. Because that many cores were freed up by not making every single instance spend 5-6% of its time marshaling and unmarshaling data.
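If you want a quick first-order sanity check before breaking out a real profiler, a toy benchmark like this gets the point across (stdlib Benchmark plus the msgpack gem; the payload shape is made up, so your numbers will differ):

```ruby
require 'json'
require 'msgpack'
require 'benchmark'

# Stand-in payload; swap in something shaped like your real responses.
payload = {
  "user_id" => 123,
  "items"   => Array.new(50) { |i| { "sku" => "sku-#{i}", "qty" => i } }
}

n = 50_000
Benchmark.bm(10) do |x|
  x.report("json")    { n.times { JSON.parse(JSON.generate(payload)) } }
  x.report("msgpack") { n.times { MessagePack.unpack(MessagePack.pack(payload)) } }
end
```

That's the toy version; the convincing numbers come from profiling the real app and seeing what fraction of CPU lives inside the JSON calls.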
> Comparing 99ths (from my experience) would be more like comparing which codebase has more bugs, because that's how we treat degenerate cases.
In other words, I was correct about organizational comfort with how you interpret 99th percentile latencies.
The 99th is not your bucket of degenerate cases. The poor souls in that 1% are your degenerate cases, and they are users too. A bad 99th percentile latency is bad, no matter how you justify it. Most folks write off the 99.9th; writing off the 99th is a bit strong, especially if we're talking about your 10MM+ (M?)AU app. 1% of requests is a shitload of requests if your volume is as high as your audience description implies. I'm weird in that I consider a strange 99th+ interesting data worthy of investigation, but I think that should be the norm, too.
As for comparing the performance characteristics of two separate apps, the metrics I'd start with are RPS, TTFB, and TTLB. For the times I mentioned: σ (my personal favorite) as well as the 50th, 75th, 90th, 95th, 99th, and 99.9th percentiles. Those are the externally interesting ones. I also want to know how many cores are running it, how much RAM it consumes, and a whole bunch of other stuff on the inside. But it's a poor comparison at best, really; no two codebases are directly comparable, which I think you already know.
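Concretely, given raw per-request latency samples (which, granted, not every metrics stack will hand you), the summary I'd want looks like this sketch; nearest-rank percentiles, and the function name is mine, not any particular library's:

```ruby
# Summarize a flat list of request latencies (in ms) into σ plus the percentiles above.
def latency_summary(samples_ms)
  sorted = samples_ms.sort
  mean   = sorted.sum / sorted.size.to_f
  sigma  = Math.sqrt(sorted.sum { |v| (v - mean)**2 } / sorted.size)

  # Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
  pct = lambda { |p| sorted[[(p / 100.0 * sorted.size).ceil - 1, 0].max] }

  {
    sigma: sigma,
    p50: pct.call(50), p75: pct.call(75), p90: pct.call(90),
    p95: pct.call(95), p99: pct.call(99), p999: pct.call(99.9)
  }
end

latency_summary([12, 14, 15, 18, 22, 250]) # => includes a p50 of 15 and a p99 of 250
```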