I've done a lot of ruby and go. Honestly, I would think twice before building a ...

jsmthrowaway · on May 1, 2016

Median latency is a completely useless number. There is literally no use for it, ever. Satan invented median latency to confuse people, because it lies to you and whispers sweet nothings in your ear. Averaging latencies in your monitoring leaves the outliers out in the cold because you will never see them, and tail latency is way more important for user experience.

Quantile your latency. I suspect you'll find 99th tells an interesting story between Ruby vs. Go. The best latency graph has five or six lines at different percentiles (yes, including 50) stacked in a latency sandwich. I'm willing to bet your app has a rough 99th. Maybe even a rough 95th. Nothing against Rails (honest, this applies to most interpreters), but most scripted stuff has a pretty rough latency tail, and you ignore that with median. (Go is not off the hook either; GC pauses can blow out high percentiles if you're memory-heavy.)

I'm pretty passionate about this, sorry. I'm on a mission from God to purge our mortal plane of average/median latency, because it's one of those misapplications of statistics that everybody does without a second thought. Don't even collect it. Average latency is unimportant and misleads people, and misleads people more as you get more popular. It's that 1% of clients sitting there for 2sec that impacts perception of your UX.

dalyons · on May 1, 2016

Sure, and I thought twice before putting that in :) We of course monitor our 95th & 99th primarily, for reasons that you said. But I didn't quote those because they're usually blown out by slow db queries, n+1 selects,n+1 cache gets, etc, issues that happen regardless of language. often they're on rarely used end points, where the cost-benefit of optimising isn't worth it. I feel median gives a resonable proxy for the language/framework execution capabilities.

jsmthrowaway · on May 1, 2016

That response actually illustrates my point, oddly enough; median has comforted you to organizationally disregard the super interesting data you're getting from the percentile aggregations. Those blowouts are far more interesting than you're saying, I wager.

p99 blowout is generally an operational smell, even on low-RPS endpoints.

dalyons · on May 2, 2016

I think we might be talking past each other a little :)

The median has absolutely not "comforted" us, nor do we disregard the percentile data. In fact, we have a whole perf team basically dedicated to looking at 95/9th & other slow transaction trace data, and fixing these.

In big projects, serving 10's of millions of users, with a handful of devs, you've gotta pick your battles, you cant fix everything always. We do our best, just like anyone else. So obviously we agree with the importance of percentiles. Dont make assumptions :)

And all of this is rather irrelevant to the main point now, which is overall _language_ speed.

jsmthrowaway · on May 2, 2016

It's not, actually, and I'm not sure why you think I'm talking past you. Your claim was that Ruby is comparable performance-wise to Go in your application, and you cited median latency to make that case. I pointed out that interpreted languages often have a long tail and that the long tail is more interesting and undermines the comparison. You replied that I am correct and, indeed, you do have a long tail but you wrote it off as "stuff that happens" regardless of language. I am replying that the "stuff that happens" is the interesting stuff that undermines the flawed comparison you attempted to make, and I disagree that all of it can be written off to language-independent concerns; if you investigate it further, you might find some of it comes from the choice of language itself.

I have not wandered off topic into irrelevancy nor talked past you. This is still addressing the point you attempted to make by citing median latency in defense of the performance characteristics of your chosen language, in order to assuage general hesitance to adopt Ruby over performance concerns. I think you'll find that your metrics do not support your data if you dig into that p99.

(I share that experience from handling tens of millions of users myself, which is all I can admit to publicly, so that request re: assumptions goes both ways.)

To recap: claim submitted with flawed supporting data, thread addressing why the supporting data is flawed. No talking past taking place.

dalyons · on May 2, 2016

Never claimed ruby was as fast as go. Our go services run at about 5-15ms(excluding 95/99 etc), so they're def faster(they do a lot less too!). I'm just saying ruby (or other dynamic langs, i dont mind, I dont have a "chosen language") can make a decently fast product API, and thats all.

And of course we've investigated further... (this is anecdotal i know), but of all the degenerate 99th cases we've investigated/solved, I can only think of 2 or 3 that were lang related, rather than logic, db, n+1s, etc. If you're curious, those were: - degenerate GC behaviour in a specific circumstance - JSON serialization (specifically crossing the C/ruby boundary shiteloads of times to serialize an object) Sure these issues sucked, but we mitigate/sidestepped those issues for the most part.

It's all a trade off when you're trying to decide what to use for your project; right tool for the job etc. Im not evalganizing for or against anything.

Curious what would you propose instead to roughly compare speeds of a product API? (Keeping in mind my anec-data about nearly all of 99ths being a logic/db issue). Comparing 99ths (from my experience) would be more like comparing which codebase has more bugs, because thats how we treat degenerate cases.

jsmthrowaway · on May 2, 2016

> Never claimed ruby was as fast as go.

Fine, you claimed the performance difference "matters less than [one thinks]," which is pretty much the same thing if you really step back and think about it. It's also wrong for a whole cornucopia of reasons in general, but I'm choosing to focus on the supporting data you used to make that claim.

> Our go services run at about 5-15ms(excluding 95/99 etc),

See, nobody can resist middle-ground aggregations to describe things, even in a thread about middle-ground aggregations! They are such a cognitive trap. You should say "our Go services run at about 5-15ms half the time," because that'd be more correct if my assumption about where that aggregation is coming from is correct (and I'm guessing it's a gutstimate of median). And, again, 95th and 99th are super interesting, particularly when describing the performance characteristics of a latency-sensitive service, and it's a disservice to omit them.

I will absolutely say "about ___ms" and refer to my 99th and let people think I'm telling them average. To me, 99th is my average. (Normally I'd ignore this as pedantry, by the way, but it's the subject of the thread...)

> - JSON serialization (specifically crossing the C/ruby boundary shiteloads of times to serialize an object)

All of the terrible code in the world that handles JSON is one of my favorite "make this app go faster" targets. Its ease and ambiguity is its downfall, because people write genuinely awful code to interact with it. Most of that code is in language standard libraries. I will stand by that remark no matter how much you challenge me on it.

I'm of the fun school of thought now where I treat JSON as an external hot potato and throw it in a sane bag at the edge like protobuf or Thrift internally. If your internal services are communicating in JSON you are wasting a lot of cycles and bandwidth for pretty much no reason. You can switch to MsgPack and get an immediate win if IDLs aren't your thing, or CapnProto and get an even bigger win if they are. If you like being on the fun train, protobuf3+grpc is a pretty fun environment. This complaint even applies to Go even though Go shipped JSON in the standard library with clever reflection, so now everybody wants to lazily expect JSON configuration files which map cleanly to their internal config struct (please, stop doing this and write configuration formats that don't completely suck; looking at you, CoreOS).

Does serialization really matter, you ask? Profile your application and watch how much time it spends dealing with JSON. I've seen switching away from JSON remove the need for entire machines at scale. Whole machines. Because that many cores were freed up by not making every single instance spend 5-6% of its time marshaling and unmarshaling data.

> Comparing 99ths (from my experience) would be more like comparing which codebase has more bugs, because thats how we treat degenerate cases.

In other words, I was correct about organizational comfort with how you interpret 99th percentile latencies.

99th are not your degenerate cases. The poor souls in the 1% are your degenerate cases, and they are users too. A bad 99th percentile latency is bad, no matter how you justify it. Most folks write off 99.9th latency; 99th is a bit strong, especially if we're talking about your 10MM+ (M?)AU app. 1% of requests is a shitload of requests if your volume is as high as your audience description implies. I'm weird in that I consider a strange 99th+ as interesting data worthy of investigation, but I think that should be the norm, too.

As for comparing the performance characteristics of two separate apps, the metrics I'd start with are going to be RPS, TTFB, and TTLB. For the times I mentioned, σ (my personal favorite) as well as 50th, 75th, 90th, 95th, 99th, and 99.9th percentiles. Those are the externally-interesting ones. I also want to know how many cores are running it, how much RAM it consumes, and a whole bunch of other stuff on the inside. But it's a poor comparison at all, really; no two codebases are directly comparable, which I think you already know.

bumbledraven · on May 1, 2016

Grandparent's point is that 99th percentile latency is, for their application, unrelated to the ruby-vs-Go comparison.

IanCal · on May 1, 2016

> Average latency is unimportant

That feels like an overly-bold claim. Improving your 99th percentile latency from 2 seconds to 1 second may not be worth it if you bring your average latency up from 0.05 seconds to 0.99 seconds.

> It's that 1% of clients sitting there for 2sec that impacts perception of your UX.

Well, surely that depends on what endpoints those are and what people are expecting from them. A 2s wait for a whole rendered dashboard page of your entire organisation may not really be a concern. A 2s wait for 'find out more about our fast CDN' might really harm sales.

Now, if your point was "you shouldn't only look at average latencies" then you're entirely right, but I cannot see how they're irrelevant. The overall distribution is important and I'd actually recommend that people look at this shape. Just picking one percentile is always going to be misleading because you're throwing away a vast amount of data.

jsmthrowaway · on May 2, 2016

That actually was my point, and at no point did I say only look at one percentile. I in fact said look at five or six, stacked, including median/p50. I implied that median latency is only useless by itself, or so I thought, so I apologize if that was unclear. It is perfectly fine in concert with other aggregations. This is the type of graph I mean:

       |                 ====   p99
       |=================       
    ms |+++++++++++++++++++++   p95
       |.....................   p75
       |`````````````````````   p50
       |---------------------
        t ->

That outlier jump might be a production emergency, such as a database server dying or something. Yes, really. Had you only graphed median here, you would have missed it until some other alarm went off.

You gain a lot from this. Visually, you can see how tight your distribution is as the rainbow squeezes together. Narrower the better. Every time. In fact, very often the Y axis is irrelevant, and here's why:

Reining in your p99 that far at the expense of a higher average is, oddly, a win. That might surprise you but it is borne out in practice, because at scale only the long tail of latency matters. A wide latency distribution is problematic, both for scaling and for perception. A very narrow latency distribution is actually better. If you can trade off a bit higher latency for less variance/deviation, it will be a win every time. Weird, I know. User perception is weird and, as you point out, the rules change per-endpoint. Perception tends to evolve, too. As a rule of thumb, though, tighter distribution of latency is always better and how far you can push median to get there is your business decision and user study environment.

To borrow one of Coda Hale's terms[0], the scenario you presented demonstrates the cognitive hazard of average and is actually my root point. The average came up, yes. That is not necessarily bad (at all!), but "average" intuitively tells you that it maybe should be. In this case, it is misleading you, because the exact scenario you presented might be a strong win depending on other factors. A 99th of 1000ms with a 990ms average is a really tight latency distribution so it is fairly close to ideal for capacity planning purposes. It blew my mind when I discovered that Google actually teaches this to SREs because, like you, how unintuitive it is threw me off.

It's hard to swallow that a 990ms average might be better than 50ms. Might be. Average doesn't tell you. That's why it sucks. Not just for computing, either; average is pretty much the worst of the aggregations for cognitive reasons and is really only useful for describing the estimated average number of times you fart in a day to friends, because it is quite overloaded mentally for many people without experience in statistics.

[0]: https://www.youtube.com/watch?v=czes-oa0yik

stock_toaster · on May 1, 2016

Your cause is just. Keep fighting the good fight!

yxhuvud · on May 1, 2016

> Median latency is a completely useless number. There is literally no use for it, ever. Satan invented median latency to confuse people, because it lies to you and whispers sweet nothings in your ear. Averaging latencies in your monitoring leaves the outliers out in the cold because you will never see them, and tail latency is way more important for user experience.

Umm, you do not get the median latency by averaging anything. Median latency is just the 50th percentile. It is definitely not one of the interesting ones to reason about or care about improving, but not valueless to measure. It is interesting to have if you are graphing your latency curves, to make an example.

jsmthrowaway · on May 1, 2016

I think if you reread my comment you'll find that what you're saying does not actually disagree and is a restating of what I said.

yxhuvud · on May 1, 2016

It is not. Your comment confuses the median with the mean latency. The comment you replied to did not mention using averages - it mentioned only using the median. You introduced scathing critique of using averages when those are only used for means and therefore totally irrelevant here (even if I totally agree that people that use them should be educated. But that wasn't the case here).

jsmthrowaway · on May 1, 2016

I actually don't, and you're reacting to the use of "averaging" as a verb. That's why I encouraged you to reread. When you're discarding 50% of the samples in a 50th percentile median, I think "averaging" is an acceptable verb to proxy the situation in English since "medianing" isn't a word. I could have said "computing the median," but that's just tedious.

Notice later in the comment I say average/median, implying that they are separate but related concepts. I think it's safe to assume that someone who can conversationally use the word "quantile" is not confusing median with mean. You're assuming that my (intentional) selection of a lower-fidelity term to describe a concept, which I carefully illustrated with ancillary points to give specificity to said concept, demonstrates a misunderstanding of the very field I'm explaining. That is pretty obviously wrong and a bit condescending.

We agree, which is what's frustrating. You're just latching on to a pedantic correction of my point, and rather than belabor that correction I encouraged you to reread to see that we do actually agree. Now, I do see average latency far more often than I'd care to admit, which is why I got lazy and just said "average" at the end there once I started referring to generality instead of specificity, but I think it's pretty clear that I understand the distinction regardless.

skj · on May 2, 2016

In statistician-land, the median is simply one way of averaging, as is the mean. Introducing (or at least using) the extra term takes care of some ambiguity.

pstuart · on May 1, 2016

> debugging is way harder. Hope you like printf.

You have other options: https://github.com/derekparker/delve

jud_white · on May 2, 2016

> having no stack for errors is insanely frustrating.

Check out https://github.com/pkg/errors

If you want more rich control of the output of the stack trace there's https://github.com/go-stack/stack

dalyons · on May 2, 2016

Yeah, we've started using that. Kind of amazing that you need an addon lib & Wrap() everywhere to make errors useful. Just more boilerplate! :)

_pfxa · on May 1, 2016

> - debugging is way harder. Hope you like printf.

Can't you use gdb with Go? https://golang.org/doc/gdb

spriggan3 · on May 1, 2016

The second you launch a Go routine, gdb becomes useless.

_pfxa · on May 1, 2016

How do you debug then? (I'm not a Go coder, tho I'm familiar a bit with it)

sacado2 · on May 1, 2016

Oh, that's simple. We gophers don't make any mistake. Ever. It makes our lives easier.

jsmthrowaway · on May 2, 2016

This is wrong. The answer is obviously "Go is debugged in production with extensive logging," so I question whether you're actually a gopher.

dmitshur · on May 1, 2016

See another answer in this thread, https://news.ycombinator.com/item?id=11607829.