Even P99 estimates averaged from many tiny samples are at worst inaccurate; they won't be biased.
The problem you talk about is one of granularity: you don't want to know how bad things are in general, you want to know how bad things are when they are actually bad. That's perfectly legitimate, but it's a different metric altogether that just happens to also be based off of the 99th percentile. It depends on your definition of "the distribution".
This probably sounds like nitpicking, but people should know what they're talking about when they call out others for being wrong.
This looks biased to me even though the distribution is uniform. I did not expect that. Maybe the quantile function is broken.
EDIT: I've installed R and run your code. You used a very low number of samples in the mean, so I changed it a bit: (1) I use the uniform distribution, so we know the true value of the 99th percentile is 0.99; (2) I've increased the number of samples:
x <- runif(100000)
quantile(x, 0.99)
mean(sapply(1:2000, function(i) {
  j <- i * 50
  # note the parentheses: (j-49):j, not j-49:j -- ":" binds tighter than "-" in R
  quantile(x[(j-49):j], 0.99)
}))
Running this multiple times, I've found that the mean is always lower than the quantile over the full sample.
Actually, doh, you're right, my bad: it turns out sample quantiles are only asymptotically unbiased, and a better estimator would take some sort of weighted average of P99 and P99.5, with the weights depending on the sample size. (You still don't need the full population, and you don't have to merge histograms, though.)
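The asymptotic part is easy to see by repeating the experiment at several batch sizes. This is my own Python/NumPy sketch (not code from the thread), assuming uniform samples and NumPy's default linear-interpolation quantile: the averaged small-batch P99 sits below the true value, and the gap shrinks as the batches grow.

```python
# Small-sample bias of the P99 estimate, shrinking as batch size grows.
import numpy as np

rng = np.random.default_rng(42)
true_p99 = 0.99  # 99th percentile of Uniform(0, 1)

for batch_size in (50, 500, 5000):
    # 2000 independent batches of the given size
    batches = rng.uniform(size=(2000, batch_size))
    # P99 per batch, then averaged -- the "mean of tiny-sample P99s"
    avg_estimate = np.quantile(batches, 0.99, axis=1).mean()
    print(batch_size, true_p99 - avg_estimate)  # gap is positive and shrinks
```

The gap here falls roughly like 1/n, which is what "asymptotically unbiased" buys you: with 50-sample batches the averaged estimate is visibly low, with 5000-sample batches it is nearly on target.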
I find this part of statistics very hard to argue about. Simulations only work with well-known distributions, and even then not all of them. Even when you have a simulation that confirms hypothesis A1 using distribution D1, it says little about some pathological distribution D2.
Getting it mathematically precise might take years, and often there is no easy answer.
What I've been looking for is something that is fast and works "good enough". Estimating the 95th percentile and averaging it did not work well in my use case. Using the histogram method does work well, although it certainly is not perfect.
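For reference, the histogram method can be sketched like this. This is a minimal Python illustration under my own assumptions about the bucket layout (fixed upper edges, one overflow bucket), not the commenter's actual implementation: you keep per-bucket counts, merge histograms by adding counts element-wise, and read any percentile off the cumulative counts, accurate to one bucket width.

```python
import bisect
import random

class Histogram:
    """Fixed-bucket histogram; merging two is just element-wise count addition."""

    def __init__(self, upper_edges):
        self.edges = sorted(upper_edges)
        self.counts = [0] * (len(self.edges) + 1)  # last slot = overflow bucket

    def record(self, value):
        # index of the first bucket whose upper edge is >= value
        self.counts[bisect.bisect_left(self.edges, value)] += 1

    def percentile(self, q):
        """Upper edge of the bucket containing the q-th percentile."""
        target = q / 100 * sum(self.counts)
        cumulative = 0
        for edge, count in zip(self.edges, self.counts):
            cumulative += count
            if cumulative >= target:
                return edge
        return float("inf")  # fell into the overflow bucket

# Usage sketch: latencies in 0-1000 ms, 10 ms buckets
h = Histogram(range(10, 1010, 10))
for _ in range(10_000):
    h.record(random.uniform(0, 1000))
print(h.percentile(95))  # near 950, quantized to a bucket edge
```

The trade-off is exactly the "good enough" one mentioned above: the error is bounded by the bucket width, and unlike averaging per-batch percentiles, merged histograms give the same answer as one histogram over the full stream.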
Thanks, this is informative and pretty similar to what I've been implementing in JavaScript. I guess I'll have to take a deeper look into literature when doing an update to that topic.