Coming from a Rust game development background, something that's noticeable to me is that sampling-based profiling solves a different problem than the kinds of questions games need answered.
With sampling you can say "I saw function() in 100/150 samples (10ms interval between samples), so spend your time looking at that!".
That doesn't really work for games, where your work is on the scale of 0.1ms and you might call the same function hundreds of times in 16ms. Calling update_object() 1000 times a frame isn't actually a problem when it's only taking 5% of your overall frame budget (on average). Games also have the constraint that profiling mustn't throw off the game speed itself; it needs to run very fast.
With games, what you really want to know are:
1. Which frames are taking over 16ms? (On the current machine; what's fast enough on one computer isn't on weaker machines. Sometimes you want to look at every frame, and not just the slow ones.)
2. What functions are causing them to go over?
3. Are those functions _usually_ that slow? How fast do they usually run? Was this just an uncommon edge case? What do the mean/median/90th-percentile numbers look like?
With typical applications, you instead want to know:
* Of the many diverging branches and tasks my application can do, which areas occur the most often, and are therefore worth spending time optimizing?
Sampling-based profiling obviously works better for applications like this, while non-sampling profiling works better for games. Sadly, there don't seem to be very many GUIs/APIs for working with the latter, especially with Linux compatibility. I'm hoping to remedy this, as soon as I get better at GTK.
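To make that concrete, here's a minimal sketch (in Rust, with made-up names) of the kind of frame-scoped instrumentation I mean: wrap the hot calls, flag frames that blow the 16ms budget, and keep the per-frame timings around so you can ask for mean/median/90th percentile afterwards.

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    /// Per-function timings, bucketed by frame (all names illustrative).
    #[derive(Default)]
    struct FrameProfiler {
        current: HashMap<&'static str, Duration>,
        frames: Vec<HashMap<&'static str, Duration>>,
    }

    impl FrameProfiler {
        /// Time one call and attribute it to the current frame.
        fn scope<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
            let start = Instant::now();
            let out = f();
            *self.current.entry(name).or_default() += start.elapsed();
            out
        }

        /// Close out the current frame; flag it if it blew the 16ms budget.
        fn end_frame(&mut self) {
            let total: Duration = self.current.values().sum();
            if total > Duration::from_millis(16) {
                println!("slow frame #{}: {:?}", self.frames.len(), self.current);
            }
            self.frames.push(std::mem::take(&mut self.current));
        }

        /// Mean/median/90th percentile for one function across all frames
        /// (assumes the function ran in at least one frame).
        fn stats(&self, name: &'static str) -> (Duration, Duration, Duration) {
            let mut ds: Vec<Duration> =
                self.frames.iter().filter_map(|f| f.get(name).copied()).collect();
            ds.sort();
            let mean = ds.iter().sum::<Duration>() / ds.len() as u32;
            (mean, ds[ds.len() / 2], ds[ds.len() * 9 / 10])
        }
    }

The point is that the data stays bucketed by frame, which is exactly the structure a whole-program sampling profile throws away.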
EDIT: An example of why sampling-based profilers (in this case, perf) aren't great for games. Although sandbox::sandbox::Sandbox::temperature_update takes up a decent amount of overall time, it's hard to tell whether that matters per frame or not.
https://profiler.firefox.com/public/c4y22fect9qg9mfm0wv0pjwe...
> That doesn't really work for games, where your work is on the scale of 0.1ms and you might call the same function hundreds of times in 16ms. Calling update_object() 1000 times a frame isn't actually a problem when it's only taking 5% of your overall frame budget (on average). Games also have the constraint that profiling mustn't throw off the game speed itself; it needs to run very fast.
I'm confused, because the thing sampling measures is how much of your frame budget goes to each function. And sampling works equally well regardless of how fast your functions are.
But I do agree that looking for variance and focusing on frames that take too long are important.
Say you sample every 10ms. This frame, and every frame for the sake of example, looks like this:
* f1() - 1ms
* f2() - 8ms
* f3() - 4ms
* wait_for_next_frame() - 3ms
After 10ms, you'll sample and see f3() is running. Now you seem to think, oh, I should optimize f3(). But f2() is clearly the main cost of the frame. You can totally increase the frequency, so now you sample every 0.1ms or something, and that's probably fine. But you probably want to just instrument your code and get more accurate results at that point.
My main point is that, out of the box, typical sampling profilers:
A: Don't sample fast enough.
B: Don't display the data split up by frame.
That's why you take thousands of samples. Most of them will hit f2 if you do it right.
If your sampler's timer is running perfectly in sync with your framerate, that's a problem, but it's not a "not enough samples" problem. Fix it by taking samples every 10.05-10.07 milliseconds instead.
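To check this, here's a tiny simulation in Rust (made up, not a real profiler) of the frame above: f1 occupies [0,1)ms, f2 [1,9)ms, f3 [9,13)ms, and wait_for_next_frame [13,16)ms.

    // Count which function each sample lands in, for a repeating 16ms frame.
    fn simulate(period_us: u64, n_samples: u64) -> [u64; 4] {
        let mut counts = [0u64; 4];
        for i in 1..=n_samples {
            let t = (i * period_us) % 16_000; // offset within the frame, in us
            let idx = match t {
                0..=999 => 0,        // f1
                1_000..=8_999 => 1,  // f2
                9_000..=12_999 => 2, // f3
                _ => 3,              // wait_for_next_frame
            };
            counts[idx] += 1;
        }
        counts
    }

    fn main() {
        println!("16.00ms: {:?}", simulate(16_000, 100_000));
        println!("10.00ms: {:?}", simulate(10_000, 100_000));
        println!("10.05ms: {:?}", simulate(10_050, 100_000));
    }

At a period of exactly 16ms every sample lands at the same frame offset (here, all of them hit f1), which is the in-sync pathology. At 10ms the coarse offset grid still distorts the small buckets (f1 is overcounted), while at 10.05ms the counts converge to the true 1/16, 8/16, 4/16, 3/16 shares; either way f2 dominates, as expected.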
Have you tried a flameshot graph? It'll give you a nicer visualization of the relative amount of CPU something took. Sampling profilers are but one thing in the toolbox. On Windows you can use WPA for extremely detailed tracing, and it can link CPU work to GPU work (the tooling can be awkward, as the GPU side hasn't seen love in a while). Google is tackling this problem for Android via Perfetto, although the tooling isn't as mature, and the GPU stuff will probably stay Android-only unless Nvidia/AMD extend it for the desktop (seeing as Stadia is being deprioritized).
I can't find any references to flameshot graphs; do you mean flamegraphs? They have the same issue: they tend to show the whole program, not per frame.
Perfetto does look interesting though, thank you for the reference.
For interval-oriented data that you want to query at different levels of detail with a query window, there's a trick I used in a commercial profiler which you can easily bolt on top of any ordered search tree or database system.
You have a separate tree for each power-of-two scale. The tree corresponding to scale k contains intervals with length between 2^k and 2^(k+1). An interval [u, v] is put in the tree corresponding to scale floor(lg(v - u)) under the key u. If you have a query window [a, b] you turn it into a range query a - 2^(k+1) <= u <= b for each relevant scale k (the lower bound has to cover the longest interval stored at that scale) and filter matches by interval overlap between [u, v] and [a, b]. "Each relevant scale" is defined by the desired level of detail, e.g. you don't query scales which are too small relative to the current zoom level.
Each scale actually has two trees. One type is what I just described. The other type is a summary tree which contains bounding intervals with recursive summaries of all intervals at finer scales. When you cut off queries at scale k you also include data from the summary tree at scale k so that all data in the query window is returned, either verbatim or as part of the bounding intervals with summaries.
An efficient, convenient way of implementing this is to stream the time series data to disk in separate append-only files, one per scale. You can start doing interval queries immediately with a reasonable slowdown by directly doing binary search on the raw scale-segregated files (if the records are variable length you can use a COBS-style encoding so you can sync to the nearest record from a random point in the file). That said, the write amplification for a high-fanout B+ tree index is tiny here and building it in a streaming fashion is basically free, so generating the index synchronously is my recommendation. You can also generate the summary files in a streaming fashion, but since they involve significant write amplification it's worth decoupling that task to a process with lower IO priority.
To implement the interval query for [a, b] at a given scale you use the B+ tree index to find the starting position in the append-only record file for that scale using the a - 2^(k+1) lower bound, and just do a linear scan in the file until you exceed b.
There's a bunch of variations and optimizations (e.g. coarser scales than powers of two) but that's the gist of it.
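For illustration, here's a minimal in-memory sketch of the basic (non-summary) part in Rust, using BTreeMap as the ordered search tree; the on-disk files, B+ tree index, and summary trees are left out.

    use std::collections::BTreeMap;

    /// Intervals bucketed by power-of-two length scale; each bucket is an
    /// ordered map from start time u to the end times v of intervals at u.
    #[derive(Default)]
    struct ScaleStore {
        scales: BTreeMap<u32, BTreeMap<u64, Vec<u64>>>,
    }

    impl ScaleStore {
        fn insert(&mut self, u: u64, v: u64) {
            let k = 63 - (v - u).leading_zeros(); // floor(lg(v - u)), v > u
            self.scales.entry(k).or_default().entry(u).or_default().push(v);
        }

        /// All intervals overlapping [a, b] at scale >= min_scale (the
        /// level-of-detail cutoff for the current zoom).
        fn query(&self, a: u64, b: u64, min_scale: u32) -> Vec<(u64, u64)> {
            let mut out = Vec::new();
            for (&k, tree) in self.scales.range(min_scale..) {
                // Intervals at scale k are shorter than 2^(k+1), so anything
                // overlapping [a, b] must start at or after a - 2^(k+1).
                let lo = a.saturating_sub(1 << (k + 1));
                for (&u, ends) in tree.range(lo..=b) {
                    for &v in ends {
                        if v >= a { // filter by actual overlap
                            out.push((u, v));
                        }
                    }
                }
            }
            out
        }
    }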
We were thinking we could integrate Pyroscope with CI tools, so that you would be able to see how much time is spent in different tests. This seems to be an area where a lot of people are looking for quick ways to cut time.
But yeah, having lines highlighted in VS Code would be even cooler.
The Jest VSCode extension can run the tests and highlight the relevant uncovered lines and annotate failed/successful tests. Maybe helpful to look into their implementation.
Competitive programming problems are often designed with the intended solution technique in mind first, so whatever the trendy thing to do is finds its way into the competitions. Segment trees used to be popular, but less so now, because they're unfair to some languages: Python and C++ competitors can often work around a segment tree with a sortedlist/ordered_set one-liner, while someone using Java is coding it from scratch or resorting to a "hackpack".
Any dynamic programming algorithm where the n-th item depends on a contiguous k of them via some nice binary operation (like +) can benefit from storing the state in a segment tree: when you go to calculate the n-th item, you can get the combined value in O(log n) instead of O(k) (although prefix sums work fine there too if + is involved).
I also know of examples where a lazy segment tree is used.
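As a made-up instance of that: f(0) = 1 and f(n) = f(n-1) + ... + f(n-k), computed with a point-update/range-sum segment tree so each step costs O(log n) rather than O(k).

    /// Iterative point-update / range-sum segment tree over n slots.
    struct SegTree { n: usize, t: Vec<u64> }

    impl SegTree {
        fn new(n: usize) -> Self { SegTree { n, t: vec![0; 2 * n] } }

        fn set(&mut self, mut i: usize, val: u64) {
            i += self.n;
            self.t[i] = val;
            while i > 1 { i /= 2; self.t[i] = self.t[2 * i] + self.t[2 * i + 1]; }
        }

        /// Sum over the half-open range [l, r).
        fn sum(&self, l: usize, r: usize) -> u64 {
            let (mut l, mut r, mut s) = (l + self.n, r + self.n, 0u64);
            while l < r {
                if l % 2 == 1 { s += self.t[l]; l += 1; }
                if r % 2 == 1 { r -= 1; s += self.t[r]; }
                l /= 2; r /= 2;
            }
            s
        }
    }

    fn main() {
        let (n, k) = (30usize, 5usize);
        let mut st = SegTree::new(n);
        st.set(0, 1); // f(0) = 1
        for i in 1..n {
            let v = st.sum(i.saturating_sub(k), i); // O(log n) instead of O(k)
            st.set(i, v);
        }
        println!("f({}) = {}", n - 1, st.sum(n - 1, n));
    }

(For plain + the prefix-sum trick is simpler, as noted; the segment tree pays off when the operation is min/max or something else without an inverse.)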
Segment trees can also be used to decode the k-th lexicographic permutation of a sequence of elements in O(n log n) instead of O(n^2). For example, to decode 1 2 4 3 5 out of the sorted sequence 1 2 3 4 5:
1. get the 0th remaining element twice (1, then 2)
2. get the 1st remaining element (4)
3. get the 0th remaining element twice (3, then 5)
The data you're maintaining in the segment tree lets you get elements out of the 1 2 3 4 5 sequence in O(log n): the structure can tell you which element is currently the k-th, and removing an element is also O(log n), after which it can answer the next query.
If you hold the data in a normal array, you get a quadratic algorithm.
Of course, you can use a different tree that allows you to get the k-th element of a sorted sequence, but segment trees are easy to code in a competitive programming setting.
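Here's a rough Rust sketch of that decode, using the index sequence from the example above (0, 0, 1, 0, 0): a segment tree of counts over the sorted elements, where finding and removing the i-th remaining element is one O(log n) descent.

    /// Segment tree of counts; leaf j holds 1 while element j is available.
    struct KthTree { size: usize, t: Vec<u32> }

    impl KthTree {
        fn new(n: usize) -> Self {
            let size = n.next_power_of_two();
            let mut t = vec![0u32; 2 * size];
            for j in 0..n { t[size + j] = 1; }
            for j in (1..size).rev() { t[j] = t[2 * j] + t[2 * j + 1]; }
            KthTree { size, t }
        }

        /// Find the i-th (0-based) remaining element's index and remove it.
        fn take(&mut self, mut i: u32) -> usize {
            let mut node = 1;
            while node < self.size {
                if self.t[2 * node] > i {
                    node = 2 * node; // answer is in the left subtree
                } else {
                    i -= self.t[2 * node]; // skip the left subtree's survivors
                    node = 2 * node + 1;
                }
            }
            let idx = node - self.size;
            loop { // decrement counts on the path back to the root
                self.t[node] -= 1;
                if node == 1 { break; }
                node /= 2;
            }
            idx
        }
    }

    fn main() {
        let elems = [1, 2, 3, 4, 5];
        let mut tree = KthTree::new(elems.len());
        for &i in &[0u32, 0, 1, 0, 0] {
            print!("{} ", elems[tree.take(i)]); // prints: 1 2 4 3 5
        }
        println!();
    }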
One interesting thing is that using reservoir sampling "Algorithm L"[0], you can actually get an unbiased sample with overhead proportional to the logarithm of CPU time. That would require more state and work in the server to aggregate, though. On the other hand, moving overhead from the process to the aggregation server is useful if you want to run profiling in production.
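For reference, a compact Rust sketch of Algorithm L (with a toy inline xorshift so it's self-contained; real code would use a proper RNG). The useful property is that it draws O(k log(n/k)) random numbers and skips ahead geometrically, instead of drawing one per item:

    /// Toy xorshift RNG returning floats strictly inside (0, 1).
    struct Rng(u64);
    impl Rng {
        fn next_f64(&mut self) -> f64 {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            ((self.0 >> 11) as f64 + 0.5) / (1u64 << 53) as f64
        }
    }

    /// Algorithm L: uniform reservoir sample of k items from a stream.
    fn sample_l<T>(mut items: impl Iterator<Item = T>, k: usize, rng: &mut Rng) -> Vec<T> {
        let mut reservoir: Vec<T> = items.by_ref().take(k).collect();
        if reservoir.len() < k { return reservoir; }
        let mut w = (rng.next_f64().ln() / k as f64).exp();
        loop {
            // geometric skip: number of items to discard before the next swap
            let skip = (rng.next_f64().ln() / (1.0 - w).ln()).floor() as usize;
            match items.nth(skip) {
                Some(item) => {
                    let slot = (rng.next_f64() * k as f64) as usize;
                    reservoir[slot] = item;
                    w *= (rng.next_f64().ln() / k as f64).exp();
                }
                None => return reservoir,
            }
        }
    }

    fn main() {
        let mut rng = Rng(0x9E3779B97F4A7C15);
        println!("{:?}", sample_l(0..1_000_000u32, 10, &mut rng));
    }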
Looks really interesting! How do you avoid data getting all jumbled together though? Like say I’m profiling a web service with 20 endpoints. If I’m profiling locally, I just call the one endpoint I’m interested in, so it’s a “clean”, focused profile. But if I’m profiling against the same service running in prod, with all 20 endpoints getting called all the time, do they all get mixed together? Would there be a good way to separate them out?
Right now they are all mixed together. Based on our experience and talking to early users, for some types of applications this is not too big of a deal. You can always zoom in on a particular endpoint and dive deeper if you want to. And in some cases this is actually a great way to get a high level overview of where your app spends most of the time.
In other cases this does become a problem. For example we've seen it with some large rails apps with a lot of controllers — the flamegraph just becomes too big and hard to read.
This is something we're planning to address soon. The storage layer already has support for tags, and we plan to make it so that you'd be able to split the profiling data by endpoint, e.g. something like
app{endpoint=foo}
app{endpoint=bar}
and then make queries either for the whole app or individual endpoints.
Java Flight Recorder is also meant to be enabled in production and, according to Oracle, typically has a very small footprint (on the order of 1-2% overhead).
It even integrates with IntelliJ[1] so you can have your fancy charts as you run your tests or debug your application.