Coming from a Rust game development background, something that's noticeable to me is that sampling-based profiling solves a different problem than the kinds of questions games need answered.
With sampling you can say "I saw function() in 100/150 samples (10ms interval between samples), so spend your time looking at that!".
That doesn't really work for games, where your work is on the scale of 0.1ms and you might call the same function hundreds of times in 16ms. Calling update_object() 1000 times a frame isn't actually a problem when it's only taking 5% of your overall frame budget (on average). Games also have the constraint that profiling mustn't throw off the game speed itself; it needs to run very fast.
With games, what you really want to know are:
1. Which frames are taking over 16ms? (On the current machine; what's fast enough on one computer isn't on weaker machines. Sometimes you want to look at every frame, and not just the slow ones.)
2. What functions are causing them to go over?
3. Are those functions _usually_ that slow? How fast do they usually run? Was this just an uncommon edge case? What do the mean/median/90th-percentile numbers look like?
With typical applications, you instead want to know:
* Of the many diverging branches and tasks my application can do, which areas occur the most often, and are therefore worth spending time optimizing?
Sampling-based profiling obviously works better for applications like this, while non-sampling profiling works better for games. Sadly, there don't seem to be very many GUIs/APIs for working with the latter, especially with Linux compatibility. I'm hoping to remedy this, as soon as I get better at GTK.
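To make that concrete, here's a minimal sketch (in Rust, with made-up names) of the kind of frame-scoped instrumentation I mean: wrap the hot calls, flag frames that blow the 16ms budget, and keep the per-frame timings around so you can ask for mean/median/90th percentile afterwards.

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    /// Per-function timings, bucketed by frame (all names illustrative).
    #[derive(Default)]
    struct FrameProfiler {
        current: HashMap<&'static str, Duration>,
        frames: Vec<HashMap<&'static str, Duration>>,
    }

    impl FrameProfiler {
        /// Time one call and attribute it to the current frame.
        fn scope<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
            let start = Instant::now();
            let out = f();
            *self.current.entry(name).or_default() += start.elapsed();
            out
        }

        /// Close out the current frame; flag it if it blew the 16ms budget.
        fn end_frame(&mut self) {
            let total: Duration = self.current.values().sum();
            if total > Duration::from_millis(16) {
                println!("slow frame #{}: {:?}", self.frames.len(), self.current);
            }
            self.frames.push(std::mem::take(&mut self.current));
        }

        /// Mean/median/90th percentile for one function across all frames
        /// (assumes the function ran in at least one frame).
        fn stats(&self, name: &'static str) -> (Duration, Duration, Duration) {
            let mut ds: Vec<Duration> =
                self.frames.iter().filter_map(|f| f.get(name).copied()).collect();
            ds.sort();
            let mean = ds.iter().sum::<Duration>() / ds.len() as u32;
            (mean, ds[ds.len() / 2], ds[ds.len() * 9 / 10])
        }
    }

The point is that the data stays bucketed by frame, which is exactly the structure a whole-program sampling profile throws away.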
EDIT: An example of why sampling-based profilers (in this case, perf) aren't great for games. Although sandbox::sandbox::Sandbox::temperature_update takes up a decent amount of overall time, it's hard to tell whether that matters per frame or not.
https://profiler.firefox.com/public/c4y22fect9qg9mfm0wv0pjwe...
> That doesn't really work for games, where your work is on the scale of 0.1ms and you might call the same function hundreds of times in 16ms. Calling update_object() 1000 times a frame isn't actually a problem when it's only taking 5% of your overall frame budget (on average). Games also have the constraint that profiling mustn't throw off the game speed itself; it needs to run very fast.
I'm confused, because the thing sampling measures is how much of your frame budget goes to each function. And sampling works equally well regardless of how fast your functions are.
But I do agree that looking for variance and focusing on frames that take too long are important.
Say you sample every 10ms. This frame, and every frame for the sake of example, looks like this:
* f1() - 1ms
* f2() - 8ms
* f3() - 4ms
* wait_for_next_frame() - 3ms
After 10ms, you'll sample and see f3() is running. Now you seem to think, oh, I should optimize f3(). But f2() is clearly the main cost of the frame. You can totally increase the frequency, so now you sample every 0.1ms or something, and that's probably fine. But you probably want to just instrument your code and get more accurate results at that point.
My main point is that, out of the box, typical sampling profilers:
A: Don't sample fast enough.
B: Don't display the data split up by frame.
That's why you take thousands of samples. Most of them will hit f2 if you do it right.
If your sampler's timer is running perfectly in sync with your framerate, that's a problem, but it's not a "not enough samples" problem. Fix it by taking samples every 10.05-10.07 milliseconds instead.
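To check this, here's a tiny simulation in Rust (made up, not a real profiler) of the frame above: f1 occupies [0,1)ms, f2 [1,9)ms, f3 [9,13)ms, and wait_for_next_frame [13,16)ms.

    // Count which function each sample lands in, for a repeating 16ms frame.
    fn simulate(period_us: u64, n_samples: u64) -> [u64; 4] {
        let mut counts = [0u64; 4];
        for i in 1..=n_samples {
            let t = (i * period_us) % 16_000; // offset within the frame, in us
            let idx = match t {
                0..=999 => 0,        // f1
                1_000..=8_999 => 1,  // f2
                9_000..=12_999 => 2, // f3
                _ => 3,              // wait_for_next_frame
            };
            counts[idx] += 1;
        }
        counts
    }

    fn main() {
        println!("16.00ms: {:?}", simulate(16_000, 100_000));
        println!("10.00ms: {:?}", simulate(10_000, 100_000));
        println!("10.05ms: {:?}", simulate(10_050, 100_000));
    }

At a period of exactly 16ms every sample lands at the same frame offset (here, all of them hit f1), which is the in-sync pathology. At 10ms the coarse offset grid still distorts the small buckets (f1 is overcounted), while at 10.05ms the counts converge to the true 1/16, 8/16, 4/16, 3/16 shares; either way f2 dominates, as expected.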
Have you tried a flameshot graph? It'll give you a nicer visualization of the relative amount of CPU something took. Sampling profilers are but one thing in the toolbox. On Windows you can use WPA for extremely detailed tracing, and it can link CPU work to GPU work (the tooling can be awkward, as the GPU side hasn't seen love in a while). Google is tackling this problem for Android via Perfetto, although the tooling isn't as mature, and the GPU stuff will probably stay Android-only unless Nvidia/AMD extend it for the desktop (seeing as Stadia is being deprioritized).
I can't find any references to flameshot graphs; do you mean flamegraphs? They have the same issue: they tend to show the whole program, not per frame.
Perfetto does look interesting though, thank you for the reference.
For interval-oriented data that you want to query at different levels of detail with a query window, there's a trick I used in a commercial profiler which you can easily bolt on top of any ordered search tree or database system.
You have a separate tree for each power-of-two scale. The tree corresponding to scale k contains intervals with length between 2^k and 2^(k+1). An interval [u, v] is put in the tree corresponding to scale floor(lg(v - u)) under the key u. If you have a query window [a, b] you turn it into a range query a - 2^(k+1) <= u <= b for each relevant scale k (the lower bound has to cover the longest interval stored at that scale) and filter matches by interval overlap between [u, v] and [a, b]. "Each relevant scale" is defined by the desired level of detail, e.g. you don't query scales which are too small relative to the current zoom level.
Each scale actually has two trees. One type is what I just described. The other type is a summary tree which contains bounding intervals with recursive summaries of all intervals at finer scales. When you cut off queries at scale k you also include data from the summary tree at scale k so that all data in the query window is returned, either verbatim or as part of the bounding intervals with summaries.
An efficient, convenient way of implementing this is to stream the time series data to disk in separate append-only files, one per scale. You can start doing interval queries immediately with a reasonable slowdown by directly doing binary search on the raw scale-segregated files (if the records are variable length you can use a COBS-style encoding so you can sync to the nearest record from a random point in the file). That said, the write amplification for a high-fanout B+ tree index is tiny here and building it in a streaming fashion is basically free, so generating the index synchronously is my recommendation. You can also generate the summary files in a streaming fashion, but since they involve significant write amplification it's worth decoupling that task to a process with lower IO priority.
To implement the interval query for [a, b] at a given scale you use the B+ tree index to find the starting position in the append-only record file for that scale using the a - 2^(k+1) lower bound, and just do a linear scan in the file until you exceed b.
There's a bunch of variations and optimizations (e.g. coarser scales than powers of two) but that's the gist of it.
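For illustration, here's a minimal in-memory sketch of the basic (non-summary) part in Rust, using BTreeMap as the ordered search tree; the on-disk files, B+ tree index, and summary trees are left out.

    use std::collections::BTreeMap;

    /// Intervals bucketed by power-of-two length scale; each bucket is an
    /// ordered map from start time u to the end times v of intervals at u.
    #[derive(Default)]
    struct ScaleStore {
        scales: BTreeMap<u32, BTreeMap<u64, Vec<u64>>>,
    }

    impl ScaleStore {
        fn insert(&mut self, u: u64, v: u64) {
            let k = 63 - (v - u).leading_zeros(); // floor(lg(v - u)), v > u
            self.scales.entry(k).or_default().entry(u).or_default().push(v);
        }

        /// All intervals overlapping [a, b] at scale >= min_scale (the
        /// level-of-detail cutoff for the current zoom).
        fn query(&self, a: u64, b: u64, min_scale: u32) -> Vec<(u64, u64)> {
            let mut out = Vec::new();
            for (&k, tree) in self.scales.range(min_scale..) {
                // Intervals at scale k are shorter than 2^(k+1), so anything
                // overlapping [a, b] must start at or after a - 2^(k+1).
                let lo = a.saturating_sub(1 << (k + 1));
                for (&u, ends) in tree.range(lo..=b) {
                    for &v in ends {
                        if v >= a { // filter by actual overlap
                            out.push((u, v));
                        }
                    }
                }
            }
            out
        }
    }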
We were thinking we could integrate Pyroscope with CI tools, so that you would be able to see how much time is spent in different tests. This seems to be an area where a lot of people are looking for quick ways to cut time.
But yeah, having lines highlighted in VS Code would be even cooler.
The Jest VSCode extension can run the tests and highlight the relevant uncovered lines and annotate failed/successful tests. Maybe helpful to look into their implementation.
Competitive programming problems are often designed with the intended solution technique in mind first, so whatever the trendy thing to do is finds its way into the competitions. Segment trees used to be popular, but less so now, because they're unfair to some languages: Python and C++ competitors can often work around a segment tree with a sortedlist/ordered_set one-liner, while someone using Java is coding it from scratch or resorting to a "hackpack".
Any dynamic programming algorithm where the n-th item depends on a contiguous k of them via some nice binary operation (like +) can benefit from storing the state in a segment tree: when you go to calculate the n-th item, you can get the combined value in O(log n) instead of O(k) (although prefix sums work fine there too if + is involved).
I also know of examples where a lazy segment tree is used.
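As a made-up instance of that: f(0) = 1 and f(n) = f(n-1) + ... + f(n-k), computed with a point-update/range-sum segment tree so each step costs O(log n) rather than O(k).

    /// Iterative point-update / range-sum segment tree over n slots.
    struct SegTree { n: usize, t: Vec<u64> }

    impl SegTree {
        fn new(n: usize) -> Self { SegTree { n, t: vec![0; 2 * n] } }

        fn set(&mut self, mut i: usize, val: u64) {
            i += self.n;
            self.t[i] = val;
            while i > 1 { i /= 2; self.t[i] = self.t[2 * i] + self.t[2 * i + 1]; }
        }

        /// Sum over the half-open range [l, r).
        fn sum(&self, l: usize, r: usize) -> u64 {
            let (mut l, mut r, mut s) = (l + self.n, r + self.n, 0u64);
            while l < r {
                if l % 2 == 1 { s += self.t[l]; l += 1; }
                if r % 2 == 1 { r -= 1; s += self.t[r]; }
                l /= 2; r /= 2;
            }
            s
        }
    }

    fn main() {
        let (n, k) = (30usize, 5usize);
        let mut st = SegTree::new(n);
        st.set(0, 1); // f(0) = 1
        for i in 1..n {
            let v = st.sum(i.saturating_sub(k), i); // O(log n) instead of O(k)
            st.set(i, v);
        }
        println!("f({}) = {}", n - 1, st.sum(n - 1, n));
    }

(For plain + the prefix-sum trick is simpler, as noted; the segment tree pays off when the operation is min/max or something else without an inverse.)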
Segment trees can also be used to decode the k-th lexicographic permutation of a sequence of elements in O(n log n) instead of O(n^2). For example, to decode 1 2 4 3 5 out of the sorted sequence 1 2 3 4 5:
1. get the 0th remaining element twice (1, then 2)
2. get the 1st remaining element (4)
3. get the 0th remaining element twice (3, then 5)
The data you're maintaining in the segment tree lets you get elements out of the 1 2 3 4 5 sequence in O(log n): the structure can tell you which element is currently the k-th, and removing an element is also O(log n), after which it can answer the next query.
If you hold the data in a normal array, you get a quadratic algorithm.
Of course, you can use a different tree that allows you to get the k-th element of a sorted sequence, but segment trees are easy to code in a competitive programming setting.
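Here's a rough Rust sketch of that decode, using the index sequence from the example above (0, 0, 1, 0, 0): a segment tree of counts over the sorted elements, where finding and removing the i-th remaining element is one O(log n) descent.

    /// Segment tree of counts; leaf j holds 1 while element j is available.
    struct KthTree { size: usize, t: Vec<u32> }

    impl KthTree {
        fn new(n: usize) -> Self {
            let size = n.next_power_of_two();
            let mut t = vec![0u32; 2 * size];
            for j in 0..n { t[size + j] = 1; }
            for j in (1..size).rev() { t[j] = t[2 * j] + t[2 * j + 1]; }
            KthTree { size, t }
        }

        /// Find the i-th (0-based) remaining element's index and remove it.
        fn take(&mut self, mut i: u32) -> usize {
            let mut node = 1;
            while node < self.size {
                if self.t[2 * node] > i {
                    node = 2 * node; // answer is in the left subtree
                } else {
                    i -= self.t[2 * node]; // skip the left subtree's survivors
                    node = 2 * node + 1;
                }
            }
            let idx = node - self.size;
            loop { // decrement counts on the path back to the root
                self.t[node] -= 1;
                if node == 1 { break; }
                node /= 2;
            }
            idx
        }
    }

    fn main() {
        let elems = [1, 2, 3, 4, 5];
        let mut tree = KthTree::new(elems.len());
        for &i in &[0u32, 0, 1, 0, 0] {
            print!("{} ", elems[tree.take(i)]); // prints: 1 2 4 3 5
        }
        println!();
    }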
One interesting thing is that using reservoir sampling "Algorithm L"[0], you can actually get an unbiased sample with overhead proportional to the logarithm of CPU time. That would require more state and work in the server to aggregate, though. On the other hand, moving overhead from the process to the aggregation server is useful if you want to run profiling in production.
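For reference, a compact Rust sketch of Algorithm L (with a toy inline xorshift so it's self-contained; real code would use a proper RNG). The useful property is that it draws O(k log(n/k)) random numbers and skips ahead geometrically, instead of drawing one per item:

    /// Toy xorshift RNG returning floats strictly inside (0, 1).
    struct Rng(u64);
    impl Rng {
        fn next_f64(&mut self) -> f64 {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            ((self.0 >> 11) as f64 + 0.5) / (1u64 << 53) as f64
        }
    }

    /// Algorithm L: uniform reservoir sample of k items from a stream.
    fn sample_l<T>(mut items: impl Iterator<Item = T>, k: usize, rng: &mut Rng) -> Vec<T> {
        let mut reservoir: Vec<T> = items.by_ref().take(k).collect();
        if reservoir.len() < k { return reservoir; }
        let mut w = (rng.next_f64().ln() / k as f64).exp();
        loop {
            // geometric skip: number of items to discard before the next swap
            let skip = (rng.next_f64().ln() / (1.0 - w).ln()).floor() as usize;
            match items.nth(skip) {
                Some(item) => {
                    let slot = (rng.next_f64() * k as f64) as usize;
                    reservoir[slot] = item;
                    w *= (rng.next_f64().ln() / k as f64).exp();
                }
                None => return reservoir,
            }
        }
    }

    fn main() {
        let mut rng = Rng(0x9E3779B97F4A7C15);
        println!("{:?}", sample_l(0..1_000_000u32, 10, &mut rng));
    }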
Looks really interesting! How do you avoid data getting all jumbled together though? Like say I’m profiling a web service with 20 endpoints. If I’m profiling locally, I just call the one endpoint I’m interested in, so it’s a “clean”, focused profile. But if I’m profiling against the same service running in prod, with all 20 endpoints getting called all the time, do they all get mixed together? Would there be a good way to separate them out?
Right now they are all mixed together. Based on our experience and talking to early users, for some types of applications this is not too big of a deal. You can always zoom in on a particular endpoint and dive deeper if you want to. And in some cases this is actually a great way to get a high level overview of where your app spends most of the time.
In other cases this does become a problem. For example we've seen it with some large rails apps with a lot of controllers — the flamegraph just becomes too big and hard to read.
This is something we're planning to address soon. The storage layer already has support for tags, and we plan to make it so that you'd be able to split the profiling data by endpoint, e.g. something like
app{endpoint=foo}
app{endpoint=bar}
and then make queries either for the whole app or individual endpoints.
Java Flight Recorder is also meant to be enabled in production and, according to Oracle, typically has a very small footprint (on the order of 1-2% overhead).
It even integrates with IntelliJ[1] so you can have your fancy charts as you run your tests or debug your application.