perf does not give me the complete call stack; it's a sampling profiler.

In effectively all latency-sensitive contexts, sampling is worthless. 99.999999% of the time the program is waiting on I/O, and then for a handful of microseconds there's a flurry of activity. That activity is the only part I care about, and perf will almost always miss it and never record it to completion.

I need to know the exact chain of events that leads to an object cache miss causing an allocation, or the exact conditions that led to a slow-path branch, or which request handler is consistently forcing buffer resizes, etc.

I never need a profiler to tell me "memory allocation is slow" (which is what perf will give me). I know memory allocation is slow; I need to know why we're allocating memory.