
I just gave some examples of what can be done at similar speed on dirt cheap hardware.

In my view, it helps to understand scale.

A more real-world example: an API at my work routinely logs more than 20M events per day, all indexed for NRT search. It’s just a byproduct of the actual work, though.

It’s considered business as usual; nobody counts how it adds up to over half a billion events per month (20M × 30 days ≈ 600M).

I assume they index it in some search engine as well. Yeah, the scale is 20x bigger, but it’s their _primary_ task.




> A more real-world example: an API at my work routinely logs more than 20M events per day, all indexed for NRT search. It’s just a byproduct of the actual work, though.

But do your events require you to process gigabytes of debug symbol data per event? Or churn through megabytes of JSON sourcemap data to unobfuscate stack traces, where some operations are not even possible in linear time?

Don't forget a Sentry event is not a simple data blob we store. It requires very expensive and complex processing.


Our events are a mess of text that is processed with a convoluted parser to build several cross-reference indexes. At least it allows us to piece together events from the same request, etc. But the point is that this is at best 1% of all the work being done, and it’s not the key problem solved by the system.

Parsing a stack frame would actually be in the same ballpark. Yes, processing it would involve maybe ~100 index lookups to match stack frames to the source, etc., but see below.

Gigabytes of debug symbols? As in, fitting into a tenth of the RAM of a typical server, you mean? As a hash table, or heck, even a trie so you could do range queries as well.

Same for a source map. It’s not THAT hard to build a largish lookup table entirely in RAM. Especially if it’s the MAIN POINT of your business.
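
To sketch the kind of thing I mean (a minimal, purely hypothetical layout in Rust; nothing here is anyone’s actual code): an ordered map keyed by function start address answers both point and range lookups straight from RAM, much like the trie would:

    use std::collections::BTreeMap;

    // Hypothetical in-RAM symbol table: start address -> function name.
    struct ReleaseSymbols {
        by_start_addr: BTreeMap<u64, String>,
    }

    impl ReleaseSymbols {
        fn symbolicate(&self, addr: u64) -> Option<&str> {
            // nearest function whose start address is <= addr
            self.by_start_addr
                .range(..=addr)
                .next_back()
                .map(|(_, name)| name.as_str())
        }
    }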

I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Also, you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to that client’s shard, which I’d dare say won’t contain gigabytes of debug symbols.


> Gigabytes of debug symbols? As in, fitting into a tenth of the RAM of a typical server, you mean?

I think this comment shows quite well why I like working on Sentry event processing so much :)

You are not wrong that "gigabytes of debug symbols" fit into RAM just fine. This is also what most developer tool authors seem to think, and again, they are not wrong. If you take a modern iOS app, it will have enormous amounts of data in its DWARF files. The same goes for JavaScript webapps, which are starting to accumulate larger and larger sourcemaps. All existing tools, to the best of our knowledge, assume that your debugging happens on a beefy desktop: you debug one or two apps at a time, you can take all the time in the world to preload the debug info files, and off you go.

However, we're different. We're not handling a single app; we're handling tens of thousands of customers and their apps and websites. We can't just keep all the debug info files resident at all times; there are too many. Likewise, the events we get constantly are from vastly different versions. I don't have numbers right now, but I'd guess we store terabytes of debug data we use for processing. Shuffling this in and out and processing it quickly and efficiently required a non-trivial amount of engineering to get working as efficiently as it does now.
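
To give a rough idea of the shape of that problem (a simplified sketch, not our actual code; `load_from_store` and the capacity handling are stand-ins): debug files get pulled in on demand and evicted again once they go cold:

    use std::collections::HashMap;

    // Sketch: an LRU-ish cache of debug info files keyed by debug id.
    struct DebugFileCache {
        capacity: usize,
        tick: u64,
        entries: HashMap<String, (u64, Vec<u8>)>, // id -> (last use, bytes)
    }

    impl DebugFileCache {
        fn get_or_load(&mut self, debug_id: &str) -> &[u8] {
            self.tick += 1;
            if !self.entries.contains_key(debug_id) {
                if self.entries.len() >= self.capacity {
                    // evict the least recently used file
                    if let Some(lru) = self
                        .entries
                        .iter()
                        .min_by_key(|(_, (t, _))| *t)
                        .map(|(id, _)| id.clone())
                    {
                        self.entries.remove(&lru);
                    }
                }
                let bytes = load_from_store(debug_id);
                self.entries.insert(debug_id.to_owned(), (0, bytes));
            }
            let entry = self.entries.get_mut(debug_id).unwrap();
            entry.0 = self.tick; // mark as recently used
            &entry.1
        }
    }

    fn load_from_store(_debug_id: &str) -> Vec<u8> {
        Vec::new() // placeholder: would fetch from blob storage
    }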

> Same for a source map. It’s not THAT hard to build a largish lookup table entirely in RAM. Especially if it’s the MAIN POINT of your business.

Again, you are absolutely correct. However, we use each loaded sourcemap for only a fraction of a second. So it turns out that efficiently re-parsing it every time, combined with a centralized per-frame cache, is actually more efficient than trying to keep entire sourcemaps cached in the workers. There are a lot of subtleties you only discover when you get a lot of non-uniform event data from real customers.
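
A sketch of that trade-off (hypothetical names, not our actual code): the cache holds resolved frames, and the parsed sourcemap itself is thrown away immediately:

    use std::collections::HashMap;

    // (sourcemap url, line, column) -> resolved original location
    type FrameKey = (String, u32, u32);

    struct FrameCache {
        resolved: HashMap<FrameKey, String>,
    }

    impl FrameCache {
        fn resolve(&mut self, key: FrameKey) -> &str {
            self.resolved.entry(key.clone()).or_insert_with(|| {
                // miss: re-parse the sourcemap just for this frame and
                // drop it again; only the resolved frame is kept around
                parse_and_lookup(&key.0, key.1, key.2)
            })
        }
    }

    fn parse_and_lookup(_url: &str, _line: u32, _col: u32) -> String {
        String::new() // placeholder for the actual sourcemap lookup
    }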

> I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Sourcemaps are quite misdesigned. The only thing they can really do is token-to-token mapping. What they cannot tell you is the name or scope of a variable if you need to look it up in reverse. This is, for instance, an issue when you get tracebacks with minified function names. The approach we're using here is to parse the minified JavaScript source backwards from the error location in WTF-16 units, then compare that to the token we see in the unminified source and do a reverse lookup. This requires really fast reverse token lexing and seeking around in source data that is WTF-16 indexed but UTF-8 encoded. (The alternative would be a custom file format supporting scope-based token mapping built from a fully parsed JavaScript AST, which we can't do for a variety of reasons, most of which come down to the fact that we need to fetch external sourcemaps, and building such a custom lookup takes longer than doing a bounded token search.)
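
To give a flavor of that index juggling (a simplified sketch, not the code linked below): sourcemap columns count UTF-16 code units, while the source we hold in memory is UTF-8 bytes:

    // Map a UTF-16 column to a byte offset into a UTF-8 line.
    // Columns landing inside a surrogate pair snap to the next char.
    fn utf16_col_to_byte_offset(line: &str, utf16_col: usize) -> Option<usize> {
        let mut units = 0;
        for (byte_idx, ch) in line.char_indices() {
            if units >= utf16_col {
                return Some(byte_idx);
            }
            units += ch.len_utf16();
        }
        if units == utf16_col { Some(line.len()) } else { None }
    }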

It's not a tremendous amount of work to implement (if you want to see how we do it, it lives here: https://github.com/getsentry/rust-sourcemap/blob/master/src/...), but it means that for each frame where we need to inspect an original function name (which is typically about 50% of frames) we might step back an indefinite number of minified tokens until we find a minified token prefixed with 'function'. We currently cap this search at 128 tokens and hope we find it. This has been okay enough so far, but there are definitely long functions where this is not enough stepping.
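
In sketch form (heavily simplified; the real lexer is in the repo above, and `Token` here is a stand-in):

    struct Token<'a> {
        text: &'a str, // slice of the minified source
    }

    // Walk backwards from the error location looking for an identifier
    // that directly follows a `function` keyword, capped at 128 steps.
    fn find_function_name<'a>(tokens: &[Token<'a>], error_idx: usize) -> Option<&'a str> {
        const MAX_STEPS: usize = 128; // cap the search and hope for the best
        let start = error_idx.saturating_sub(MAX_STEPS);
        for idx in (start..=error_idx).rev() {
            if idx > 0 && tokens[idx - 1].text == "function" {
                return Some(tokens[idx].text);
            }
        }
        None // a long function body: the cap was not enough stepping
    }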

> Also, you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to that client’s shard, which I’d dare say won’t contain gigabytes of debug symbols.

Ops can attest to this better than I can, but we're not huge fans of depending too much on pinning customers to single workers. It makes the entire system much harder to understand, requires running a much higher number of workers than we do, and balances badly. We quite like the fact that we can scale pretty linearly with incoming traffic, independently of who causes the events to come in.


> The approach we're using here is to parse the minified JavaScript source backwards from the error location in WTF-16 units, then compare that to the token we see in the unminified source and do a reverse lookup

What if, instead of that, you actually compiled all JS files once to some sensible format that would allow you to look up a variable in any scope?

Then you’d only need the token-to-token map to find the proper location in your precompiled data.
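
Roughly what I have in mind (entirely hypothetical, just to illustrate): per file, a precompiled scope table that answers "what was this minified name at this offset?":

    use std::collections::HashMap;

    // One lexical scope in the minified file, with its renames.
    struct Scope {
        start: u32,
        end: u32, // byte span of the scope in the minified source
        renames: HashMap<String, String>, // minified -> original name
    }

    struct ScopeIndex {
        scopes: Vec<Scope>, // sorted outermost-first
    }

    impl ScopeIndex {
        fn original_name(&self, offset: u32, minified: &str) -> Option<&str> {
            self.scopes
                .iter()
                .filter(|s| s.start <= offset && offset < s.end)
                .rev() // innermost enclosing scope wins
                .find_map(|s| s.renames.get(minified).map(String::as_str))
        }
    }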

> However, we're different. We're not handling a single app; we're handling tens of thousands of customers and their apps and websites. We can't just keep all the debug info files resident at all times; there are too many. Likewise, the events we get constantly are from vastly different versions. I don't have numbers right now, but I'd guess we store terabytes of debug data we use for processing.

So how much of it do you have? All in all, precomputing a space-efficient index and storing it in some memcached-style solution should deal with that. I bet you don’t need the full debug info, as in complete DWARF data.

> Ops can attest to this better than I can, but we're not huge fans of depending too much on pinning customers to single workers.

Shards can be rebalanced. Also, memcached or some DHT can be used to hold the shards. This way workers stay uniform but lookups are still fast.
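
Something like this (a trivial stand-in; a real DHT would use consistent hashing so a rebalance only moves a few shards):

    // Workers stay uniform; a small ring maps a customer's debug-data
    // shard to the cache node that holds it.
    fn shard_node<'a>(customer_id: u64, nodes: &'a [String]) -> &'a str {
        let idx = (customer_id as usize) % nodes.len();
        nodes[idx].as_str()
    }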


> What if, instead of that, you actually compiled all JS files once to some sensible format that would allow you to look up a variable in any scope?

An enormous amount of the JavaScript and sourcemaps we deal with are one-hit wonders and/or need to be fetched from external sources. We have no influence on the format of the data there, and as mentioned, converting it once into a different format does not help at all here.

> I bet you don’t need full debug info as in complete DWARF data.

Correct, which is why, as part of making iOS symbol processing faster, we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF file and gives us all the information we want without parsing or complex processing (https://github.com/getsentry/symbolic/tree/master/symcache).
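
To illustrate why mmap-ability matters (this is not the actual SymCache layout, just the general idea): fixed-width records sorted by address mean the mapped bytes are the data structure, and a lookup is a binary search with no parse step:

    // Illustrative record layout: fixed width, sorted by address.
    #[repr(C)]
    struct Record {
        addr: u64,        // start address of the function
        name_offset: u32, // offset into a trailing string table
        name_len: u32,
    }

    // Binary search directly over the mmap'd records; nothing to parse.
    fn lookup(records: &[Record], addr: u64) -> Option<&Record> {
        let idx = records.partition_point(|r| r.addr <= addr).checked_sub(1)?;
        Some(&records[idx])
    }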

The reason I'm writing this is that, as you can see, we're very far away from "just handling boring events". There is a ton of complexity that goes into it. We're not idiots here ;)

> Shards can be rebalanced.

That changes nothing about the fact that unless we have to introduce some form of affinity, we are better off writing faster code that does not depend on it.


> Correct, which is why, as part of making iOS symbol processing faster, we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF file and gives us all the information we want without parsing or complex processing

Exactly, cool that you do it.

> An enormous amount of the JavaScript and sourcemaps we deal with are one-hit wonders and/or need to be fetched from external sources.

Hm, so you don’t know what code your customers deploy? At least for JS, that’s what you seem to imply.

Anyhow, pardon me for being rude in my earlier posts.

You’re doing exciting things, and I’d love to discuss them on some other medium. How can I reach you?


> Anyhow, pardon me for being rude in my earlier posts.

No worries, no harm done.

> How can I reach you?

armin@sentry.io or @mitsuhiko on twitter should generally work :)



