Hate these X millions / billions per month or year or a centuary. Let’s do the m...

stavros · on Nov 8, 2017

> Hate these X millions / billions per month or year or a centuary.

I hear you, I hate misleading examples too.

> Dang an Nginx can serve a web page faster then that on a single modern notebook.

"What? Your moon-rocket can only fit three people? My couch can do six. I am not impressed."

DmitryOlshansky · on Nov 8, 2017

I just gave some examples of what can be done at similar speed on dirt cheap hardware.

I my view it helps to understand scale.

More real world example an API at my work logs routinely more then 20M+ events per day, all indexed for NRT search. It’s just a byproduct of the actual work though.

It’s considered business as usual, nobody counts how it adds up half a billion per month.

I assume they index it in some search engine as well. Yeah the scale is x20 bigger, but it’s their _primary_ task.

the_mitsuhiko · on Nov 8, 2017

> More real world example an API at my work logs routinely more then 20M+ events per day, all indexed for NRT search. It’s just a byproduct of the actual work though.

But do your events require you to process gigabytes of debug symbol data per event? Or through megabytes of JSON sourcemap data to unobfuscate stacktrace where some operations are not even possible in linear time?

Don't forget a Sentry event is not a simple data blob we store. It requires very expensive and complex processing.

DmitryOlshansky · on Nov 9, 2017

Our events are a mess of text that is processed with some convoluted parser to build several cross-reference indexes. At least it allows to piece together events from the same request, etc. But the point is that it’s at best 1% of all the work being done, and it’s not the key problem solved by the system.

Parsing a stack frame would be actually in the same ballpark. Yes, processing it would involve maybe a ~100 index lookups to match stack frames to the source etc, but see below.

Gigabytes of debug symbols? Such as fitting all in a 1/10 of RAM of a typical server you mean? As a hash table, or heck even a trie so you could do range query as well.

Same about a source map. It’s not THAT hard to build a largish lookup table entirely in RAM. Especially if its the MAIN POINT of your business.

I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Also you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to a shard of that client which I’d dare say won’t contain gigabytes of debug symbols.

the_mitsuhiko · on Nov 9, 2017

> Gigabytes of debug symbols? Such as fitting all in a 1/10 of RAM of a typical server you mean?

I think this comments shows quite well why I like working on Sentry event processing so much :)

You are not wrong in that "gigabytes of debug symbols" fit into RAM just fine. This is also most developer tool authors seem to think and again, they are not wrong. If you take a modern iOS app it will have enormous amounts of data in the DWARF files. Same goes with javascript webapps which are starting to accumulate larger and larger sourcemaps. All existing tools to the best of our knowledge assume that your debugging happens on a beefy desktop: you debug one or two apps at the time, you can take all the time in the world to preload the debug info files and off you go.

However we're different. We're not handling a single app, we're handling tens of thousands of customers and their apps and websites. We can't just have all the debug info files resident at all times, they are too many. Likewise the events we get constantly are from vastly different versions. I don't have numbers right now but I guess we store terrabytes of debug data we use for processing. Shuffling this in and out and processing it quickly and efficiently required a non trivial amount of engineering to get working as efficiently as it does now.

> Same about a source map. It’s not THAT hard to build a largish lookup table entirely in RAM. Especially if its the MAIN POINT of your business.

Again, you are absolutely correct. However we used this loaded sourcemap for a fraction of a second. So turns out efficiently re-parsing it all the time with a centralized per-frame cache is actually more efficient than trying to keep the entire sourcemap in cache instead in the workers. There are a lot of subtleties you only discover when you get a lot of non uniform event data from real customers.

> I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Sourcemaps are quite misdesigned. The only thing they can really do is token to token mapping. What they cannot tell you is the name or scope of variables if you need to reverse look them up. This is for instance an issue if you get tracebacks from minified function names. The approach we're using here is parsing the minified JavaScript source from the error location backwards in WTF-16 units, then compare that to the token we see in the unminified source and doing a reverse lookup. This requires (unless you build a custom file format which supports some sort of scope based token mapping based on a fully parsed JavaScript AST which we can't do for a variety of reasons, most of which are related to the fact that we need to fetch external sourcemaps and the time for building this custom lookup takes longer than doing a bounded token search) really fast reverse token lexing and seeking around in WTF-16 indexed but UTF-8 encoded source data.

It's not tremendous amounts of work to implement (if you want to see how we do it, it lives here: https://github.com/getsentry/rust-sourcemap/blob/master/src/...) but it means that per frame where we need to inspect an original function name (which is typically about 50% of frames) we might step an indefinite amount of minified tokens back until we find "minified token prefixed with 'function'". We currently cap this search at 128 tokens and hope we find it. This seems to have been okay enough for now, but there are definitely long functions where this is not enough stepping.

> Also you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to a shard of that client which I’d dare say won’t contain gigabytes of debug symbols.

Ops can better attest to that but we're not huge fans of depending too much on pinning customers to single workers. It makes the entire system much harder to understand, requires running a much higher number of workers than we do and balances badly. We quite like the fact that we can scale pretty linearly with incoming traffic independent of who causes the events to come in.

DmitryOlshansky · on Nov 9, 2017

> The approach we're using here is parsing the minified JavaScript source from the error location backwards in WTF-16 units, then compare that to the token we see in the unminified source and doing a reverse lookup

What if instead of that you’d actually compile all JS files once to some sensible format that would allow you to do lookup of variable in any scope?

Then you only need token to token map to find the proper location in your precimpiled data.

> However we're different. We're not handling a single app, we're handling tens of thousands of customers and their apps and websites. We can't just have all the debug info files resident at all times, they are too many. Likewise the events we get constantly are from vastly different versions. I don't have numbers right now but I guess we store terrabytes of debug data we use for processing.

So how much do you have of it? All in all, precomputing space-efficient index and storing it in some memcached-style solution should deal with that. I bet you don’t need full debug info as in complete DWARF data.

> Ops can better attest to that but we're not huge fans of depending too much on pinning customers to single workers.

Shards can be rebalanced. Also a memcached or whatever DHTs can used to keep shards. This way workers stay uniform but lookups are still fast.

the_mitsuhiko · on Nov 9, 2017

> What if instead of that you’d actually compile all JS files once to some sensible format that would allow you to do lookup of variable in any scope?

An enormous amount of javascript and sourcemaps we are dealing with are one hit wonders and/or need fetching from external sources. We have no impact on the format of the data there and as mentioned converting it once into a different format does not at all help here.

> I bet you don’t need full debug info as in complete DWARF data.

Correct, which is why as part of making processing of iOS symbols faster we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF, gives us all the information we want without parsing or complex processing (https://github.com/getsentry/symbolic/tree/master/symcache).

The reason I'm writing this is because as you can see we're very far away from "just handling boring events". There is a ton of complexity that goes into it. We're not idiots here ;)

> Shards can be rebalanced.

That changes nothing about the fact, that unless we have to introduce any form of affinity we are better off writing faster code that does not depend on it.

DmitryOlshansky · on Nov 9, 2017

> Correct, which is why as part of making processing of iOS symbols faster we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF, gives us all the information we want without parsing or complex processing

Exactly, cool that you do it.

> enormous amount of javascript and sourcemaps we are dealing with are one hit wonders and/or need fetching from external sources.

Hm, so you don’t know what code your customers deploy? At least in JS you seem to imply that.

Anyhow pardon me for beeing rude in my early posts.

It’s exciting things you do and I’d love discuss on some other medium. How can I reach you?

the_mitsuhiko · on Nov 9, 2017

> Anyhow pardon me for beeing rude in my early posts.

No worries, no harm done.

> How can I reach you?

armin@sentry.io or @mitsuhiko on twitter should generally work :)

mattrobenolt · on Nov 8, 2017

What kind of couch do you have that can fit 6 people? Looking for a new couch.

stavros · on Nov 9, 2017

Sectional, yes. I recommend it, it's very comforable.

throwaway9981 · on Nov 9, 2017

Pull out

throwaway9980 · on Nov 8, 2017

Sectional.

shallot_router · on Nov 8, 2017

Serving a static web page 8-50k times per second and doing some really complex processing 8-50k times per second are worlds apart.

aidos · on Nov 8, 2017

I am impressed. Sentry is a brilliant bit of code available for everyone to download and use. It just works and gets better all the time.

And really, in terms of load - at that scale it's all pretty impressive. The payloads are big. They contain the full stack trace along with most the local vars.

zeeg · on Nov 8, 2017

We actually do a lot more than 20B (not sure what the number is actually pulled from), but it's also important to note that even 8000 events/sec isn't free.

- 40k per event (average, it gets into 2mb territory for some)

- streaming processing of everything on the fly

- storage/querying of data

DmitryOlshansky · on Nov 8, 2017

I hear you, at the same time you have embarassingly parallel problem that can be easily sharded per customer. (Do I understand your service correctly?)

Processing 8k * 40k is a feat but I’d think you have more then a few dozens of nodes.

Then it gets into ~100s ev/s per node territory. Quite manageable even for e.g. complex search engines to ingress.

Of course, it all depends on where your bottlenecks are.

the_mitsuhiko · on Nov 8, 2017

> I hear you, at the same time you have embarassingly parallel problem that can be easily sharded per customer. (Do I understand your service correctly?)

The actual event ingestion from an external consumer to our queue is indeed very boring. What makes Sentry internally complex is how incredibly complicated those events can get. For instance before we can make sense of most iOS (as wlel as other native) or JavaScript events we need to go through a fairly complex event processing system. A lot of thought went into making this process very efficient.

Also obviously even if a problem like ours can be easily solved in parallel, you don't just want to add more servers you also want to run a reasonably lean operation. Instead of just throwing more machines at the issue we want to make sure we internally do not do completely crazy things but optimize our processes.

le-mark · on Nov 8, 2017

This is an interesting line of discussion. How do you know you are optimally utilizing the hardware you do have? For example, is each machine running at near 100% cpu? Is the network card saturated? Is the database the bottleneck? How about the search indexing operation? What metrics are you tracking and how are you tracking them? If you're at liberty to discuss, it would be highly enlightening!

the_mitsuhiko · on Nov 8, 2017

> How do you know you are optimally utilizing the hardware you do have?

That's probably one better to answer for actual ops people. However what we generally do is looking at our metrics to see where we see room for improvement. Two obvious cases we identified and resolved through improving our approaches have been JavaScript sourcemap processing and debug symbol handling.

There we knew we were slow based on how far p95 timing was from the average and median. We looked specifically at what problematic event types looked like and what we can do to handle them better. We now use just better approaches to dealing with this data, multiple layers of more intelligent caches and that cut down out event processing time and with it the amount of actual hardware we need.

So the answer at least as far as event processing goes: for some cases we spent time on we were CPU bound and knew we can do better and did. I wrote a bit about what we did on our blog where we moved sourcemap processing from Python to Rust. The latests big wins in the processing department have been a rewrite of our dSYM handling where we now use a custom library to preprocess DWARF files and various symbol tables into a more compact format so we can do much faster lookups for precisely the data we are about.

> Is the network card saturated?

In other cases that has been an issue, but that's less of a problem on the actual workers which are relevant to that number we are discussing here. In particular as far as the load balancers goes there is a ton more data coming in but not all will make it to the system. A non uncommon case is someone deploys our code into a mobile app SDK and then stops paying. The apps will still spam us, but we want to block this off before it hits the expensive parts of the system etc.

> Is the database the bottleneck?

When is a database not a concern ;)

What we are tracking is a ton, how we are tracking is a lot of datadog.

DmitryOlshansky · on Nov 8, 2017

I’d be curious to see how hardware stacks up against events/second end2end as well.

In fact that would be the single most interesting point.

At least a ballpark of say around 20 machines with recent Xeons with ~ X gb ram to process say those 8k/sec.

P.S. In fact calculation were fairly lax and it should closer to 10k/s.

the_mitsuhiko · on Nov 8, 2017

> I’d be curious to see how hardware stacks up against events/second end2end as well.

That heavily depends on which part of the stack you are looking at. (Event ingestion, processing, load balancers, databases etc.). There is a lot less hardware there than one would imagine ;)

DmitryOlshansky · on Nov 9, 2017

Let’s cut down the slack and focus on processing itself.

It wouldn’t hurt to know about the databases if they are heavily used as part of processing

the_mitsuhiko · on Nov 9, 2017

All event processing (currently) takes place on three 32 core Xeon machines. But it has seen much worse days when we had many more before we optimized critical paths. Likewise what we have now is pretty chill, but we need to also consider the case where a worker drops entirely from the pool so these are over-provisioned for the amount of work they each take.

The data layer is probably more interesting from a scaling point of view in general since it's harder to scale, but it's also not exactly a point we're particularly happy with. There is already a comment from david (zeeg) on this post going into our plans there.

tmd83 · on Nov 10, 2017

96 core to process 10K complex event/s sounds quite impressive given the kind of analysis you seem to have to do. Did sentry used to have all the stack trace, local variable data originally or that kind of extensive processing a more recent development? I am more used to Java where the basic stack trace is provided by the VM and I assume processing those are much simpler than some of the user kind of events?

DmitryOlshansky · on Nov 9, 2017

> All event processing (currently) takes place on three 32 core Xeon machines. But it has seen much worse days when we had many more before we optimized critical paths.

Finally I could understand what kind of efficiency you guys have. Not bad at all I’d say though not exactly “high load miracle” the article title would imply.

the_mitsuhiko · on Nov 9, 2017

I'm not sure what the article how the article implies a "high load miracle". But we are quite proud about the efficiency gains and some of our solutions for the problems we encounter I think.

DmitryOlshansky · on Nov 9, 2017

The problem is the title. You guys are doing great, actually!

ubercore · on Nov 8, 2017

I mean, serving a web page is easy. Processing incoming events and everything else that Sentry does is not trivial. You're not comparing likes to likes.

DmitryOlshansky · on Nov 9, 2017

Yes, but they don’t run the whole thing on a single laptop do they?

Roughly speaking you take what a laptop can do and a multiply it by 1000x to estimate what a cloud service could easily do. (It’s just an estimate, the reality is not working like that, I KNOW)

Do not take it as Nginx and webpages example, it’s indeed a poor analogy.

I can’t remember the right number for e.g. ElasticSearch fuzzy search across 10m documents of out hand so I’d pick the simpler less relevant number.

Actually fuzzy full text search across 10m documents is at least in the range of 100s/sec on a laptop. Meaning that they have resources to do at least e.g. 100k-500k/s full text searches across gigabytes of symbols.

I bet their queries are simpler then that though (not fuzzy nor full text). It’s debug symbols after all not a natural text.

notyourday · on Nov 8, 2017

If that's the case why NH littered with these kinds of issues?

https://news.ycombinator.com/item?id=15584315

ubercore · on Nov 8, 2017

I mean serving a web page, from nginx, which is what the person was talking about. Not a database-driven web application.

ktamura · on Nov 8, 2017

At Treasure Data, we ingest about 1-2 million events per second. Using your #, this translates to ~6 trillion events per month.

If you (or anyone on this thread) is interested in learning more, please feel free to ping me.

Kiro · on Nov 8, 2017

How much is a lot for you? 1 billion/s?

DmitryOlshansky · on Nov 8, 2017

At 8k/s given a bunch of hardware a typical cloud service will throw at it it can be down to ~100ev/s per node.

Yes I’d like at least 1k/s per node. That’s something interesting.