How Sentry Receives 20B Events per Month While Preparing to Handle Twice That (stackshare.io)
164 points by ryangoldman on Nov 8, 2017 | 87 comments



20B/month is interesting in the sense that Sentry should only be getting a transaction when there's an error in some third party application.

Makes you wonder if a couple of their clients paid for this error monitoring service, then quit watching it. That's a lot of errors.


Suppose you track all client-side errors. Some portion of your users run a blocker that blocks an analytics tool you use, and every hit generates an error. You could disable that error if you wanted, but why not keep it and use the graph in Sentry to track how often it's happening?
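
A minimal sketch of that pattern (assumptions: the raven-js-era `Raven.captureMessage` API, a `window.ga` check as the stand-in analytics probe, and a made-up `trackBlockedAnalytics` helper):

    // Hypothetical helper: report (sampled) when the analytics script appears to
    // be blocked, so the Sentry graph shows how often it happens.
    declare const Raven: { captureMessage(msg: string, options?: object): void };

    function trackBlockedAnalytics(sampleRate = 0.1): void {
      window.addEventListener("load", () => {
        const analyticsLoaded = typeof (window as any).ga === "function";
        if (!analyticsLoaded && Math.random() < sampleRate) {
          Raven.captureMessage("analytics script blocked", {
            level: "info",
            tags: { category: "blocked-analytics" },
          });
        }
      });
    }

    trackBlockedAnalytics();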


One really big thing I learned from doing this on a public website: there are a ton of ISPs, browser extensions, antivirus and malware which inject really horrible JavaScript into every page. If you run Sentry as a front-end error collector you will get a ton of bizarre-seeming errors for other peoples’ code and, if you're really lucky, find that some of them used the same variable or function names as your code.


You can set it up so that Sentry only tracks errors on a subset of pages. One approach I've seen is to have 100% tracking on your canary deployments or for a few minutes/hours after a deployment, and then use the tried and true `if (rand() < .2) { sentry.log(...); }` approach to limit the errors you see. It does mean that you may not receive certain critical errors, but if you are trying to stay under their billing tier thresholds or just trying to rate limit yourself, it's a reasonable trade-off (especially if you are logging 100% on Canaries or for a period after rollouts). You can also customize the error rate based on page so that, for example, you get 100% of errors on checkout but only 10% of errors on the homepage.
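
A slightly fleshed-out sketch of that idea (the route table, the rates, and the `captureError` wrapper are illustrative assumptions; the real SDKs also expose built-in sampling options):

    // Hypothetical per-page sampling gate in front of whatever SDK call you use:
    // 100% on checkout, 10% on the homepage, 20% everywhere else.
    const SAMPLE_RATES: Array<[RegExp, number]> = [
      [/^\/checkout/, 1.0],
      [/^\/$/, 0.1],
    ];
    const DEFAULT_RATE = 0.2;

    function sampleRateFor(path: string): number {
      const match = SAMPLE_RATES.find(([pattern]) => pattern.test(path));
      return match ? match[1] : DEFAULT_RATE;
    }

    // `sentryLog` stands in for your Raven/Sentry capture call.
    function captureError(err: Error, sentryLog: (e: Error) => void): void {
      if (Math.random() < sampleRateFor(window.location.pathname)) {
        sentryLog(err);
      }
    }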


I was self-hosting so it was a question of seeing traffic rather than quotas. I'm loathe to sample errors since it often seems like the outliers are the most interesting but at some point it's expensive to avoid.


Is there a good way of preventing unauthorized javascript injections? Some type of onReady cleanup script that can identify what is supposed to be there and then strip out everything else?


HTTPS


Sadly, not completely. A surprisingly non-zero number of users have ISPs that insist on using their own root cert to inject this nonsense.


Or just local antivirus. There are a shocking number of user-level firewalls that break the web really badly.


Why would a standard browser trust that certificate?


It's installed by the ISP's setup program, which they tell everyone is mandatory (otherwise they won't get the adware kickbacks) and which the techs are discouraged from skipping.


How often does that even happen? Many people just get a router (with modem) in the mail which they simply connect their devices to. No software is installed.


100% of Comcast and Verizon installs in my experience. RCN offered but wasn't pushy and they seem to have stopped.


facepalm


Not really. That stops an ISP, but local malware, antivirus, etc. can come in as browser extensions.


You can configure Sentry's JavaScript client library to only collect errors that originate from your script files (even rejecting inlined code).

I wrote more about this and other techniques for battling noise here: https://blog.sentry.io/2017/03/27/tips-for-reducing-javascri...
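
For anyone wondering what that looks like in practice, here's a rough sketch of the raven-js-generation configuration (the option names `whitelistUrls`/`ignoreUrls`, the DSN, and the hostnames are assumptions from memory of that SDK era, so verify against the linked post):

    // Sketch: only accept errors whose frames come from our own bundles, and
    // drop scripts injected by extensions, ISPs, antivirus, etc.
    declare const Raven: {
      config(dsn: string, options?: object): { install(): void };
    };

    Raven.config("https://examplePublicKey@sentry.example.com/1", {
      // Only report errors originating from scripts served off our own domain.
      whitelistUrls: [/https:\/\/static\.example\.com\/js\//],
      // Ignore common injected sources (browser extensions and the like).
      ignoreUrls: [/extensions\//i, /^chrome:\/\//i],
    }).install();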


Yep — thank you for the hard work on that over the years. For me HTTPS + CSP did a good job of denoising the traffic, but it was still memorable just how broken the web environment is, especially internationally (not just the Great Firewall, either – I should have saved what looked like a tracker being injected into requests from an Iranian university).


I believe sentry has support for reversible minification mapping. You could use that along with a PRNG (or maybe just magic number-based) minification to drive 'hard to conflict' js delivery. Unfortunately, I bet any script delivered that way looks a lot like malware. ;_;


Ah, okay. Just seems wasteful to send the whole stack trace, local vars, etc, for that. But, yes, I can see people doing what you're describing.


If you deploy some non-edge-case bug, you can easily generate thousands of events before you even receive the notification - assuming your service is at all popular, but even if it's not, it can be e.g. some stuck background job.

Btw, that's on average 7.5k events per second. Not that drastic, but of course peak, not average, is what matters.


disclaimer: Armin from Sentry here

I'm not super happy that this number is in the title because I personally find it fairly irrelevant to our operations (I'm not even sure the number is correct; it sounds too small). What's more challenging for us is the complexity of individual events and how bursty they are. A lot of interesting engineering went into that.


Yeah, we use Sentry, and quite often generate an insane amount of errors on occasion (sorry). Management not prioritising fixes for low-priority errors also means we have quite a high error count even then. That 20B does sound quite low based on just our usage of it.


I get the idea, but I would think the right use of the product is to fix the root causes, or suppress errors that aren't actionable. I was basing this on something on the site that mentioned roughly 200 companies using it. Perhaps it's grown. Would be interesting to know the total per month after deduplication.


Not sure where you'd see 200 companies (we've had more than that for most of Sentry's lifetime). Not every team maps to a company, but by napkin math I'd say we have 10k - 20k companies using our cloud service. That's also excluding all of the open source installations, which is a pretty large number.


Just a few weeks ago, I hooked up Sentry to our co.'s client-side app, which is a widget/plugin of sorts that's installed on at least a few hundred thousand sites.

Errors on even a tiny fraction of our users can be significant -- we immediately hit the free tier cap. 20B/month does not surprise me at all, especially considering some users are going to be overzealous in their event reporting.


If you have that kind of traffic, we recommend enabling sampling. All client libraries should support sampling out of the box.
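
A hedged sketch of what that looks like client-side; `sampleRate` is the option name from the raven-js generation, so double-check against your SDK's docs (an equivalent manual gate via `shouldSendCallback` works too):

    // SDK-level sampling: keep roughly 10% of events before they are sent.
    declare const Raven: {
      config(dsn: string, options?: object): { install(): void };
    };

    Raven.config("https://examplePublicKey@sentry.example.com/1", {
      sampleRate: 0.1, // drop ~90% of events client-side
    }).install();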


I'm sure that happens, but 20 billion a month is around 7700 requests per second, for presumably quite a few companies. It seems high, but not exaggeratedly so.


We started using paid Sentry recently (previously were on an old self-hosted version). It's mostly been great, but boy is search clunky. Very difficult to find anything that doesn't automatically appear on the first page of recent events.


It's definitely on our short list of problems to solve, but unfortunately the solution takes a bit more than traditional feature development.


This statement has me very curious. Are you able to elaborate at all on what you mean? Like is this more of an underpinning architecture issue that can't simply be addressed by scheduling dev time?


The storage mechanism isn't fast/flexible enough to do anything really great. It's fairly easy for us to do precise matches (with some caveats), but given everything is cached at the database level we can't easily compute results based on your search query.

Effectively what we do is:

SELECT * from groups where ID IN ( SELECT group_id FROM index WHERE key = :searchKey AND value =:searchValue UNION [...] )

(This is a simplification but it should give you an idea of how the queries are built)

Because of the way the model works, it's very hard to do certain things like exclusion queries, and more importantly all of the results you're seeing are still cached. The biggest pain point here is if you're searching for e.g. "ConnectionError environment:production", you really don't want to see anything related to non-production. We're solving that problem immediately, but it's just the tip of the iceberg.

Next year we're kicking off a large project to overhaul the key infrastructure which powers the stream/search functionality, with some pretty big ambitions.

Other services in the space generally use something like Elasticsearch, which can provide some of this out of the box. We've always been built on SQL/Redis, and given that Elastic has its own set of problems, we've decided it's likely best for us to move to a columnar store format that doesn't cache results (e.g. counts), but rather computes them in real time, much like Scuba.


I'm super happy to read that you plan on improving search!

I really like Sentry. The UI is pretty good, but still a bit counter-intuitive or limited in some ways. For example, when looking at an issue, clicking a release in the "last seen" section on the right takes me to the page dedicated to that release, but clicking the release tag on the event (at the top of the main section) links to the global search, and from there clicking an issue won't show me the events for that release. There is also no easy way to see the events for an issue that match only a specific release (you have to do the exact search yourself).

I also have trouble with the release management and the regression detection, but that's because we have multiple production branches in flight at the same time. I guess we're a bit outside the expected use case here :)


Sentry 6.x experimented with throwing various attributes of the event object into ES. Have you ruled out using ES in the future?

FWIW, I'm still running a version of 6.x where I put the tags for every event into Splunk, of all things. The Splunk events then link back to the corresponding Sentry event. It's slow and clunky, but it gives the Sentry users at $dayjob a much better search interface and the ability to slice and dice on tags.

(Note, I'm not suggesting you use Splunk in your backend!)


I think we're pretty sure about not using Elastic. We've toyed with it more than once and we're not confident we'd be able to solve our problems with it.


You want a job yet? ;)


I appreciate you taking the time to share. Thanks.


Bit late to the party here, but any chance of a mobile app similar to New Relic's in the near future? Cheers


That's great to hear. It's an awesome product, thanks for all the hard work!


seriously though, that gif of Matt is excellent


yeah, 9MiB for a gif is excellent


have you heard about Lazy Image Load?


Why would you want to lazy load it?


I had the pleasure of hanging out with the Sentry folks in SF for a couple days last week. Honestly a top-notch group.

It was eye-opening to see what is possible when you have an organization like theirs where you have a lot of talent and pride in your product across the board.


We're using it and it 'magically' detects errors in CI that no human would have detected.


How does this compare to other competitors in the RUM space, including open source tools?


I'm not sure what rum is (other than an alcoholic beverage) but as far as I know Sentry is the only actively maintained Open Source tool in its space (which is crash reporting). Might be wrong on that though.



Is 7,700 per second really that impressive?


too bad the dashboard is slow as molasses


Could you reach out to support with your account details? There's a cost to the data volume, but generally queries should respond very quickly.


tl;dr we are hiring


I hate these X millions / billions per month or year or century.

Let’s do the math.

So it's 20B / 30 ≈ 670M / day

670M / 24 ≈ 28M / hour

28M / 3,600 sec ≈ 7,700-8,000 events/sec

Now peaks have got to be larger than these measly ~8,000 events/sec.

Even so, say the peak is 50k/second.

Dang, Nginx can serve web pages faster than that on a single modern notebook. The CPU won't even saturate doing it.

Or if you were to write them to a consumer-grade spinning disk at 100 MB/s, you'd get 50k/s if each event is 2 KB in size.

As usual I’m not impressed.


> I hate these X millions / billions per month or year or century.

I hear you, I hate misleading examples too.

> Dang, Nginx can serve web pages faster than that on a single modern notebook.

"What? Your moon-rocket can only fit three people? My couch can do six. I am not impressed."


I just gave some examples of what can be done at similar speed on dirt cheap hardware.

In my view it helps to understand the scale.

A more real-world example: an API at my work routinely logs more than 20M events per day, all indexed for NRT search. It's just a byproduct of the actual work, though.

It's considered business as usual; nobody counts how it adds up to half a billion per month.

I assume they index it in some search engine as well. Yeah, the scale is 20x bigger, but it's their _primary_ task.


> A more real-world example: an API at my work routinely logs more than 20M events per day, all indexed for NRT search. It's just a byproduct of the actual work, though.

But do your events require you to process gigabytes of debug symbol data per event? Or churn through megabytes of JSON sourcemap data to unobfuscate stack traces, where some operations are not even possible in linear time?

Don't forget a Sentry event is not a simple data blob we store. It requires very expensive and complex processing.


Our events are a mess of text that is processed with some convoluted parser to build several cross-reference indexes. At least it allows us to piece together events from the same request, etc. But the point is that it's at best 1% of all the work being done, and it's not the key problem solved by the system.

Parsing a stack frame would actually be in the same ballpark. Yes, processing it would involve maybe ~100 index lookups to match stack frames to the source etc., but see below.

Gigabytes of debug symbols? As in, fitting entirely in a tenth of the RAM of a typical server, you mean? As a hash table, or heck, even a trie so you could do range queries as well.

Same about a source map. It's not THAT hard to build a largish lookup table entirely in RAM. Especially if it's the MAIN POINT of your business.

I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Also you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to a shard of that client which I’d dare say won’t contain gigabytes of debug symbols.


> Gigabytes of debug symbols? As in, fitting entirely in a tenth of the RAM of a typical server, you mean?

I think this comment shows quite well why I like working on Sentry event processing so much :)

You are not wrong in that "gigabytes of debug symbols" fit into RAM just fine. This is also how most developer tool authors seem to think, and again, they are not wrong. If you take a modern iOS app it will have enormous amounts of data in its DWARF files. Same goes for JavaScript web apps, which are starting to accumulate larger and larger sourcemaps. All existing tools, to the best of our knowledge, assume that your debugging happens on a beefy desktop: you debug one or two apps at a time, you can take all the time in the world to preload the debug info files, and off you go.

However, we're different. We're not handling a single app, we're handling tens of thousands of customers and their apps and websites. We can't just have all the debug info files resident at all times; there are too many of them. Likewise, the events we constantly get are from vastly different versions. I don't have numbers right now, but I'd guess we store terabytes of debug data that we use for processing. Shuffling this in and out and processing it quickly required a non-trivial amount of engineering to get working as efficiently as it does now.

> Same about a source map. It's not THAT hard to build a largish lookup table entirely in RAM. Especially if it's the MAIN POINT of your business.

Again, you are absolutely correct. However, we only use a loaded sourcemap for a fraction of a second. So it turns out that efficiently re-parsing it every time, with a centralized per-frame cache, is actually more efficient than trying to keep entire sourcemaps cached in the workers. There are a lot of subtleties you only discover when you get a lot of non-uniform event data from real customers.

> I’m curious about that non-linear problem with JSON source map. Could you elaborate?

Sourcemaps are quite misdesigned. The only thing they can really do is token-to-token mapping. What they cannot tell you is the name or scope of a variable if you need to reverse look it up. This is for instance an issue if you get tracebacks with minified function names. The approach we're using here is parsing the minified JavaScript source backwards from the error location in WTF-16 units, then comparing that to the token we see in the unminified source and doing a reverse lookup. Unless you build a custom file format which supports some sort of scope-based token mapping based on a fully parsed JavaScript AST (which we can't do for a variety of reasons, most of which are related to the fact that we need to fetch external sourcemaps, and building that custom lookup takes longer than doing a bounded token search), this requires really fast reverse token lexing and seeking around in WTF-16-indexed but UTF-8-encoded source data.

It's not a tremendous amount of work to implement (if you want to see how we do it, it lives here: https://github.com/getsentry/rust-sourcemap/blob/master/src/...), but it means that for each frame where we need to inspect an original function name (which is typically about 50% of frames) we might step an indefinite number of minified tokens back until we find a minified token prefixed with 'function'. We currently cap this search at 128 tokens and hope we find it. This seems to have been okay enough for now, but there are definitely long functions where that is not enough stepping.
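
To make the shape of that concrete, here is a much-simplified sketch of the bounded backward search (naive regex tokenization, plain character offsets instead of WTF-16 units, and a hypothetical function name; the real implementation is the linked Rust code):

    // Walk backwards from the error location over at most 128 minified tokens,
    // looking for `function <name>`; the minified name found here is what then
    // gets reverse-mapped through the sourcemap to the original name.
    const TOKEN_RE = /[A-Za-z0-9_$]+|[^\sA-Za-z0-9_$]/g;
    const IDENT_RE = /^[A-Za-z_$][A-Za-z0-9_$]*$/;
    const MAX_BACKWARD_TOKENS = 128;

    function findMinifiedFunctionName(
      minifiedSource: string,
      errorOffset: number
    ): string | null {
      const tokens = minifiedSource.slice(0, errorOffset).match(TOKEN_RE) ?? [];
      const stopAt = Math.max(0, tokens.length - MAX_BACKWARD_TOKENS);
      for (let i = tokens.length - 2; i >= stopAt; i--) {
        if (tokens[i] === "function" && IDENT_RE.test(tokens[i + 1])) {
          return tokens[i + 1]; // e.g. "t" in `function t(e, n) { ... }`
        }
      }
      return null; // capped out without finding a named declaration
    }

The real code has to lex properly and count in WTF-16 code units, because that is what sourcemap columns refer to; that is where much of the fiddliness comes from.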

> Also you don’t need to cross-reference symbols/events across your customers. The lookup is thus limited to a shard of that client which I’d dare say won’t contain gigabytes of debug symbols.

Ops can better attest to that but we're not huge fans of depending too much on pinning customers to single workers. It makes the entire system much harder to understand, requires running a much higher number of workers than we do, and balances badly. We quite like the fact that we can scale pretty linearly with incoming traffic, independently of who causes the events to come in.


> The approach we're using here is parsing the minified JavaScript source backwards from the error location in WTF-16 units, then comparing that to the token we see in the unminified source and doing a reverse lookup

What if instead of that you'd actually compile all JS files once to some sensible format that would allow you to look up a variable in any scope?

Then you only need the token-to-token map to find the proper location in your precompiled data.

> However, we're different. We're not handling a single app, we're handling tens of thousands of customers and their apps and websites. We can't just have all the debug info files resident at all times; there are too many of them. Likewise, the events we constantly get are from vastly different versions. I don't have numbers right now, but I'd guess we store terabytes of debug data that we use for processing.

So how much of it do you have? All in all, precomputing a space-efficient index and storing it in some memcached-style solution should deal with that. I bet you don't need full debug info as in complete DWARF data.

> Ops can better attest to that but we're not huge fans of depending too much on pinning customers to single workers.

Shards can be rebalanced. Also memcached or whatever DHT can be used to keep the shards. This way workers stay uniform but lookups are still fast.


> What if instead of that you'd actually compile all JS files once to some sensible format that would allow you to look up a variable in any scope?

An enormous amount of the JavaScript and sourcemaps we are dealing with are one-hit wonders and/or need fetching from external sources. We have no influence on the format of the data there and, as mentioned, converting it once into a different format does not help at all here.

> I bet you don’t need full debug info as in complete DWARF data.

Correct, which is why as part of making processing of iOS symbols faster we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF, gives us all the information we want without parsing or complex processing (https://github.com/getsentry/symbolic/tree/master/symcache).

The reason I'm writing this is because as you can see we're very far away from "just handling boring events". There is a ton of complexity that goes into it. We're not idiots here ;)

> Shards can be rebalanced.

That changes nothing about the fact that, unless we have to introduce some form of affinity, we are better off writing faster code that does not depend on it.


> Correct, which is why as part of making processing of iOS symbols faster we wrote our own file format that we can efficiently mmap. It's 10% of the size of the average DWARF, gives us all the information we want without parsing or complex processing

Exactly, cool that you do it.

> An enormous amount of the JavaScript and sourcemaps we are dealing with are one-hit wonders and/or need fetching from external sources.

Hm, so you don’t know what code your customers deploy? At least in JS you seem to imply that.

Anyhow, pardon me for being rude in my earlier posts.

The things you're doing are exciting and I'd love to discuss them on some other medium. How can I reach you?


> Anyhow, pardon me for being rude in my earlier posts.

No worries, no harm done.

> How can I reach you?

armin@sentry.io or @mitsuhiko on twitter should generally work :)


What kind of couch do you have that can fit 6 people? Looking for a new couch.


Sectional, yes. I recommend it, it's very comfortable.


Pull out


Sectional.


Serving a static web page 8-50k times per second and doing some really complex processing 8-50k times per second are worlds apart.


I am impressed. Sentry is a brilliant bit of code available for everyone to download and use. It just works and gets better all the time.

And really, in terms of load - at that scale it's all pretty impressive. The payloads are big. They contain the full stack trace along with most of the local vars.


We actually do a lot more than 20B (not sure where that number is actually pulled from), but it's also important to note that even 8000 events/sec isn't free.

- 40 KB per event (average; it gets into 2 MB territory for some)

- streaming processing of everything on the fly

- storage/querying of data


I hear you; at the same time you have an embarrassingly parallel problem that can be easily sharded per customer. (Do I understand your service correctly?)

Processing 8k * 40 KB is a feat, but I'd think you have more than a few dozen nodes.

Then it gets into ~100s of events/s per node territory. Quite manageable even for, e.g., complex search engines to ingest.

Of course, it all depends on where your bottlenecks are.


> I hear you; at the same time you have an embarrassingly parallel problem that can be easily sharded per customer. (Do I understand your service correctly?)

The actual event ingestion from an external consumer to our queue is indeed very boring. What makes Sentry internally complex is how incredibly complicated those events can get. For instance, before we can make sense of most iOS (as well as other native) or JavaScript events, they need to go through a fairly complex event processing system. A lot of thought went into making this process very efficient.

Also, obviously, even if a problem like ours can be easily solved in parallel, you don't just want to add more servers; you also want to run a reasonably lean operation. Instead of just throwing more machines at the issue, we want to make sure we internally do not do completely crazy things, but optimize our processes.


This is an interesting line of discussion. How do you know you are optimally utilizing the hardware you do have? For example, is each machine running at near 100% cpu? Is the network card saturated? Is the database the bottleneck? How about the search indexing operation? What metrics are you tracking and how are you tracking them? If you're at liberty to discuss, it would be highly enlightening!


> How do you know you are optimally utilizing the hardware you do have?

That's probably one better answered by actual ops people. However, what we generally do is look at our metrics to see where there's room for improvement. Two obvious cases we identified and resolved by improving our approaches have been JavaScript sourcemap processing and debug symbol handling.

There we knew we were slow based on how far the p95 timing was from the average and median. We looked specifically at what problematic event types looked like and what we could do to handle them better. We now use better approaches to dealing with this data and multiple layers of more intelligent caches, and that cut down our event processing time and, with it, the amount of actual hardware we need.

So the answer, at least as far as event processing goes: for some cases we spent time on, we were CPU bound, knew we could do better, and did. I wrote a bit on our blog about how we moved sourcemap processing from Python to Rust. The latest big wins in the processing department have been a rewrite of our dSYM handling, where we now use a custom library to preprocess DWARF files and various symbol tables into a more compact format so we can do much faster lookups for precisely the data we care about.

> Is the network card saturated?

In other cases that has been an issue, but it's less of a problem on the actual workers, which are what's relevant to the number we are discussing here. In particular, as far as the load balancers go, there is a ton more data coming in, but not all of it will make it into the system. A not uncommon case is someone deploying our code into a mobile app SDK and then stopping paying. The apps will still spam us, but we want to block this off before it hits the expensive parts of the system.

> Is the database the bottleneck?

When is a database not a concern ;)

What we are tracking is a ton; how we are tracking it is a lot of Datadog.


I’d be curious to see how hardware stacks up against events/second end2end as well.

In fact that would be the single most interesting point.

At least a ballpark: say, around 20 machines with recent Xeons and ~X GB of RAM to process, say, those 8k/sec.

P.S. In fact the calculations were fairly lax and it should be closer to 10k/s.


> I’d be curious to see how hardware stacks up against events/second end2end as well.

That heavily depends on which part of the stack you are looking at. (Event ingestion, processing, load balancers, databases etc.). There is a lot less hardware there than one would imagine ;)


Let’s cut down the slack and focus on processing itself.

It wouldn't hurt to know about the databases, if they are heavily used as part of processing.


All event processing (currently) takes place on three 32-core Xeon machines. It has seen much worse days, when we had many more machines before we optimized critical paths. Likewise, what we have now is pretty chill, but we also need to consider the case where a worker drops entirely from the pool, so these are over-provisioned for the amount of work they each take.

The data layer is probably more interesting from a scaling point of view in general since it's harder to scale, but it's also not exactly a point we're particularly happy with. There is already a comment from david (zeeg) on this post going into our plans there.


96 cores to process 10K complex events/s sounds quite impressive given the kind of analysis you seem to have to do. Did Sentry always capture all of the stack trace and local variable data, or is that kind of extensive processing a more recent development? I'm more used to Java, where the basic stack trace is provided by the VM, and I assume processing those is much simpler than some of the other kinds of events?


> All event processing (currently) takes place on three 32-core Xeon machines. It has seen much worse days, when we had many more machines before we optimized critical paths.

Finally I can understand what kind of efficiency you guys have. Not bad at all, I'd say, though not exactly the "high load miracle" the article title would imply.


I'm not sure how the article implies a "high load miracle". But we are quite proud of the efficiency gains and some of our solutions to the problems we encounter, I think.


The problem is the title. You guys are doing great, actually!


I mean, serving a web page is easy. Processing incoming events and everything else that Sentry does is not trivial. You're not comparing like with like.


Yes, but they don't run the whole thing on a single laptop, do they?

Roughly speaking, you take what a laptop can do and multiply it by 1000x to estimate what a cloud service could easily do. (It's just an estimate; reality doesn't work like that, I KNOW.)

Don't take the Nginx-and-web-pages example too literally; it's indeed a poor analogy.

I can't remember the right number for, e.g., Elasticsearch fuzzy search across 10M documents offhand, so I picked the simpler, less relevant number.

Actually, fuzzy full-text search across 10M documents is at least in the range of 100s of queries/sec on a laptop. Meaning they have the resources to do at least, e.g., 100k-500k/s full-text searches across gigabytes of symbols.

I bet their queries are simpler than that, though (neither fuzzy nor full-text). It's debug symbols, after all, not natural text.


If that's the case, why is HN littered with these kinds of issues?

https://news.ycombinator.com/item?id=15584315


I mean serving a web page, from nginx, which is what the person was talking about. Not a database-driven web application.


At Treasure Data, we ingest about 1-2 million events per second. Using your #, this translates to ~6 trillion events per month.

If you (or anyone on this thread) is interested in learning more, please feel free to ping me.


How much is a lot for you? 1 billion/s?


At 8k/s, given the bunch of hardware a typical cloud service will throw at it, it can be down to ~100 events/s per node.

Yes, I'd like to see at least 1k/s per node. That would be something interesting.



