This is an interesting line of discussion. How do you know you are optimally utilizing the hardware you do have? For example, is each machine running at near 100% CPU? Is the network card saturated? Is the database the bottleneck? How about the search indexing operation? What metrics are you tracking and how are you tracking them? If you're at liberty to discuss, it would be highly enlightening!
> How do you know you are optimally utilizing the hardware you do have?
That's probably a question better answered by actual ops people. What we generally do, however, is look at our metrics to see where there is room for improvement. Two obvious cases that we identified and resolved by improving our approach were JavaScript sourcemap processing and debug symbol handling.
There we knew we were slow based on how far the p95 timing was from the average and median. We looked specifically at what the problematic event types looked like and what we could do to handle them better. We now simply use better approaches to dealing with this data, with multiple layers of more intelligent caches, and that cut down our event processing time and with it the amount of actual hardware we need.
So the answer, at least as far as event processing goes: in some of the cases we spent time on, we were CPU bound, knew we could do better, and did. I wrote a bit on our blog about how we moved sourcemap processing from Python to Rust. The latest big wins in the processing department came from a rewrite of our dSYM handling, where we now use a custom library to preprocess DWARF files and various symbol tables into a more compact format so we can do much faster lookups for precisely the data we care about.
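To make the "p95 far from the median" signal concrete, here is a minimal sketch of how one could rank event types by tail latency. The function name `tail_ratios` and the `(event_type, seconds)` sample shape are assumptions for illustration, not our actual tooling, which lives in our metrics stack.

```python
from collections import defaultdict
from statistics import median, quantiles


def tail_ratios(samples):
    """Group timings by event type and compare p95 against the median.

    `samples` is an iterable of (event_type, seconds) pairs (hypothetical shape).
    Returns {event_type: (median, p95, p95 / median)}.
    """
    by_type = defaultdict(list)
    for event_type, seconds in samples:
        by_type[event_type].append(seconds)

    report = {}
    for event_type, timings in by_type.items():
        if len(timings) < 2:
            continue  # quantiles() needs at least two data points
        med = median(timings)
        # n=100 yields 99 cut points; index 94 is the 95th percentile.
        p95 = quantiles(timings, n=100, method="inclusive")[94]
        ratio = p95 / med if med else float("inf")
        report[event_type] = (med, p95, ratio)
    return report
```

An event type whose ratio is far above 1 has a long tail and is a good candidate for a closer look at what its slow events have in common.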
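As a rough illustration of why preprocessing symbol data pays off, the sketch below sorts symbols once and then answers address lookups with a binary search. This is not our actual format or library; the `SymbolIndex` class and the `(start_address, size, name)` tuples are hypothetical and only show the general idea of keeping just the data needed for lookups.

```python
from bisect import bisect_right


class SymbolIndex:
    """Hypothetical compact index: sorted start addresses plus parallel arrays."""

    def __init__(self, symbols):
        # symbols: iterable of (start_address, size, name), assumed non-overlapping
        ordered = sorted(symbols)
        self._starts = [start for start, _, _ in ordered]
        self._ends = [start + size for start, size, _ in ordered]
        self._names = [name for _, _, name in ordered]

    def lookup(self, address):
        """Return the name of the symbol containing `address`, or None."""
        # Find the last symbol that starts at or before the address.
        i = bisect_right(self._starts, address) - 1
        if i < 0 or address >= self._ends[i]:
            return None
        return self._names[i]
```

Building the index costs one sort, after which each lookup is O(log n) instead of a scan over the full debug file.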
> Is the network card saturated?
In other cases that has been an issue, but it's less of a problem on the actual workers, which are what's relevant to the number we are discussing here. As far as the load balancers go, there is a ton more data coming in, but not all of it will make it into the system. A not uncommon case is that someone deploys our code into a mobile app SDK and then stops paying. The apps will still spam us, but we want to block this off before it hits the expensive parts of the system.
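The idea of blocking traffic before it reaches the expensive parts can be sketched as a cheap check at the front of the pipeline. Everything here is an assumption for illustration: the `handle_request` function, the `blocked_keys` set, and the choice of status codes are hypothetical and do not describe how our load balancers are actually configured.

```python
def handle_request(project_key, payload, blocked_keys, process):
    """Reject events for blocked projects before any expensive work happens.

    `blocked_keys` is a set of project keys that should no longer be served
    (for example, accounts that stopped paying). `process` stands in for the
    expensive event processing pipeline.
    """
    if project_key in blocked_keys:
        # O(1) set lookup; the payload is never parsed or queued.
        return 429  # status code is an assumption, 403 would work as well
    process(payload)
    return 200
```

The point is only that the rejection costs a set lookup, while accepted events go on to the workers.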
> Is the database the bottleneck?
When is a database not a concern ;)
What we are tracking is a ton; how we are tracking it is a lot of Datadog.