I think the machine learning aspect is the wrong choice here. I'm seeing another glaring answer, and one that can be done fairly quickly: Integral analysis.
The servers that accumulate errors aren't always the ones with the highest spikes, as shown in the link. However, they do consistently stay above the others in total error count. So the approach is this:
Since your time series is quantized, the integral is simply a sum over the timeframe. I would recommend multiple time windows, like "10 seconds, 3 minutes, 1 hour, 12 hours". Do this for all your servers.
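A minimal sketch of what I mean, assuming the per-server counts live in a pandas Series indexed by timestamp (the window sizes and names are just illustrative):

```python
# Sketch: windowed "integrals" (sums) of a quantized error series.
# Since the series is discrete, the integral over a window is just a rolling sum.
import pandas as pd

def windowed_error_sums(errors: pd.Series,
                        windows=("10s", "3min", "1h", "12h")) -> pd.DataFrame:
    """`errors` is one server's error counts indexed by timestamp.
    Returns the trailing sum over each of the requested time windows."""
    return pd.DataFrame({w: errors.rolling(w).sum() for w in windows})

# Compute the same windows for every server, then compare servers within
# each window using the mean/stddev approach described below.
```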
Now you have a normalized 2D graph with respect to time. We can calculate the mean of all servers' error areas as well as the standard deviation. To reduce CPU load, one can then run a modified Thompson tau test at +/-2σ for outlier detection: everything below +2σ covers roughly 97.7% of a normal distribution, so you would only be checking ~2.3% of your data.
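A rough sketch of that check, using a plain one-sided z-score cut at 2σ as a stand-in for the full modified Thompson tau rejection:

```python
import numpy as np

def flag_outliers(area_by_server: dict, sigma_cut: float = 2.0) -> list:
    """Return servers whose error 'area' sits more than `sigma_cut` standard
    deviations above the fleet mean. This is a plain z-score cut; swap in a
    modified Thompson tau rejection if you want the stricter test."""
    values = np.array(list(area_by_server.values()), dtype=float)
    mean, std = values.mean(), values.std(ddof=1)
    if std == 0:
        return []
    return [name for name, area in area_by_server.items()
            if (area - mean) / std > sigma_cut]

# Only the ~2.3% of servers above +2 sigma ever reach the more expensive follow-up checks.
```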
When a server fails an outlier test (in other words, is detected as an outlier), its historical data can also be investigated. Is the machine gradually getting worse? If so, it could be removed from service until memory and CPU tests can be run. You can also run anomaly detection on how many outliers occur per hour; passing a threshold could indicate potentially faulty machines.
My answer also assumes that the machines carry equal load. I'm sure this is not the case, in which case you will have to normalize by errors per request. This shouldn't be a problem.
I noticed that this (density-based clustering) is conceptually very different from an earlier Netflix approach (RAD - http://techblog.netflix.com/2015/02/rad-outlier-detection-on...) which performs RPCA and estimates thresholds. Are both systems in place? In which cases would one system outperform another?
RAD requires at least 2 cycles of data, whereas the real-time analytics team tends to operate on very short time frames. Our approach is less accurate but able to make quick decisions, which is the objective in this case.
Both systems are certainly in place, though they serve different business use cases. RAD isn't used in real-time for operational decisions and Kepler isn't used for anything that Data Science uses RAD for.
There are a lot of online methods for reinforcement learning, and one could argue that reinforcement learning is doing exactly what you want for outlier detection: doing a cost benefit analysis of the value of exploring a given context in more or less detail.
Have you guys thought at all about adapting a reinforcement learning model for building an online model to detect outliers?
We have, specifically in the domain of using RL for parameter selection. Part of the reason I was brought on board was background experience in RL. I fully believe it's the key to turning domain (yes/no) decisions into numbers for our algorithms to ingest.
Ultimately we would like all of our systems to use RL to interpret user feedback to determine parameters, actions, etc...
Service owners wanted the system to identify these problems as quickly as possible, so the initial goal was to go as small as possible. However, the initial data source rolled data up into 1-minute windows, so we figured 5 minutes was the shortest we could get away with. In practice this seems to work well for service owners, as it avoids calling out small spikes as outliers.
We've got an experimental streaming version going that hasn't been set loose on any services yet, but it can get much higher granularity metrics ~ 10s (faster if we cared to).
edit: Forgot to add that we did try other window sizes, as long as 30 minutes, but we found that longer windows allowed the past to influence the current decision too much. With 30-minute windows, if a server had spiked in the past we were too aggressive about calling it an outlier; furthermore, if it had been inlying and only just become an outlier, it killed our time to detect, which is an important metric for us.
At the risk of sparking off-topic debates about Netflix's content selections...are the same event streaming/data processing frameworks also used to detect trends and make actions on the content/frontpage side? Such as, detecting an anomalous spike of users watching some obscure movie in the last 24 hours, and then updating the "Trending Now" queues for users systemwide? Or is that something that happens (relatively) slow enough not to require an elaborate real-time framework?
I don't work on the recommendations side of things, but this is actually the ONE technical topic at Netflix that I don't believe I am allowed to discuss. Sorry!
Very cool work! You mentioned that in a lot of cases, basic thresholds didn't work very well. I understand that setting a direct cutoff would trigger many false positives or not enough true positives, but what about time-based thresholds?
I had always figured timed filters like "above X error rate for Y occurrences in Z time" may work. Threshold filters can be much more complex, I'm curious if you tried different types of threshold filters and if they worked or not.
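To be concrete about the kind of filter I mean, here's a sketch; X/Y/Z map to rate/occurrences/window_s, and all the names and values are hypothetical:

```python
from collections import deque
import time

class TimedThreshold:
    """Fire only if the error rate exceeds `rate` at least `occurrences`
    times within the last `window_s` seconds. Purely illustrative."""

    def __init__(self, rate, occurrences, window_s):
        self.rate, self.occurrences, self.window_s = rate, occurrences, window_s
        self.hits = deque()  # timestamps of threshold breaches

    def observe(self, error_rate, now=None):
        now = time.time() if now is None else now
        if error_rate > self.rate:
            self.hits.append(now)
        # Drop breaches that have aged out of the window.
        while self.hits and now - self.hits[0] > self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.occurrences

# alert = TimedThreshold(rate=0.05, occurrences=3, window_s=120)
```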
One of the more complex situations that this solution attempts to address is the situation where the entire cluster surges upwards.
Imagine that service A depends on service B, and service B begins experiencing an issue. The outlier detection system that service A has setup will see the entire cluster surge upwards on errors. This would trip a traditional threshold based approach, but the last thing we'd want to do is terminate servers in service A. In this case the entire cluster moving upwards sets the context that there is a larger scale problem going on (nothing is outlying), whereas if only a few move upwards (become outliers) we have a localized issue we can fix.
Meanwhile service B through either outlier or threshold alerts is paged and dealing with their issue.
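A toy illustration of that context check (not the production logic; names and thresholds are made up): if most of the fleet is elevated, treat it as a systemic problem instead of terminating individual servers.

```python
def classify_surge(error_rates: dict, elevated: float, systemic_fraction: float = 0.5):
    """If more than `systemic_fraction` of service A's servers are elevated,
    assume an upstream/systemic issue (e.g. service B is struggling) and do
    NOT terminate anything; otherwise the elevated few are outlier candidates."""
    high = [s for s, r in error_rates.items() if r > elevated]
    if len(high) > systemic_fraction * len(error_rates):
        return "systemic", []   # page the owning team, leave service A alone
    return "local", high        # localized issue: these servers are candidates for removal
```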
As a side note, we do something similar to time-sensitive thresholds when doing threshold-based alerting on differences between the observed data and a double exponential smoothing fit. Parameter selection is an issue at times in this case; we talk about this a bit in an older blog post on Stream Starts per Second (SPS): http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-str...
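For the curious, a bare-bones sketch of alerting on the residual between observations and a Holt double-exponential-smoothing forecast; alpha, beta, and the threshold are exactly the parameters that are fiddly to choose, and the values here are illustrative:

```python
def holt_residual_alerts(series, alpha=0.5, beta=0.1, threshold=3.0):
    """Yield (index, residual) whenever an observation strays more than
    `threshold` from a Holt double-exponential-smoothing forecast."""
    level, trend = series[0], series[1] - series[0]
    for i, x in enumerate(series[2:], start=2):
        forecast = level + trend
        residual = x - forecast
        if abs(residual) > threshold:
            yield i, residual
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
```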
What is considered an "error"? Are there further classifications for different types of errors? Maybe there are certain error types that contribute more to allow you to detect bad servers? Seems like just having a single dimension is a bit heavy-handed. Would multiple dimensions slow it down too much to be useful?
The problem is the error itself. Anything else is a potential root cause, which can range from failing hard drives, CPU failure, memory failure, NIC failure, a rat chewing on cables, or plenty of other modes.
I would say that we don't care about root cause when the event happens; just get the server out of the pool. RCA can be done post mortem.
I would reduce dimensionality down to 2D: errors over time. In that case we have a great deal of statistical tools at hand, and they don't require the hand-waving of N-dimensional cluster detection, where only the machine has any real insight into what it's detecting. Having something like 1000 dimensions is also tremendously slow compared to integral analysis of errors with respect to time.
I didn't consider statistical process control for this task when I did my initial research. It's certainly something I wish I had read about earlier. I have however since started investigating it for a related real-time project that I hope will someday manifest itself on the tech blog as well.
In "Parameter Selection" you say "We simplify this by having service owners define the current number of outliers, if there are any, at configuration time."
How do service owners get this data? by manual inspection of graphs like the ones shown earlier?
We tend to operate on server performance metrics, error rates, and networking metrics. We've found through practice that these metrics tend to reveal most of the issues we're targeting.
If you meant in the more mathematical sense we perform our clustering in a normalized euclidean space.
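For a concrete picture, here's a rough sketch of clustering server metrics in a normalized Euclidean space with scikit-learn; DBSCAN and the feature set are stand-ins I chose for illustration, not our exact pipeline:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def find_outlier_servers(metrics: np.ndarray, server_ids: list, eps: float = 0.5):
    """`metrics` is (n_servers, n_metrics), e.g. error rate and TCP retransmits.
    Standardize so every metric contributes equally to Euclidean distance,
    then treat points DBSCAN labels as noise (-1) as outlier servers."""
    scaled = StandardScaler().fit_transform(metrics)
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(scaled)
    return [sid for sid, label in zip(server_ids, labels) if label == -1]
```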
It appears to cluster on a particular server metric (e.g. % CPU used), for readings of that metric across a group of servers. One of the example graphs had 'errors per servers,' and I'd presume each color on that graph represents a different server.
It's just a matter of collecting and tagging enough. We've been working on it with our new operational data tagger. Keep an eye out for a future post! :)
Are the results fed to automated remediation systems, or raised for operator interventions (or both)? If automatic, what restraints do you have against too many servers being called outliers?
The answer is most certainly "both". The system can be used to identify problems you don't know you had (generally sending an email / page). Once you've identified a problem you may have it pull instances out of our discovery service but keep them alive so you can ssh in and triage. Once the problem has been identified you can have it mop up by terminating the services displaying the behavior while you work on a fix.
We throttle automatic terminations so that it doesn't drop an entire cluster at once. Yet to cause an outage, fingers crossed!
We generally run into two classes of errors: 1) Software bugs which follow the process outlined above. 2) Issues with AWS... an example being some virtual servers running on hardware experiencing a network issue. If they're terminated and replaced by the ASG generally the new ones spin up on good hardware and we've avoided the issue. Rare but it does happen at our scale.
Have you heard of Numenta? Their machine intelligence algorithms are great for anomaly detection in streaming data, including a product "Grok" for IT analytics (http://numenta.com/grok/). And all open source in NuPIC: https://github.com/numenta/nupic
This was something I looked into when I performed the initial investigation for this project. It was a bit difficult to locate supporting academic material on the algorithm, though. The white paper on the page seemed more like marketing material than an academic paper, which I imagine serves their business purposes better.
I will say that the fact that nupic produces both an outlier score and a confidence score is something that would have been incredibly useful to our end users by the time this was rolled out. Definitely worth a look for anyone looking to do real-time stream processing for anomaly detection.
This is very interesting. Looking at the data though it is classified as "Errors per server". It isn't really disclosed what variables this figure entails but I'd have to imagine adding more information than simply error counts would improve the separability of the data?
It would certainly help separability to observe the data in a higher-dimensional space; however, when you're taking automated action against the results, it's sometimes pertinent to know which specific metric is causing the server to be an outlier.
For example, if network TCP retransmits are throwing it off, we probably just want the system to kill it and let the autoscaling group bring up another server. If it's memory usage, we probably want to page someone.
This is very interesting and a source of frustration I have with New Relic and the other alerting services we use. New Relic uses a 5-minute rolling average for error rates, and alerts when that average goes above some threshold. However, that means that it takes ~5 minutes from a spike occurring to an alert being created - even if the error rate has increased to 50%.
It would be much better for it to be doing this sort of outlier detection - a gradual increase in error rate to 3% should not trigger a critical alert, whereas a big jump in error rates should trigger an alert quickly.
It would be interesting to try to fit a distribution to error rates (some type of counting process) and then monitor the probability of observing the number of errors that occurred within some period of time. Low-probability events might then indicate an outlier.
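A minimal version of that idea, assuming errors per interval are roughly Poisson (real traffic is often overdispersed, so take the fit with a grain of salt; all names here are illustrative):

```python
from scipy import stats

def error_count_surprise(history, observed, alpha=1e-3):
    """Fit a Poisson rate to historical per-interval error counts, then flag
    the newly observed count if its upper-tail probability is below `alpha`.
    Overdispersed traffic may need a negative binomial instead."""
    rate = sum(history) / len(history)
    p_tail = stats.poisson.sf(observed - 1, rate)  # P(X >= observed)
    return p_tail < alpha, p_tail
```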
We have another component of the same outlier detection system that does this type of fitting, identifying low probability events using a Bayesian model + Markov Chain Monte Carlo. It hasn't gained nearly as much traction internally (yet) as the clustering approach here.
Why not take this a step further and have another system try to diagnose the issue with the identified outlier server before handing it over to the alert system? Seems like you'd be generating a lot of alerts otherwise.
Our central alerting gateway (CAG) does much more than just send emails and pages. When the system was first built we hooked into CAG directly, as it is capable of taking many of the desired actions on its own.
The data looks like it could lend itself to an approach where you model the error rate based on the prior data (at simplest, get a mean and variance out of it) and then use a Chi-square critical range check to see if the last n (degrees of freedom in the check) measurements are likely to have come out of the modelled distribution. Is that something you've considered?
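Something like this, roughly (assumes approximately normal errors; the names are illustrative):

```python
import numpy as np
from scipy import stats

def chi_square_range_check(prior, recent, alpha=0.01):
    """Model the error rate by the prior data's mean and variance, then ask
    whether the sum of squared z-scores of the `recent` measurements exceeds
    the chi-square critical value with len(recent) degrees of freedom."""
    mean, var = np.mean(prior), np.var(prior, ddof=1)
    statistic = np.sum((np.asarray(recent) - mean) ** 2) / var
    critical = stats.chi2.ppf(1 - alpha, df=len(recent))
    return statistic > critical, statistic, critical
```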
We have considered modeling the distribution from which the data is typically drawn and then calculating likelihood of newly observed data. Some of the approaches that we use to detect anomalies on stream starts per second (SPS) now depend on these services. Same software package, slightly different solution.
A colleague of mine (Chris) implemented a data tagger which allows users to annotate data that is typically fed into this system. We have plans to have the backend automatically swap out the algorithm based on performance against their tagged data.
You could also have the client run some tests and auto report problems such as the bad server's id.
You could also have a manual report button the user can click on. And if you want to get really advanced, have a system that learns from those manual reports so that it can later warn when there is a high probability of a "problem" occurring.
One of my colleagues designed and implemented a telemetry data tagger that allows service owners to annotate the data reported into our primary telemetry system Atlas [0] and has further worked on mechanisms for service owner feedback to automatically create training data sets for this implementation.
Great question, and I'm glad others are thinking along these lines -- we fully believe that this is important for bridging the gap between service owners and our analytics.
I think the graphs would be better presented differently.
Instead of a line graph, try different opacities. So you have one line per server, going left-to-right, shading from white (for no errors) to whatever color (for the maximum number of errors overall). And perhaps a dot per server and a single line indicating overall status.
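Something along these lines with matplotlib, assuming the errors are already binned into a servers-by-time matrix (the shape and names are assumptions on my part):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_error_heatmap(errors: np.ndarray, server_names: list):
    """One horizontal band per server, shaded from white (no errors) to dark
    red (the fleet-wide maximum), left to right over time."""
    fig, ax = plt.subplots(figsize=(10, 0.3 * len(server_names) + 1))
    ax.imshow(errors, aspect="auto", cmap="Reds", vmin=0, vmax=errors.max())
    ax.set_yticks(range(len(server_names)))
    ax.set_yticklabels(server_names)
    ax.set_xlabel("time bucket")
    plt.show()
```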
Regarding devops alert spam, on the one hand it's possible to setup rules and filters and tweak notifications just right, but that is often an initial hassle.
Possibly a startup opportunity to offer an easier option with error logs + machine learning.
You absolutely can. The FFT is just an efficient way to compute the Discrete Fourier Transform over a window of samples, and since this data is both discrete and taken at regular intervals, it should behave identically to sampling an audio signal.
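For example, with numpy on a regularly sampled error-count series (purely a sketch):

```python
import numpy as np

def dominant_periods(errors: np.ndarray, sample_spacing_s: float, top_k: int = 3):
    """Treat the regularly sampled error counts like an audio signal: take the
    real FFT and return the periods (in seconds) of the strongest frequency
    components, e.g. to spot recurring cycles in the error rate."""
    spectrum = np.abs(np.fft.rfft(errors - errors.mean()))
    freqs = np.fft.rfftfreq(len(errors), d=sample_spacing_s)
    order = np.argsort(spectrum[1:])[::-1] + 1  # skip the DC bin
    return [1.0 / freqs[i] for i in order[:top_k]]
```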
I should note that the team responsible for this (Insight Engineering) is hiring a manager[0] and engineers[1]. If you're interested in solving this kind of problem come check out the job post as I'm really excited to be building out our team.
How does someone get into doing work like this? This sounds like it would be pretty cool. I started my first job working with infrastructure/ops for a larger scale app and it's been really interesting.