What is going on at Azure?! First the Cosmos DB vulnerability [1] three weeks ago and now this?! It’s starting to look like a systemic issue at Microsoft.
tl;dr
- Microsoft uses an old version of runc (from 2016), with known vulnerabilities. runc is a container runtime. Researchers, using work from 2019, were able to break out of the container into the Kubernetes node.
- Instead of patching a known Kubernetes vulnerability in `kubectl exec`, Microsoft built a custom solution (`bridge-pod`) that handles sending `kubectl exec` commands to nodes.
- In Microsoft's custom solution, the HTTP request that `bridge-pod` sends to the node to run `kubectl exec` includes an access token with overly broad privileges in the header (not necessary in this case). Once you've escaped the container onto the node, you can monitor traffic, send a command to your own container from the Azure CLI, and extract this access token.
- Access token grants `pod/exec` rights to anything in the cluster
The egregious part is the `bridge-pod` and its implementation. This wasn't just outdated software left unpatched; it was badly implemented hacks to work around their inability to patch.
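To make the "overly broad privileges" point concrete, here is a toy model in Python, not Kubernetes RBAC and not Microsoft's code. The `Grant` class, the `allows` helper, and the pod names `tenant-a` and `tenant-b` are all made up for illustration. It only shows the difference between a token that can exec into any pod and one scoped to the single pod the request is for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical stand-in for what a token is allowed to do."""
    action: str                 # e.g. "pods/exec"
    pod: Optional[str] = None   # None means "any pod in the cluster"


def allows(grant: Grant, action: str, pod: str) -> bool:
    """Return True if the grant permits `action` against `pod`."""
    return grant.action == action and grant.pod in (None, pod)


# Token as described in the thread: exec rights on anything in the cluster.
broad = Grant(action="pods/exec")

# What the request arguably needed: exec rights on the caller's own pod only.
scoped = Grant(action="pods/exec", pod="tenant-a")

# A leaked broad token lets tenant A reach tenant B's pod.
print(allows(broad, "pods/exec", "tenant-b"))   # True
# A leaked scoped token does not.
print(allows(scoped, "pods/exec", "tenant-b"))  # False
```

With the scoped variant, sniffing the token on the node would only hand the attacker access they already had.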
The root cause was a chain of known exploits due to very out of date software:
* RunC v1.0.0-rc2 was released on Oct. 1, 2016, and was vulnerable to at least two container breakout CVEs.
* ACI was hosted on clusters running either Kubernetes v1.8.4, v1.9.10 or v1.10.9. These versions were released between November 2017 and October 2018 and are vulnerable to multiple publicly known vulnerabilities.
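A rough sketch of the kind of version check people are asking for here, in Python. The three Kubernetes versions are the ones quoted above; the baseline `v1.10.0` is an arbitrary placeholder picked for the example, not an actual patched release, and `parse_version` and `older_than` are hypothetical helper names.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a string like 'v1.9.10' into a comparable tuple (1, 9, 10)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def older_than(running: list, baseline: str) -> list:
    """Return the versions in `running` that are below `baseline`."""
    floor = parse_version(baseline)
    return [v for v in running if parse_version(v) < floor]


# Versions quoted in the thread for the ACI clusters.
aci_versions = ["v1.8.4", "v1.9.10", "v1.10.9"]

# Placeholder baseline, chosen only to demonstrate the comparison.
print(older_than(aci_versions, "v1.10.0"))  # ['v1.8.4', 'v1.9.10']

# Why the parsing matters: plain string comparison gets the order wrong.
print("v1.10.9" < "v1.9.10")  # True, even though 1.10.9 is the newer release
```

Note that this only compares version numbers. It says nothing about backported fixes, so a hit is a reason to look closer rather than proof of a vulnerability.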
Running multitenant workloads in Kubernetes is notoriously difficult, and staying on top of patches is simply table-stakes.
It's not difficult, it's reckless bordering on incompetence. The shared kernel is just way too much attack surface and shared resources to make it tenable in the face of adversaries.
Azure claims compliance with a bunch of standards, including the various ISO, SOC, etc., as well as HIPAA et al. Now, I haven't checked that this service is in scope, but I would presume it is.
Given such an obvious piece of external evidence that the IMS has failed (not keeping the software up to date, internal audit not detecting that the software is out of date, etc., all the way through to an external incident occurring due to this failure), you have to wonder whether there should be some greater repercussion or outcome than just "we fixed the vuln".
I would very much want to hear what they're doing to fix, e.g., the internal audit. The root cause here is a management failure.
At that scale "up to date" is not the same as for a standard tech company. It's possible that they run old but custom patched versions in the same way RedHat shipped old kernels with hundreds of patches on top.
It certainly is a management failure, but people jumping to conclusions here about old versions may be missing the mark.
That's why I didn't say they do backport anything. I don't know how they run that service internally. But knowing how some large customer-facing services are run, I wanted to caution against the "out of date software" call-outs in many comments. Sometimes you can't "just update things" and have to rely on other ways to mitigate.
Unless the backported fix didn't actually remove the vulnerability. Or the vulnerable systems were reported as fixed without actually being so because of another bug. There are a lot of imaginary situations that could've happened.
There should be an in-depth analysis from this, but I don't think it's safe to really draw conclusions about their processes from what we know from this exploit alone.
Given that the Azure notification provided zero details on the patching, I think we can apply Occam's razor here quite effectively.
On the one hand we have an old version with a known vulnerability that was successfully exploited.
On the other hand we have a backporting process that for 5 years applied fixes for this and a large number of other issues but missed the exploitable vulnerability.
Sure we can come up with imaginary scenarios where the second option is possible, but when reasoning about a scenario from limited information I generally go with the simplest possible explanation.
If you start off with Occam's razor and skip the analysis entirely, as you seem fine with doing, you'll just end up with confirmation bias, looking only at signals that corroborate what you want to be true.
Sure, and if we were skipping the analysis here, I'd agree with you.
However, we have several signals from the researcher side and no signals (that I've seen) from the vendor side, and all of the signals point to a lack of patching.
In this case are there any signals you've seen that point to silent backporting that missed the exploited vulnerabilities?
Does Azure use Firecracker like AWS Fargate? It doesn't sound like it. Firecracker isolates the execution environment for every container to a unique VM and network, so breakouts would require a VM breakout.
I have to say, this is so bad/out of date that I wouldn't use ACI for any business needs until I see a live dashboard with the versions of the software they're running so I can look for CVEs. I might as well host my business apps on an early 2000s shared web hosting provider.
From a quick skim this is pretty embarrassing for Azure. I mean one of the main supposed advantages of cloud is that you don't have to worry about patching and updating the underlying infrastructure. But now it turns out that maybe you do!
I have bothered, and unfortunately the features are more marketing than substance.
So many features that ought to be orthogonal, but aren't. Remind me again why I can't have an IPv4 NAT gateway in an IPv6-enabled vNet?
Kubernetes is also not the only area where Azure is woefully behind on updates. They haven't even kept up with Microsoft products. For example, they still use the deprecated VHD format, even though VHDX has been a thing for 5 years now.
I think big cloud vendors are more or less as bad as each other, for the simple reason that they are all megacorps - where the "enterprise" mindset is made inevitable by the sheer number of employees; that's going to bite, sooner or later. It's mostly a case of figuring out the worst ones (Oracle, IBM...) to reduce the chances of a fuckup - but a certain amount of risk will always be there.
If one can, one should probably "go boutique", where the power balance is a bit more favourable to customers and vendors do really care about their products and their reputation.
It looks like they're defining CaaS as the provider completely managing the underlying infrastructure.
> "With ACI, customers can deploy containers to Azure without managing the underlying infrastructure"
ECS was launched in 2014, but was implemented as an agent on an EC2 instance. AWS's completely managed container solution (Fargate) went GA in Nov 2017.
[1] https://www.wiz.io/blog/chaosdb-how-we-hacked-thousands-of-a...