Cross-Account Container Takeover in Azure Container Instances (paloaltonetworks.com)
121 points by arkadiyt on Sept 11, 2021 | 28 comments



What is going on at Azure?! First the Cosmos DB vulnerability [1] three weeks ago and now this?! It’s starting to look like a systemic issue at Microsoft.

tl;dr

- Microsoft uses an old version of runc (from 2016) with known vulnerabilities. runc is a container runtime. Researchers, building on work from 2019, were able to break out of the container onto the Kubernetes node.

- To fix a known vulnerability in Kubernetes' `kubectl exec` handling (instead of patching!), Microsoft has a custom solution (`bridge-pod`) that handles sending `kubectl exec` commands to nodes.

- In Microsoft's custom solution, the HTTP request that `bridge-pod` sends to the node to run `kubectl exec` includes, in a header, an access token with overly broad privileges (not necessary in this case). Once you've escaped the container onto the node, you can monitor traffic and extract this access token by sending a command to your own container from the Azure CLI (a rough sketch of this step follows the list).

- The access token grants `pod/exec` rights to anything in the cluster.
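
A minimal, hypothetical sketch of that token-extraction step (not the researchers' actual tooling; the request layout, port and paths below are assumptions, and the token is an obvious placeholder):

```rust
// Rough sketch of the theft step the bullets above describe: after escaping to the
// node, inspect the `exec` request that bridge-pod sends for your own container and
// pull the bearer token out of its Authorization header.
fn extract_bearer_token(raw_http_request: &str) -> Option<&str> {
    raw_http_request
        .lines()
        .find_map(|line| line.strip_prefix("Authorization: Bearer "))
        .map(str::trim)
}

fn main() {
    // Stand-in for a request captured on the node after triggering
    // `az container exec` against your own container from the Azure CLI.
    let captured = "POST /exec/myns/mypod/mycontainer?command=id HTTP/1.1\r\n\
                    Host: 10.0.0.4:10250\r\n\
                    Authorization: Bearer eyJhbGciOiJSUzI1NiIs...\r\n\r\n";
    if let Some(token) = extract_bearer_token(captured) {
        // With cluster-wide `pod/exec` rights, this token can run commands in any
        // other tenant's container on the cluster.
        println!("captured service-account token: {token}");
    }
}
```

The only point of the sketch is that the token travels with the request, so anyone who controls the receiving node can read it.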

The egregious part is `bridge-pod` and its implementation. This wasn't just outdated software left unpatched; it was a badly implemented hack to work around their inability to patch.

1 - https://www.wiz.io/blog/chaosdb-how-we-hacked-thousands-of-a...


The root cause was a chain of known exploits due to very out of date software:

* RunC v1.0.0-rc2 was released on Oct. 1, 2016, and was vulnerable to at least two container breakout CVEs.

* ACI was hosted on clusters running either Kubernetes v1.8.4, v1.9.10 or v1.10.9. These versions were released between November 2017 and October 2018 and are vulnerable to multiple publicly known vulnerabilities.

Running multitenant workloads in Kubernetes is notoriously difficult, and staying on top of patches is simply table-stakes.


It's not difficult, it's reckless bordering on incompetence. The shared kernel is just way too much attack surface, and too many shared resources, to make it tenable in the face of adversaries.


Yep, given the number of CVEs in Kubernetes, and to a lesser extent runc (https://www.container-security.site/general_information/cont...), it's pretty surprising to see a major cloud service running versions this old.


And the Bridge SSRF one appears to be because someone was constructing URLs via concatenation instead of using a URL serializer/encoder.

E.g. https://play.rust-lang.org/?version=stable&mode=debug&editio... - a serializer not only detects characters that are completely illegal, but also escapes characters that need escaping to be legal.
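
For illustration, a minimal sketch of that difference using Rust's `url` crate (the endpoint, path layout and "container name" are made up for the example, not the bridge's actual API):

```rust
// Concatenation vs. serialization: a crafted value keeps its structural meaning in
// the first URL, but is percent-encoded into a single inert segment in the second.
use url::Url;

fn main() {
    // Attacker-influenced value, e.g. a name that ends up inside an exec URL.
    let name = "c1/../../other-tenant/exec?command=/bin/sh";

    // Naive concatenation: '/' and '?' still act as path and query delimiters,
    // so the value rewrites the path and injects a query string.
    let concatenated = format!("https://host:10250/containers/{}/exec", name);
    println!("{concatenated}");

    // A serializer escapes the segment, so the payload cannot change the path
    // structure or smuggle in a query string.
    let mut safe = Url::parse("https://host:10250/containers").unwrap();
    safe.path_segments_mut().unwrap().push(name).push("exec");
    println!("{safe}");
}
```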


Azure claims compliance with a bunch of standards, including the various ISO and SOC certifications, as well as HIPAA et al. Now, I haven't checked whether this service is in scope, but I would presume it is.

Given such an obvious piece of external evidence that the IMS has failed (not keeping the software up to date, internal audit not detecting that the software is out of date, etc., all the way through to an external incident occurring due to this failure), you have to wonder whether there should be some greater repercussion/outcome than just "we fixed the vuln".

I would very much want to hear what they're doing to fix, e.g., the internal audit. The root cause of this is a management failure.


> not keeping the software up to date

At that scale, "up to date" is not the same as for a standard tech company. It's possible that they run old but custom-patched versions, in the same way Red Hat ships old kernels with hundreds of patches on top.

It certainly is a management failure, but people jumping to conclusions here about old versions may be missing the mark.


In the article you can see that they actually exploit the old issues, which seems to strongly indicate that this is not a case of backporting.


That's why I didn't say they backport anything. I don't know how they run that service internally. But knowing how some large customer-facing services are run, I wanted to caution against the "out of date software" call-outs that happened in many comments. Sometimes you can't "just update things" and have to rely on other ways to mitigate.


Unless the backported fix didn't actually remove the vulnerability. Or the vulnerable systems were reported as fixed without actually being so, because of another bug. There are a lot of imaginary situations that could've happened.

There should be an in-depth analysis from this, but I don't think it's safe to really draw conclusions about their processes from what we know from this exploit alone.


Given that the Azure notification provided zero details on the patching, I think we can apply Occam's razor here quite effectively.

On the one hand, we have an old version with a known vulnerability that was successfully exploited.

On the other hand, we have a backporting process that, over five years, applied fixes to this software and a large number of other issues, yet missed the exploitable vulnerability.

Sure, we can come up with imaginary scenarios where the second option is possible, but when reasoning about a scenario from limited information, I generally go with the simplest possible explanation.


If you start off using Occam's razor and skip the analysis entirely, as you seem fine with, you'll just end up with confirmation bias by only looking at signals that corroborate what you want to be the truth.


Sure, and if we were skipping the analysis here, I'd agree with you.

However, we have several signals from the researcher side and no signals (that I've seen) from the vendor side, and all of the signals point to a lack of patching.

In this case are there any signals you've seen that point to silent backporting that missed the exploited vulnerabilities?


Does Azure use Firecracker like AWS Fargate? It doesn't sound like it. Firecracker isolates the execution environment for every container to a unique VM and network, so breakouts would require a VM breakout.
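
For reference, a hypothetical sketch of the Firecracker model described above (not Fargate's or ACI's actual control plane; the socket path, kernel image and machine config are placeholders), in which each tenant container gets its own microVM driven over Firecracker's per-VM API socket:

```rust
// Sketch: booting one dedicated microVM for a tenant's container via the HTTP API
// that a single `firecracker --api-sock ...` process exposes on a Unix socket.
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn api_put(sock: &str, path: &str, body: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(sock)?;
    let request = format!(
        "PUT {path} HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\n\
         Content-Length: {}\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(request.as_bytes())?;
    // Read only the start of the response (enough for the status line) in this sketch.
    let mut buf = [0u8; 512];
    let n = stream.read(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}

fn main() -> std::io::Result<()> {
    // One Firecracker process and socket per container: escaping the workload now
    // means escaping a dedicated VM rather than a kernel shared with other tenants.
    let sock = "/run/fc-tenant-42.sock";
    api_put(sock, "/machine-config", r#"{"vcpu_count": 1, "mem_size_mib": 256}"#)?;
    api_put(sock, "/boot-source", r#"{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0"}"#)?;
    api_put(sock, "/actions", r#"{"action_type": "InstanceStart"}"#)?;
    Ok(())
}
```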

I have to say, this is so bad/out of date that I wouldn't use ACI for any business needs until I see a live dashboard with the versions of the software they're running so I can look for CVEs. I might as well host my business apps on an early 2000s shared web hosting provider.


From a quick skim this is pretty embarrassing for Azure. I mean one of the main supposed advantages of cloud is that you don't have to worry about patching and updating the underlying infrastructure. But now it turns out that maybe you do!


It's Microsoft. The other cloud providers are light years ahead, so maybe switch to a proper one.


Are they? How so?


Are there any tools that are as easy to use as Azure's?

Anything close? AWS is overly wordy, cluttered trash.


I use AWS and GCP for work. GCP's UX is what keeps me there for personal projects.


AWS or Oracle?

I'll take my chances with Microsoft. Better features & much lower cost.

Azure's light years ahead in MANY areas, if you bother to take a look and learn about it.


I have bothered, and unfortunately the features are more marketing than substance.

So many features that ought to be orthogonal, but aren't. Remind me again why I can't have an IPv4 NAT gateway in an IPv6-enabled vNet?

Kubernetes is also not the only area where Azure is woefully behind on updates. They haven't even kept up with Microsoft products. For example, they still use the deprecated VHD format, even though VHDX has been a thing for 5 years now.


I think big cloud vendors are more or less as bad as each other, for the simple reason that they are all megacorps - where the "enterprise" mindset is made inevitable by the sheer number of employees; that's going to bite, sooner or later. It's mostly a case of figuring out the worst ones (Oracle, IBM...) to reduce the chances of a fuckup - but a certain amount of risk will always be there.

If one can, one should probably "go boutique", where the power balance is a bit more favourable to customers and vendors do really care about their products and their reputation.


This looks like a design flaw in Azure's container service.

An engineer working on gVisor at Google had this to say: https://twitter.com/IanMLewis/status/1436390377716523011


Yikes the DB takeover and now this? Bad month for security at Azure.


> Azure Container Instances (ACI) was released in July 2017 and was the first Container-as-a-Service (CaaS) offering by a major cloud provider.

Amazon’s ECS launched in 2014. Is that not container hosting as a service?


It looks like they're defining CaaS as the provider completely managing the underlying infrastructure.

> "With ACI, customers can deploy containers to Azure without managing the underlying infrastructure"

ECS was launched in 2014, but was implemented as an agent on an EC2 instance. AWS's completely managed container solution (Fargate) went GA in Nov 2017.


All thanks to 2+ year old software with known vulnerabilities.


You normally take a cloud offering so that you can truthfully claim you didn't know it wasn't state of the art.



