I don't understand how they can be as fast as regular containers if they run an entire kernel on top of a hypervisor?



I don't have numbers either, but it's a combination of extreme focus on the boot path and virtio drivers, and traditional containers now being quite heavyweight to start (especially when run via Kubernetes).

The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front. This can practically limit the number of Katacontainers you can run to something much smaller than is possible with ordinary containers, since RAM is the constrained resource on most servers.

Nevertheless, with confidential computing coming along, it's likely that at some point in the future many containers will really be VMs, since current CPUs implement confidential computing on top of existing VM primitives (and that's basically necessary due to the way the guest RAM is encrypted). It's likely that any workload that touches PII, finance, health, etc. will be required to use confidential computing.


> but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.

Yup, that's always been the big reason to use containers for me. Startup time and runtime performance are nice benefits, but the memory usage is the giant win: freeing memory in response to the app's needs, and not needing extra memory for running the various OS parts and pieces.

The downside is, of course, security. But that was always the case with containers.


I think in the longer run, WASM might displace a lot of both in practical terms.


Why would wasm replace containers? If you're going to run a binary why not just compile it for the local system?

We've always had 'compile once, run anywhere', but there have always been caveats and gotchas.


Something still compiles a WASM binary for the local system. Possibly, being able to optimize the WASM without recompiling it from source might be a win? Not needing separate binaries for ARM and x86 is nice, so it should run on a Mac more easily. Also, it runs on an edge server or in a browser, even on a phone, if you care about that.

I don’t think it will replace Docker files since they let you package up such a wide variety of existing server software and WASM is more limited. But if your software does compile to WASM then maybe you don’t care about that.

I think of WASM more like a plugin format, but I expect there will be a lot of engineering effort put into optimizing it, like happened with V8 for JavaScript. Not all web standards win, but betting against one that’s well-established and has a lot of support seems like a mistake.


wasm targets the wasm runtime virtual machine (i.e. a JavaScript VM), offering fine-grained isolation compared to virtualizing the whole operating system.

edit: don't shoot the messenger. I was merely highlighting the main difference between native and webasm in the context of the discussion.


You could say the same thing for anything that targets JVM or CLR, and they're far more mature than any JS runtime.


Or even Lua, which is trivial to sandbox.


For serverless (as in AWS-Lambda-like) I agree. In that use case WASM provides a better security barrier than containers, with faster cold-start time (which is really important for the scaling promise of these services).

For the stuff people run on their Kubernetes clusters I have more mixed expectations. Containers are more universal, but I can totally see a microservice architecture running as a lot of WASM runtimes with a handful of containers.


Particularly in the case of Amazon Lambdas, those are running in VMs already (Firecracker). Why wouldn't you skip the WASM VM and use a static binary from Rust instead?


> The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.

Conversely, the problem with containers is that memory allocation, including the OS page cache, is not guaranteed. That's bad for a lot of applications, especially databases. It seems Docker has some support for a shared page cache, but it's not in the Kubernetes pod spec as far as I can see. [0] You would probably need some kind of annotations and a specialized controller to make this work.

[0] https://github.com/kubernetes/kubernetes/issues/43916


In Kubernetes 1.25+, page cache usage accounting is improved thanks to the use of cgroups v2. https://kubernetes.io/docs/concepts/architecture/cgroups/
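If you want to see what's actually charged, the cgroup v2 memory.stat file exposes the per-cgroup page cache directly. A quick sketch in C (an assumption here: the unified hierarchy is mounted at /sys/fs/cgroup and you're reading your own container's cgroup, which varies by distro/runtime):

    /* Print the page cache charged to the current cgroup (cgroups v2).
       The "file" field of memory.stat is the page-cache usage. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/sys/fs/cgroup/memory.stat", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            unsigned long long bytes;
            if (sscanf(line, "file %llu", &bytes) == 1) {
                printf("page cache: %llu bytes\n", bytes);
                break;
            }
        }
        fclose(f);
        return 0;
    }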


Kata containers support memory ballooning like most modern VMs: https://en.wikipedia.org/wiki/Memory_ballooning so a fixed allocation isn't needed, reducing over-provisioning.

https://github.com/kata-containers/kata-containers/blob/d50f... uses virtio-mem


This isn't a substitute (nor is virtio-mem, the modern equivalent). The problem is the application running in userspace inside the guest cannot request more memory when, for example, it does a mmap or sbrk.


Which application languages and frameworks support this kind of dynamic memory allocation? For predictable performance and throughput we benchmark our Java applications under specific CPU and memory constraints and specific heap and memory settings. How would an app in a container suddenly give back RAM? A garbage-collected application may be able to do that by collecting garbage. Possibly. But others?


No idea about Java, but any C program will request memory using mmap, and may give it back using munmap. This doesn't work when the program is running inside a VM, but does work for containers (which are basically just regular processes).
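To make that concrete, here's a minimal sketch of the request/release cycle from the process's point of view (plain Linux, nothing Kata-specific):

    /* Request memory from the kernel, touch it, give it back.
       In a container these are ordinary syscalls against the host
       kernel; in a VM the guest kernel keeps the pages unless
       ballooning/virtio-mem hands them back to the hypervisor. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64 * 1024 * 1024; /* 64 MiB */

        /* Request: reserves address space; physical pages are only
           allocated as they're touched. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 0, len); /* now the pages are real */

        /* Release: on a container host these pages are free for other
           processes right away. madvise(p, len, MADV_DONTNEED) is the
           variant that releases pages while keeping the mapping. */
        if (munmap(p, len) != 0) { perror("munmap"); return 1; }
        return 0;
    }

(glibc's malloc does this under the hood for large allocations, and malloc_trim() can push freed heap back to the kernel the same way.)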


How does this work in practice? If the application didn't need this memory anymore because it is done with the work/data, shouldn't it already have freed or munmapped it? Is there a signal that can be sent to a process to free up and return memory?


With VM ballooning, VMs are also able to claim and release memory to the hypervisor/host OS.


Not driven by the guest application they don't. It's frustrating that I actually work in this area and know what I'm talking about. I worked on Katacontainers back when it was Intel's Clear Containers in the late 2010s. So many people in this thread do not have a clue.


Interesting, that's a very annoying constraint then


I couldn't understand the comment that "the application running in user space cannot request more memory" - can someone explain what's the point of memory ballooning anywhere if an application cannot signal when the system should actually provision physical memory from the 'balloon'?


There isn't a point, that's the problem.


The systems administrator can use ballooning to give more memory to a VM before launching a new application. This avoids the need to shut down the VM to give it a new role.

There is still a benefit to ballooning support even if it's not exposed to userspace within the VM, because VMs aren't always used purely to host a single infinitely-long-lived application without outside intervention.


Linux supports hot-attaching RAM, or the VMM could support memory ballooning. VMs don't necessarily need to all be backed by physical RAM.


Slightly off topic, but regarding the larger memory footprint of Kata containers, what is your opinion on KSM effectiveness in general for VMs?


We had a lot of reports of ksm/ksmtuned consuming a lot of CPU and not making a lot of difference. I think it works well for certain workloads, and can be quite pessimal for others. There are also security concerns because you can leak information about (eg) what glibc is being used by another tenant using timing attacks. So you'd probably want to turn it off if multiple tenants can be using a single node.
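(Worth noting for anyone benchmarking it: KSM only scans regions a process has explicitly marked mergeable, which is why it's mostly a VMM thing - QEMU marks guest RAM this way. A sketch of the opt-in, assuming a kernel with CONFIG_KSM and ksm enabled via /sys/kernel/mm/ksm/run:)

    /* Opt a region into KSM scanning, as a VMM does for guest RAM. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 16 * 1024 * 1024;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* ksmd will scan these pages and merge duplicates
           copy-on-write with identical pages elsewhere. */
        if (madvise(p, len, MADV_MERGEABLE) != 0) { perror("madvise"); return 1; }

        /* ... guest RAM would live here ... */
        return 0;
    }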


There's probably a big asterisk there. The correct term is probably "fast enough".

virtualization adds very overhead; a Windows VM running with a dedicated GPU can get 95% of the host's score on 3DMark.

The biggest issue in these cases is I/O, which can be handled in a few ways.


> virtualization adds very overhead

Missing word?

"Very little"? 5% is enough to turn this year's high-end machine into last year's model.

Nested virtualisation is also a thing now, and 5% per layer adds up fast.


The short version is that kernels support guest/host relationships natively so that guests can pass operations directly to the host without having to go through an additional system call. Everywhere you do this is attack surface where an attacker in the guest can communicate with privileged facilities, so you want to minimize this where you can.

There's usually overhead in the places where the communication requires an additional hop. If you want your host filesystem isolated you're going to need a translation layer and it will be slower. If you're willing to open up your host OS's filesystem, you can basically get ~0 overhead.


There is pretty low overhead if you are opinionated - this is very similar to Firecracker (AWS) tooling: a cut-down hypervisor with ~0 devices, and a cut-down guest OS means pretty quick boot times.


Yeah I'd like to see some numbers on that, like startup time.


Depending on your use case there's potentially negligible startup time - on the scale of single-digit seconds down to less than half a second, depending on how much work you put into optimizing it. For some applications this will be too slow (mainly the type where you boot a container per request, although fly.io seems to make it work), but I think for a _lot_ of applications this wouldn't be noticed.

Kata gives you a few different options for what/how you'd like to boot, including Firecracker.

This isn't exclusive to Firecracker, but if you stay lightweight you can have VMs booting in under half a second if you're using slim images.

https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-in-l...

I honestly think that for a lot of general use cases, VMs with the convenience/orchestration tools of containers make more sense simply because of the security benefits. The convenience still needs some work though.


Unless you're dealing with a multi-tenant situation I'm not super convinced that a VM is worth the effort. It's not the perf; it's the need to build your kernel, root filesystem, and the other infra to make it all work.

Compare that to a Docker container, where there's basically zero additional work to be done to be up and running.

For most cases I'd be really tempted to work on hardening the Docker container rather than setting up a VM. Things like AppArmor and seccomp in particular would likely go a very long way.
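To give a sense of what seccomp buys you: with Docker it's applied as a JSON profile (--security-opt seccomp=profile.json), but underneath it's just a syscall filter the process carries. A minimal in-process sketch using libseccomp (an assumption here: libseccomp installed; build with -lseccomp):

    /* Deny-by-default syscall filter. Docker installs the same kind
       of filter from its JSON profile before exec'ing the workload. */
    #include <errno.h>
    #include <unistd.h>
    #include <seccomp.h>

    int main(void) {
        /* Default action: refuse anything not whitelisted with EPERM
           (SCMP_ACT_KILL is stricter but makes the demo die silently). */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
        if (!ctx) return 1;

        /* Allow only what this toy program needs. */
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

        if (seccomp_load(ctx) != 0) return 1;

        write(1, "write() still allowed\n", 22); /* passes the filter */
        /* Anything else - open(), connect(), ptrace() - now gets EPERM. */
        return 0;
    }

Docker's default profile already blocks a bunch of syscalls; tightening it per workload is where most of the extra mileage comes from.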



