A properly written VMM/hypervisor should have no attack surface. LXD won't be that, since it exposes a REST API, which is already one attack surface.
DoS is still a problem, but containers should provide mitigation for that. You could make the VMM prevent DoS, but it's better to keep the VMM small and light.
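To make "containers mitigate DoS" concrete, here is a minimal sketch of the kind of resource caps containers get from cgroups (this is not LXD's actual code; the group name "ct1" is hypothetical, and the paths assume cgroup v2 mounted at /sys/fs/cgroup, running as root):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Write a single limit value into a cgroup control file. */
static int write_limit(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    /* mkdir on cgroupfs creates a new cgroup (no-op if it exists). */
    mkdir("/sys/fs/cgroup/ct1", 0755);

    write_limit("/sys/fs/cgroup/ct1/memory.max", "512M");      /* cap RAM usage      */
    write_limit("/sys/fs/cgroup/ct1/pids.max",   "256");       /* blunt a fork bomb  */
    write_limit("/sys/fs/cgroup/ct1/cpu.max", "50000 100000"); /* ~50% of one CPU    */
    return 0;
}
```

Any process moved into that group (by writing its PID to cgroup.procs) can then exhaust only its own budget, not the host's.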
As for local kernel privilege escalation, yes, such an exploit would still run, but it might not matter. In theory, the VMM can isolate all virtual machine resources such that rooting a VM only gives you that VM. I can't figure out how they would extend that protection to containers yet, since VT-x was made for full virtual machines and containers share a kernel.
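For what "rooting a VM only gives you that VM" means in practice, here is a minimal sketch using the stock KVM ioctl API (again, not LXD's mechanism): each VM is its own file descriptor, and its guest-physical memory is only whatever pages the VMM installs, with EPT/NPT enforcing the boundary in hardware.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0) { perror("/dev/kvm"); return 1; }

    int vm = ioctl(kvm, KVM_CREATE_VM, 0); /* one fd == one isolated VM */
    if (vm < 0) { perror("KVM_CREATE_VM"); return 1; }

    /* Back 2 MiB of guest-physical address space with anonymous host
     * pages. The guest can address *only* memory installed this way;
     * the EPT/NPT page tables enforce that boundary in hardware. */
    size_t size = 2 << 20;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,
        .memory_size     = size,
        .userspace_addr  = (unsigned long)mem,
    };
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        perror("KVM_SET_USER_MEMORY_REGION");
        return 1;
    }
    puts("VM created; a rooted guest can touch only these pages");
    return 0;
}
```

Containers have no equivalent hardware boundary, which is exactly the open question above.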
Well -- 'no attack surface' might be simplifying things a bit too much, as you do need a way to interact with the hypervisor or the privileged host to actually get your data written to disk and your network packets out on the wire.
Each such interaction can contain bugs, some of which might be exploitable.
Even with Xen or KVM you do have an attack surface:
* guests can send network packets to the host, which exercises the networking code on the host. If that's exploitable, you get to execute code on, or DoS, the host. Hopefully not, because then so could any other remote machine.
* guests can execute instructions which get emulated or need extra privilege checks done in the hypervisor. See the recent Xen vulnerability involving MSRs.
* guests execute hypercalls, which obviously interact with the hypervisor. Bugs here, if exploitable, can be nasty. (A minimal guest-side sketch follows this list.)
* guests need to read/write their data to disk. Are we sure they can't read the data of a (possibly already deleted) other VM?
* guests read/write from memory ... was the memory of previously deleted/crashed/migrated guests properly scrubbed? Can any of the hypercalls/etc. be used to read another guest's memory, or access uninitialized memory containing pieces from old guests?
...
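As promised above, here is what the hypercall surface looks like from the guest side on x86 KVM, per the kernel's documented hypercall ABI (number in RAX, arguments in RBX/RCX/..., `vmcall` traps into the hypervisor). The hypercall number 42 is deliberately bogus; the point is that every register here is attacker-controlled input to privileged code. Outside a KVM guest, `vmcall` raises #UD (SIGILL).

```c
#include <stdio.h>

/* Issue a one-argument KVM-style hypercall: nr in RAX, arg0 in RBX,
 * result back in RAX. vmcall transfers control to the hypervisor. */
static long hypercall1(long nr, long arg0)
{
    long ret;
    asm volatile("vmcall"
                 : "=a"(ret)
                 : "a"(nr), "b"(arg0)
                 : "memory");
    return ret;
}

int main(void)
{
    /* An unknown number like 42 should just get an error back, but a
     * buggy handler parsing these values is where exploits live. */
    long ret = hypercall1(42, 0);
    printf("hypercall returned %ld\n", ret);
    return 0;
}
```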
Of course the attack surface of a hypervisor is smaller than that of a full kernel (where you also have a lot of syscalls, disk formats, etc.), but that doesn't mean hypervisors are suddenly bulletproof.
The question is where LXD stands, from a security PoV, among these simplified categories (no order implied):
- running multiple different processes as same user
- running processes in different LXC containers as root-in-container on same host
- running processes in different LXC containers as non-root on same host
- running multiple processes as different users
- running root processes in different KVM VMs on same host
- running non-root processes in different KVM VMs on same host
- running root processes in different Xen/domU VMs on same host
- running non-root processes in different Xen/domU VMs on same host
- ...
Or, in other words: if you get an account/container/VM on a shared machine from a hosting provider using technology X, how does that compare to getting an LXD container from a hosting provider?
(provided that other unknown users can run LXD containers on the same machine as yours).
"No attack surface" was definitely simplifying too much, especially for LXD. I think I was trying to say that it is not impossible to mitigate those attacks.
In the pure sense, a hypervisor doesn't need to do anything except create a virtual machine. It doesn't need a way to interact with a user, or even with the VM once it is created. I have written a bare-metal, type-1 hypervisor that did nothing but keylog. The guest never made a hypercall and wasn't aware that it was a guest at all. Side note: I'm not an expert; hypervisor research is just for fun.
We know there is an attack surface on LXD immediately because of the REST API and its interaction with containers. Any resource mediation also exposes an attack surface. Resource mediation is difficult, but not impossible. The attack surface really depends on implementation.
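To illustrate how immediate that REST surface is: anything that can reach the daemon's control socket can drive it. A sketch using libcurl (>= 7.40 for unix-socket support), assuming LXD's classic socket path; GET /1.0 is the API's server-info endpoint:

```c
/* Build with: cc lxd_poke.c -lcurl */
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *c = curl_easy_init();
    if (!c) return 1;

    /* Talk HTTP over LXD's local unix socket instead of TCP. */
    curl_easy_setopt(c, CURLOPT_UNIX_SOCKET_PATH, "/var/lib/lxd/unix.socket");
    curl_easy_setopt(c, CURLOPT_URL, "http://lxd/1.0"); /* host part is only a label */

    CURLcode rc = curl_easy_perform(c); /* response JSON goes to stdout */

    curl_easy_cleanup(c);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

Every handler behind that socket (and behind the network-facing API, once enabled) is code that parses untrusted input.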
With my limited knowledge of the Linux kernel, I can imagine a kernel running in its own VM, a VM for every container, and every container sharing read-only access to the single kernel. Each container could also be isolated via the same memory protection. I don't know enough to say whether that's possible. I think you're more knowledgeable than I am about LXC and the kernel in general. Any thoughts on this?
I'm not worried about memory protection; there is HW support for that, and it can be done.
I'm slightly more worried about making sure that separate containers can't access each other's disks (via symlinks/hardlinks or overflowing some FS structures).
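For the symlink part of that worry, there is at least a clean defense on newer kernels (>= 5.6): openat2(2) with RESOLVE_BENEATH confines path resolution to a directory tree, so a symlink planted by one container can't redirect a privileged open onto the host or another container's disk. A minimal sketch, with a hypothetical rootfs path:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical per-container rootfs directory. */
    int rootfs = open("/var/lib/containers/c1/rootfs",
                      O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (rootfs < 0) { perror("open rootfs"); return 1; }

    struct open_how how;
    memset(&how, 0, sizeof(how));
    how.flags   = O_RDONLY;
    how.resolve = RESOLVE_BENEATH | RESOLVE_NO_MAGICLINKS;

    /* Even if etc/passwd is a symlink to /etc/passwd on the host (or
     * into another container), resolution is confined to rootfs and
     * the open fails with EXDEV instead of following the link out. */
    long fd = syscall(SYS_openat2, rootfs, "etc/passwd", &how, sizeof(how));
    if (fd < 0)
        perror("openat2");
    else
        close((int)fd);
    close(rootfs);
    return 0;
}
```

Hardlinks and FS-structure overflows still need separate answers (per-container filesystems or quotas, and a fuzz-hardened FS implementation, respectively).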
And I'm worried about the privileged kernel/hypervisor parsing/interpreting data from the unprivileged container.
In that sense the situation is not much different from a server: if you can exploit a bug in the server you can run/perform actions with the server's privileges.
Same situation with the kernel.
I'd wait until there are some more design/architecture docs about what LXD is exactly to say more though.