> Each operating system essentially gets a fixed allocation of RAM, something like 32-48 GB. This can lead to quite a lot of wasted resources, VM #3 may really need 64 GB instead of 48 GB at a point where VM #4 may have 24 GB to spare.
I don't know what virtualisation the author has in mind, but static RAM allocation isn't a requirement for VMs. VirtIO's memory balloon can be used to dynamically grow and shrink available memory. It's not exactly something you can do willy-nilly, but when a VM has a large amount of free RAM (like 24GB) you can definitely use ballooning to temporarily reassign memory capacity.
Since the author is talking about using Debian, they could opt for Proxmox to handle ballooning for them. Without Proxmox, scripts similar to this (https://github.com/berkerogluu/auto-ballooning-kvm/blob/mast...) could also be used to control memory distribution, though I imagine a server like this needs something a little more sophisticated.
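For a rough idea of what such a script does, here's a minimal sketch in the same spirit (the domain names, watermarks and step size are invented; it assumes `virsh` on the PATH and guests with the virtio-balloon driver and guest agent installed):

    # Toy auto-balloon loop: poll each guest's free RAM and nudge the balloon.
    import subprocess, time

    DOMAINS = ["vm1", "vm2", "vm3", "vm4"]     # hypothetical VM names
    LOW_KIB  = 4  * 1024 * 1024                # grow if guest has < 4 GiB unused
    HIGH_KIB = 24 * 1024 * 1024                # shrink if guest has > 24 GiB unused
    STEP_KIB = 2  * 1024 * 1024                # adjust in 2 GiB steps

    def memstats(dom):
        # `virsh dommemstat` prints "stat value" pairs, values in KiB.
        out = subprocess.check_output(["virsh", "dommemstat", dom], text=True)
        return {k: int(v) for k, v in (l.split() for l in out.splitlines() if l.strip())}

    def set_balloon(dom, kib):
        # Retarget the balloon of a running domain (KiB is virsh's default unit;
        # this can never exceed the domain's configured maximum memory).
        subprocess.check_call(["virsh", "setmem", dom, str(kib), "--live"])

    while True:
        for dom in DOMAINS:
            s = memstats(dom)
            actual, unused = s.get("actual", 0), s.get("unused", 0)
            if unused and unused < LOW_KIB:
                set_balloon(dom, actual + STEP_KIB)    # give the guest more RAM
            elif unused > HIGH_KIB:
                set_balloon(dom, actual - STEP_KIB)    # reclaim idle RAM
        time.sleep(10)

A real version would also have to respect per-VM maximums and keep some headroom for the host itself, which is presumably where the "something a little more sophisticated" comes in.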
Since the author is running dual processors, they will need to account for NUMA domains. That is, half of the RAM is assigned to each CPU, and performance will suffer if a thread on CPU 0 is accessing memory belonging to CPU 1. numactl[0] can be used to bind a process to a specific NUMA node on bare metal, but a hypervisor can also set CPU/memory affinity for a given VM without fiddling about at the process level.
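As a rough illustration of both options (the shard command, domain name, node layout and CPU numbering below are made up; it assumes numactl and a libvirt/KVM setup):

    import subprocess

    # Bare metal: run a shard with its CPU and memory pinned to NUMA node 0.
    subprocess.run(["numactl", "--cpunodebind=0", "--membind=0",
                    "java", "-jar", "index-shard.jar"])

    # Libvirt/KVM: bind a running VM's memory to node 0 and pin its vCPUs
    # onto that socket's cores, so guest threads never cross the socket.
    subprocess.run(["virsh", "numatune", "vm1", "--mode", "strict",
                    "--nodeset", "0", "--live"])
    for vcpu in range(16):                       # assumes a 16-vCPU guest
        subprocess.run(["virsh", "vcpupin", "vm1", str(vcpu), str(vcpu)])

`numactl --hardware` shows how the RAM and cores are actually split between the two nodes.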
Interesting. Would this still allow the virtualized machines to manage their own page tables as though native? I can't say I understand how it's implemented, but I'm concerned it might leave the host system with the same page thrashing the choice to use virtualization was intended to avoid.
You could look at LXC, which is a container setup, for isolation; adding memory doesn't require a restart of the container. You may also wish to examine ZFS or, going farther afield, ZFS on FreeBSD with or without jails/bhyve.
They're trying to use VMs to run multiple page table managers on one single host. Containerization will use the same kernel's page management, so I don't think that'll solve the author's problem.
I don't know if this has any impact on the page tables, but the way it works is relatively simple: the hypervisor communicates with a driver inside the VM and tells it to "allocate" large amounts of memory. How much depends on the configuration; if you tell the hypervisor to configure a minimum of 16GB and a maximum of 48GB, then the driver will "reserve" 32GB of RAM. To the VM, it's as if some kind of driver is taking up 32GB of RAM, while in reality the hypervisor just doesn't allocate that RAM to the VM.
Then, when the VM runs out of memory, the virtio driver starts releasing pages for as long as the hypervisor lets it, slowly increasing the amount of RAM allocated to the virtual machine. This works until the hypervisor runs out of memory to allocate, after which the driver inside the VM will refuse to release pages and the VM's OOM killer will kick in.
The default is grow only (start with the minimum amount of RAM and only release more to the VM as time goes by), but it's possible to ask the driver to start "allocating" memory again with enhancements like the ones Proxmox provides, or by scripting it yourself. With the guest tools installed, the hypervisor knows exactly how much free RAM each VM has.
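On a plain libvirt/KVM host, the 16GB/48GB split above corresponds roughly to something like this (the domain name is hypothetical; sizes are in KiB, virsh's default unit, and changing the maximum normally requires the domain to be shut off):

    import subprocess

    GIB = 1024 * 1024                                # KiB per GiB
    # Ceiling the balloon can grow to:
    subprocess.run(["virsh", "setmaxmem", "vm3", str(48 * GIB), "--config"])
    # Boot with the rest "ballooned away":
    subprocess.run(["virsh", "setmem",    "vm3", str(16 * GIB), "--config"])
    # With the guest tools running, `virsh dommemstat vm3` then reports how much
    # of its RAM the guest actually has free, which is what rebalancing keys on.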
This shouldn't cause problems for page tables on the guest side as far as I know. However, certain workloads may not work well; if the VM allocates memory faster than the balloon driver can release it, you'll get out of memory errors. Similarly, quick allocate-free cycles will prevent any auto-shrink mechanism from working right.
You could try running an experiment if you wish; the balloon driver works just as well whether you're allocating 16GB of RAM or 512MB. You could run a bunch of VMs with a model of the memory allocation you expect on your desktop and see if you end up page thrashing like you fear.
That depends on your setup. Ballooning is just a mechanism for moving memory from the guest to the host; if you move too much memory you'll get thrashing. The bigger problem is I/O indirection through the host; for I/O-heavy workloads you'll want to use PCI passthrough for best performance, which (assuming it's working properly) should be identical to native performance.
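For example, handing an NVMe drive (or an HBA) straight to a guest with VFIO looks roughly like this under libvirt (the PCI address and domain name are invented; it assumes the IOMMU is enabled and the device sits in its own IOMMU group):

    import subprocess, tempfile

    # managed='yes' lets libvirt detach the device from its host driver and
    # hand it to VFIO automatically when the guest claims it.
    hostdev = """<hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
    </hostdev>"""

    with tempfile.NamedTemporaryFile("w", suffix=".xml") as f:
        f.write(hostdev)
        f.flush()
        subprocess.run(["virsh", "attach-device", "vm1", f.name, "--live"], check=True)

Once attached, the guest talks to the raw device, so its disk I/O no longer funnels through the host's block layer.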
> The software is based around memory mapped storage, and Linux’ page fault handler can only put up with so many page faults at any given time.
> A potential way around this is virtualization, to run multiple operating systems on the same machine.
What’s the actual issue? Linux can have trouble servicing many page faults in parallel in a single process (technically, a single mm) due to lock contention. Multiple processes should reduce this contention.
It's a fairly hypothetical problem, like it may not even be a problem, other than the intuition that having 8 applications all thrashing wildly and competing for the same page table may not be the most performant way of allocating resources.
> having 8 applications all thrashing wildly and competing for the same page table may not be the most performant way of allocating resources.
They’re not, though. There’s a tree of “page tables” (that is, a tree with branching factor 512 or so, in the format used by the CPU) per process. Also, per process, there’s a tree of VMAs (the logical maps from contiguous virtual address ranges to whatever logically backs them) — these are created by mmap and friends. And, regrettably, a lock, also per process (although this lock is a read-write lock, and page faults are reads).
If you have a whole bunch of processes mmapping the same file and thrashing against it, you could end up with contention for that file's data structures that track mappings, but that seems unlikely. Mappings of different pages of ordinary memory should scale well.
And a VM, for this purpose, is more or less like a process. QEMU (or whatever other userspace host you use) literally maps everything that the VM logically maps, and VM faults are handled as though QEMU triggered a page fault.
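A toy way to see the "one mm per process" point (the file name and counts are invented; this only models the access pattern, not the real workload): each worker below is a separate process, so its random reads fault against its own page tables and its own per-process lock rather than a shared one.

    import mmap, os, random
    from multiprocessing import Process

    SHARD_FILE = "index.dat"                   # hypothetical memory-mapped file
    PAGE = os.sysconf("SC_PAGE_SIZE")

    def worker(seed, touches=100_000):
        rng = random.Random(seed)
        with open(SHARD_FILE, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            m = mmap.mmap(f.fileno(), size, prot=mmap.PROT_READ)
            total = 0
            for _ in range(touches):
                # Each read of a cold page triggers a fault in this process's own mm.
                total += m[rng.randrange(size // PAGE) * PAGE]

    if __name__ == "__main__":
        procs = [Process(target=worker, args=(i,)) for i in range(8)]
        for p in procs: p.start()
        for p in procs: p.join()

Running the same workers as threads of a single process would put all the faults into one mm, which is the contention case discussed above.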
Have you thought of experimenting/benchmarking with two or three different architectures right now before you settle on one?
As you wrote in the article, you still have the old server that can support your current load. That likely won’t be an option in the future as your load continues to grow.
This is sort of what I'm doing, except not really considering virtualization as a serious option. I've worked with it a lot in the past and it's been awkward and annoying. I'll go as far as containers, but I'm seriously not seeing what virtualization would add here other than obstacles.
I'm also feeling out what's a way of working with this machine that isn't a huge pain in the ass. When you've got one instance running on one machine, manual deployment is fine, but I think something more CI-driven is probably going to be necessary to keep sane with 8 index shards and a test environment as well.
Kubernetes would be an option but I have bad experiences with that too. A bit too much spooky action at a distance for my taste. Whole ecosystem feels very fragile and churny in a way I'm not very happy with, and the abstractions designed for hiding away the complexities of dealing with a cluster make running it on a single machine where those abstractions aren't necessary just pointlessly awkward.
What, you mean you're not using Kubernetes, Helm, Redis, Memcached, RabbitMQ, Terraform, three different Apache projects, cloud-managed Postgres, S3, and 10 microservices?
How unprofessional! It's like you're building something efficient that isn't going to give massive amounts of money to cloud providers.
@marginalia_nu: Definitely not saying this should be a top priority to fix, but I tried to look at the source out of interest, and the Git repo link in “Feel free to poke about in the source code or contribute to the development” on the search page is currently 404ing.
Hmm, should be fixed now. I used to run my own git forge, but moved over to github, leaving the git.marginalia.nu domain with a redirect that was supposed to point over to the github repository, but apparently it didn't quite work for that link for some reason.
It's a philosophical thing. I want to own my presence on the web, not rent it. So even when I'm using 3rd party services, I want a subdomain I own pointing at them, so you can say "here's the authoritative link", and if microsoft goes and enshittens github, I can just redirect it to codeberg or whatever comes next.
I think it works now though. It's a pain to migrate several dozen nginx sites, especially with browsers helpfully remembering stale redirects for weeks.
between mandatory 2FA "for your safety", the copilot scandal, and the way M$ treated paying mojang customers who refused to make microsoft accounts - github's future looks bleak.
IMHO it's just a matter of time before the inevitable email: "all github accounts are being migrated to microsoft accounts. your github credentials will no longer work after mm/dd/2026. please migrate to continue using github."
> I’ll also apologize if this post is a bit chaotic.
Loved this, didn’t find it chaotic at all.
Not sure if I missed it, but how are you planning on moving data from the old to the new storage? Do you have any concerns with corruption at that stage (validation)?
It's already moved over. The data can be reduced to a fairly compressed form where it's about 1.6 TB in total, so it's easy to just tarball and check with an md5sum that it's the same on both ends.
As described in the post, a lot of the data is also heavily redundant so even if something goes wrong in one or a few places, the missing parts can be reconstructed from the rest.
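For reference, the "same on both ends" check is just a streamed hash over the tarball; a minimal version (file name invented):

    import hashlib

    def md5_of(path, chunk=1024 * 1024):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    # Run on both machines and compare the two digests.
    print(md5_of("index-backup.tar"))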
New server is roundabout $20,000 if you were to pay for the free CPU upgrade. Old server was about $5,000.
It's really hard to say how much faster it is going to be, but it's definitely much faster than the old server. I was not really having performance problems before either, though. The main obstacle was just dealing with insane volumes of data with limited RAM and disk.