> Windows goes through a mandatory process whereby it identifies hardware, installs and configures drivers etc. etc. etc. You can inject a script to carry out the process of things like setting up a user, and generating a password, but that's fairly minor on the scale of things. This first process requires a reboot. There's no escaping it.
This is the "OOBE" (out of box experience) phase. I'd have thought they'd just skip it and provision you a pre-warmed image... but I guess because there's a license/activation dependency they can't do that?
I wonder how many MWh could be saved by Microsoft adding an "acquire cloud volume license at boot" mode. Or shoving the licensing/uniquification requirements into the platform TPM.
OOBE occurs after this; the hardware identification phase is what takes the longest. It has to reboot to load drivers, and it takes 5-10 minutes to identify which hardware in the machine needs drivers, etc. Once you get to the blue "Just a moment..." screen, you can inject a script to activate Windows and get to the desktop.
The license activation stage is negligible, and doesn't need to be done for several days or more. The actual activation of Windows doesn't really enter into the story here.
Disclosure and claim to authority: I work on the Windows team at Microsoft, sometimes on performance and OS installation stuff.
There's definitely some cruft that chews up time on first boot. But it's not everybody's favorite punching-bag, licensing. That stuff doesn't happen in the critical boot path.
It might be installation of device drivers, but that too is unlikely. If you generalize Windows in a VM, you can use the `sysprep.exe /mode:vm` flag, which essentially tells sysprep to retain most of the device tree, since you expect to run the thing on similar hardware. I would assume that AWS is clever enough to have found that flag; certainly we have Azure use it. When the flag is used, there's very little device- and driver-related work to do on first boot after generalization.
The reality is that software is complicated and hard, and anything punchy enough to fit into a comment on a website is going to be a vast simplification of reality. So let the simplification begin :)
One reason first boot is slow is the component that orchestrates startup of usermode services, which on Windows is called SCM. SCM is very old. At the time SCM was created, it was much better than the SysV-style init scripts of other OSes. But since then, other OSes leapfrogged Windows with systemd/launchd, which are a generation ahead of SCM. SCM starts services in serial, while systemd maximizes parallelization. SCM has a "push" model: it basically starts all the services that it can find, while systemd has a "pull" model: it starts just the dependency cone you need to get the system you want. (This is a simplification.)
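To make the push/pull distinction concrete, here's a minimal sketch in Python. It is not how SCM or systemd is actually implemented; the service names and the dependency graph are made up for illustration. The "push" function starts everything it knows about, while the "pull" function starts only the dependency cone of the one target you asked for.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical service graph: each service maps to the services it depends on.
# These names are invented for illustration; they are not real SCM or systemd units.
DEPENDENCIES = {
    "network": set(),
    "logging": set(),
    "dns": {"network"},
    "sshd": {"network", "logging"},
    "print-spooler": {"logging"},
    "fax": {"print-spooler"},
}


def push_model(deps):
    """Start every known service, in an order that respects dependencies."""
    return list(TopologicalSorter(deps).static_order())


def dependency_cone(deps, target):
    """Return the target plus everything it transitively depends on."""
    cone, stack = set(), [target]
    while stack:
        service = stack.pop()
        if service not in cone:
            cone.add(service)
            stack.extend(deps[service])
    return cone


def pull_model(deps, target):
    """Start only the services the target needs, dependencies first."""
    cone = dependency_cone(deps, target)
    subgraph = {service: deps[service] for service in cone}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print("push:", push_model(DEPENDENCIES))
    print("pull:", pull_model(DEPENDENCIES, "sshd"))
```

In this toy graph, pulling `sshd` starts three services and never touches `dns`, `print-spooler`, or `fax`, whereas the push model starts all six.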
Another performance issue is that Windows doesn't have a way to notify code that the hostname has changed. Obviously it'd be easy to add one, but then the hard part would be updating the whole OS to do something reasonable with that notification. So instead, Windows requires a reboot to change the hostname. Except first boot: to avoid a reboot as soon as you power on your shiny new computer, there's a clumsy dance where the OS holds back most usermode processes until the hostname is set, then it sort of tries booting usermode again. (Huge simplification!)
Thirdly, the footprint of Windows is just bigger than that of an expertly hand-tuned Linux installation. Much of this problem was solved with Nano Server... but are you actually using Nano Server? It turns out that people like Windows because Windows runs Windows programs. Take away compatibility with many Windows programs, like Nano Server did, and you get a much faster and more secure OS that nobody's heard of.
We take both perf and cloud hosting seriously, and we're working on problems in this space. You should expect Windows to get better with each release. But to close this off, I don't want to hog all the blame. It's always possible that AWS is doing something silly in their guest agent or paravirtualization stack that measurably degrades boot perf. We've previously caught Azure doing silly things -- now fixed -- that seriously increased the time before the guest reported itself as ready. If you want to see Windows hosting done well, try Azure.
> you can use the `sysprep.exe /mode:vm` flag, which essentially tells sysprep to retain most of the device tree, since you expect to run the thing on similar hardware
In my experience that only really works with fully paravirtualized environments. If you start mixing in SR-IOV, things can get a little messy. Given that customers actually crave high-performing networks, that then presents you with a choice:
1) Make two images: one for fully paravirtualized environments and one for SR-IOV.
2) Ignore it and let Windows first boot take longer.
Of course most clouds already end up with what are effectively multiple images for the same setup/configuration of Windows, one per hardware type anyway, because even with full PV things can get a bit strange, and you're often ending up with blue screens during first boot.
They're not going to want to _double_ that number. Even with full automation that's a bunch more things that can go wrong, more operational burden etc. etc. Where's the value proposition? Windows provisions a little faster?
Linux images rarely need to be produced for different hardware types / environments. They just spin up and away you go. As to your systemd/SysV comment: even SysV-based instances have a time-to-login on first boot of under a minute.
The systemd developers were obsessed with the idea that parallelism would speed up the boot process, but it doesn't make as significant a performance impact as they'd have you believe, especially when you're talking about cloud images that rarely have many services running on first boot. Even if you go trawling down the systemd boot time reporting, you'll see that most components start in fractions of a second, and the same was true under SysV too.
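A rough back-of-the-envelope sketch of that argument, in Python. The durations and dependencies below are invented for illustration, not measurements from any real image. Serial startup costs the sum of all durations; fully parallel startup costs the longest dependency chain.

```python
from functools import lru_cache

# Made-up startup durations in seconds for a small cloud image (assumption).
DURATION = {"network": 0.3, "logging": 0.1, "dns": 0.2, "sshd": 0.2}

# Made-up dependencies: each service lists what must be up before it starts.
DEPENDS_ON = {
    "network": [],
    "logging": [],
    "dns": ["network"],
    "sshd": ["network", "logging"],
}


def serial_time():
    """Serial startup: every service waits for the previous one."""
    return sum(DURATION.values())


@lru_cache(maxsize=None)
def finish_time(service):
    """Earliest finish time if a service starts as soon as its dependencies are up."""
    ready_at = max((finish_time(dep) for dep in DEPENDS_ON[service]), default=0.0)
    return ready_at + DURATION[service]


def parallel_time():
    """Ideal parallel startup: bounded by the longest dependency chain."""
    return max(finish_time(service) for service in DURATION)


if __name__ == "__main__":
    print(f"serial:   {serial_time():.1f}s")
    print(f"parallel: {parallel_time():.1f}s")
```

With these toy numbers the gap between the two is a few tenths of a second, which is noise next to a first boot measured in minutes.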