I think speed is definitely a big part of this, though the speedup comes primarily from getting rid of superfluous calls to RDRAND. Blake2s is already in the kernel (after a fashion) for WireGuard itself; I don't think Blake3 is. An additional nerdy point here is that the extraction phase of the LKRNG is already using ChaCha (there's a sense in which a CSPRNG is really just the keystream of a stream cipher), and Blake2s and ChaCha are closely related.
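To make the "a CSPRNG is really just the keystream of a stream cipher" point concrete, here is a minimal sketch in Python. The block function follows ChaCha20 as specified in RFC 8439; the `ToyKeystreamRng` wrapper is a hypothetical name invented for illustration. This is not the kernel's actual construction (it does no reseeding or key erasure) and it should not be used for anything real.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def _quarter_round(s, a, b, c, d):
    """ChaCha quarter round applied in place to state list s."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """Return one 64-byte ChaCha20 keystream block (RFC 8439 layout)."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("key must be 32 bytes and nonce 12 bytes")
    init = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]
    init += struct.unpack("<8I", key)
    init.append(counter & MASK32)
    init += struct.unpack("<3I", nonce)

    s = list(init)
    for _ in range(10):  # 10 double rounds = 20 rounds
        # Column rounds
        _quarter_round(s, 0, 4, 8, 12)
        _quarter_round(s, 1, 5, 9, 13)
        _quarter_round(s, 2, 6, 10, 14)
        _quarter_round(s, 3, 7, 11, 15)
        # Diagonal rounds
        _quarter_round(s, 0, 5, 10, 15)
        _quarter_round(s, 1, 6, 11, 12)
        _quarter_round(s, 2, 7, 8, 13)
        _quarter_round(s, 3, 4, 9, 14)

    out = [(w + i) & MASK32 for w, i in zip(s, init)]
    return struct.pack("<16I", *out)


class ToyKeystreamRng:
    """Hypothetical toy: 'random' bytes are just successive keystream blocks."""

    def __init__(self, seed):
        self._key = seed          # 32-byte seed plays the role of the cipher key
        self._nonce = bytes(12)   # fixed all-zero nonce (assumption for the toy)
        self._counter = 0
        self._buf = b""

    def random_bytes(self, n):
        # Generate blocks until enough keystream is buffered, then hand out n bytes.
        while len(self._buf) < n:
            self._buf += chacha20_block(self._key, self._counter, self._nonce)
            self._counter += 1
        out, self._buf = self._buf[:n], self._buf[n:]
        return out
```

Usage would look like `ToyKeystreamRng(seed).random_bytes(16)`, where `seed` is 32 bytes from some real entropy source. Everything after the seed is deterministic, which is exactly the "keystream" view of a CSPRNG.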
So should we expect a switch to BLAKE3 sometime in the future? It seems to be a refinement of BLAKE2 that makes it more amenable to optimization while keeping most (all?) of its qualities. Being well suited to optimization across architectures would also make it ideal for the kernel, and it seems the reference implementation has already done a lot of the heavy lifting.
Just in case this isn't clear, BLAKE3 breaks a single large input up into many chunks, and it hashes those chunks in parallel. The caller doesn't need to provide multiple separate inputs to take advantage of the SIMD optimizations. (If you do have multiple separate inputs, you can actually use similar SIMD optimizations with almost any hash function. But because this situation is rare, libraries that provide this sort of API are also rare. Here's one of mine: https://docs.rs/blake2s_simd/1.0.0/blake2s_simd/many/index.h....)
https://github.com/BLAKE3-team/BLAKE3