True, but the use of SIMD is also restricted in the kernel, so that would not matter for the Linux kernel.
In comparison with the hardware instructions for SHA-2 or SHA-3, on those CPUs which have them, BLAKE3 is not faster, because those instructions also use the same SIMD registers, processing the same number of bits per operation.
I use BLAKE3 every day for file checksumming. For this application, on my Zen 3 with hardware SHA-2, BLAKE3 greatly outperforms everything else, but only due to multithreading.
On older CPUs without SIMD SHA instructions, you are right that BLAKE3 can be faster than other algorithms even when single-threaded, because a SIMD implementation can exploit the parallelism built into the BLAKE3 algorithm.
For other hashes, SIMD may not accelerate the computation of a single hash, but when you need to compute multiple hashes you can interleave their computations and obtain similar speedups with SIMD instructions.
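To make the interleaving idea concrete, here is a minimal sketch of the "one hash per lane" layout. It assumes equal-length messages and uses 32-bit FNV-1a purely as a stand-in (it is not a cryptographic hash, and the function names are made up for illustration). Pure Python will not actually emit SIMD instructions; the point is only the lockstep structure, where each step applies the same operation to every lane, which is what a vectorized C or assembly implementation would map onto SIMD registers.

```python
# Toy illustration of interleaving several independent hash computations.
# FNV-1a (32-bit) stands in for a real hash; the names are hypothetical.

FNV_OFFSET = 0x811C9DC5  # FNV-1a 32-bit offset basis
FNV_PRIME = 0x01000193   # FNV-1a 32-bit prime
MASK32 = 0xFFFFFFFF


def fnv1a_scalar(data: bytes) -> int:
    """Reference: hash one message, one byte at a time."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & MASK32
    return h


def fnv1a_interleaved(messages: list[bytes]) -> list[int]:
    """Hash several equal-length messages in lockstep, one lane per message."""
    if not messages:
        return []
    length = len(messages[0])
    if any(len(m) != length for m in messages):
        raise ValueError("this sketch assumes equal-length messages")

    lanes = [FNV_OFFSET] * len(messages)  # one hash state per lane
    for i in range(length):
        # Gather byte i of every message: one "vector" of inputs.
        column = [m[i] for m in messages]
        # Same xor-multiply step on every lane; with real SIMD this
        # would be one vector xor and one vector multiply.
        lanes = [((h ^ b) * FNV_PRIME) & MASK32 for h, b in zip(lanes, column)]
    return lanes


if __name__ == "__main__":
    msgs = [b"alpha123", b"bravo456", b"delta789", b"gamma000"]
    for m, h in zip(msgs, fnv1a_interleaved(msgs)):
        print(m.decode(), f"{h:08x}")
```

The single-hash dependency chain is untouched: each lane is still strictly sequential. The gain in a real implementation comes from advancing several such chains per instruction.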
Isn't explicit hardware support for SHA-3 rather limited? In particular, there's none on Intel and only A13 and A14 on Apple. It can still be vectorized to a degree on other CPUs, but in that case it'll be slower than BLAKE3.
For now, only a few extremely recent ARM cores have SHA-3 instructions.
On the other hand, support for SHA-256 and for SHA-1 (still useful for non-secure applications) is widespread: it is present in almost all 64-bit ARM CPUs, in all AMD Zen CPUs, and in some Intel CPUs, e.g. Apollo Lake, Gemini Lake, Jasper Lake/Elkhart Lake, Alder Lake, Tiger Lake and Ice Lake.