I've lost count of the times I've sped up existing code by removing the hand-optimized assembly and just using a plain C implementation or similar. Sure, it was hand-optimized assembly, but for a CPU that was 10+ years old.
If you're going to do hand-optimized code for a given platform, include the baseline code and measure at runtime to pick the implementation.
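To make that concrete, here's a rough sketch in C of what I mean. Everything in it is made up for illustration: `sum_baseline`, `sum_tuned`, `time_fn` and `pick_sum` are hypothetical names, and the "tuned" version is just a manually unrolled loop standing in for whatever platform-specific code you actually have. The timing uses `clock()`, which is coarse, so a real version would want more repetitions or a better timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*sum_fn)(const uint32_t *data, size_t n);

/* Baseline: plain C, left to the compiler to optimize for the current CPU. */
static uint64_t sum_baseline(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Stand-in for the hand-tuned variant (hypothetical): unrolled by four. */
static uint64_t sum_tuned(const uint32_t *data, size_t n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    for (; i < n; i++) {
        a += data[i]; /* leftover elements */
    }
    return a + b + c + d;
}

/* CPU time spent running fn over the sample `reps` times. */
static clock_t time_fn(sum_fn fn, const uint32_t *data, size_t n, int reps) {
    volatile uint64_t sink = 0; /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        sink += fn(data, n);
    }
    (void)sink;
    return clock() - start;
}

/* Measure both on sample data at startup and keep whichever is faster.
 * Falls back to the baseline if the tuned version disagrees or does not win. */
static sum_fn pick_sum(const uint32_t *sample, size_t n) {
    if (sum_tuned(sample, n) != sum_baseline(sample, n)) {
        return sum_baseline;
    }
    clock_t t_base = time_fn(sum_baseline, sample, n, 1000);
    clock_t t_tuned = time_fn(sum_tuned, sample, n, 1000);
    return (t_tuned < t_base) ? sum_tuned : sum_baseline;
}
```

Call `pick_sum` once at startup with representative data, stash the returned function pointer, and use that everywhere. The point is that the baseline always stays in the running, so when the tuned code stops winning on newer hardware, it just quietly stops getting picked.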
Hand-optimized assembly was necessary 30 years ago, and still good for several optimizations 20 years ago. But today's computers are SO FAST, it's just not necessary in most situations.