That particular implementation is less than ideal. It has issues with correctness (the FP32 accumulators were not precise enough for my use cases, so they needed an upgrade to FP64), performance (no SIMD), and, more subjectively, code quality (I tend to avoid global variables for things like that).
But the algorithm itself is awesome.
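The FP32 accumulator problem is easy to reproduce in isolation. Here is a minimal C++ sketch; it is not taken from the implementation in question, and the names `sum_fp32` and `sum_fp64` are made up for illustration. Both functions run the same loop and differ only in the accumulator type.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not part of the
// implementation discussed above.

// Adds 1.0f to an FP32 accumulator n times.
// FP32 has a 24-bit significand, so once the accumulator reaches
// 2^24 = 16777216, adding 1.0f no longer changes it.
float sum_fp32(std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += 1.0f;
    return acc;
}

// Same loop with an FP64 accumulator.
// FP64 has a 53-bit significand, so every integer in this range is exact.
double sum_fp64(std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += 1.0f;
    return acc;
}
```

On a typical IEEE 754 platform without fast-math style optimizations, `sum_fp32(20000000)` gets stuck at 16777216, while `sum_fp64(20000000)` returns 20000000. Real workloads rarely fail this cleanly, but the mechanism is the same: once the running total is large relative to the values being added, the FP32 accumulator silently drops low-order bits.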