Measuring implementations of the same algorithm is how you benchmark algorithms, not a language as a whole. If a language or library allows for enhancements that play to its strengths (in this case, the ability to code in early exits), those are perfectly valid when benchmarking languages and libraries.
Put another way, one of the benefits of plain Python (and one of the drawbacks of using an external library) is that you have more control over the algorithm and exactly what it does.
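As a minimal sketch of what that control buys you (the function names, the `n_checked` parameter, and the zero-check condition here are illustrative assumptions, not the code from the question): a hand-rolled loop can stop as soon as the answer is known, whereas a library call such as `np.convolve` always computes the full result before you can inspect it.

```python
import numpy as np

def zero_convolution_prefix(signal, kernel, n_checked=1):
    """Return True if the first n_checked outputs of the full linear
    convolution are all zero, bailing out on the first nonzero value."""
    for k in range(n_checked):
        # k-th sample of the full linear convolution, computed on demand
        acc = sum(signal[k - j] * kernel[j]
                  for j in range(len(kernel))
                  if 0 <= k - j < len(signal))
        if acc != 0:
            return False  # early exit: no need to finish the convolution
    return True

def zero_convolution_prefix_np(signal, kernel, n_checked=1):
    # Library version: the whole convolution is computed, no early exit.
    return not np.convolve(signal, kernel)[:n_checked].any()
```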
That said, a sample size of one will hardly give you an accurate picture.
> optimization by precomputing F values
> I also optimize by skipping the rest of the convolution if the first result isn't zero.
If you don't measure implementations of the same algorithm, you hardly have a fair language/library benchmark.
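A like-for-like measurement could look something like the sketch below (illustrative data and function names, not taken from the post): both sides implement the identical full convolution, so the timing difference reflects the language/library cost rather than an algorithmic shortcut.

```python
import timeit
import numpy as np

signal = [1, 2, 3] * 300
kernel = [1, -2, 1]

def convolve_pure(a, v):
    """Plain full linear convolution, mirroring what np.convolve computes."""
    out = [0] * (len(a) + len(v) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(v):
            out[i + j] += x * y
    return out

# Same algorithm on both sides, so the comparison isolates implementation cost.
t_pure = timeit.timeit(lambda: convolve_pure(signal, kernel), number=200)
t_numpy = timeit.timeit(lambda: np.convolve(signal, kernel), number=200)
print(f"pure Python: {t_pure:.3f}s  numpy: {t_numpy:.3f}s")
```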