1. Fortran code being 2x slower than anything for array operations means you're not using Fortran right. So this result says more about f2py generating bad Fortran code than anything else. Labeling that as "Fortran" is disingenuous – it should be labeled "f2py".
2. I'm surprised that SciPy and Scikit aren't faster than this. You can use clever techniques to compute pairwise distances asymptotically faster than the naïve loop approach. For example, see the Julia Distance.jl package [1], which gets a 125x speedup [2] over a naïve loop implementation for computing pairwise Euclidian distances. I haven't done a direct comparison but for simple scalar loops like this, Julia has C-speed, which means its naïve version would be about the same as Numba and Cython and the Distance.jl version would be correspondingly faster. Of course, the amount of speedup depends on the shape of the matrices, so 125x speedup might not happen in this particular case.
I think Jake was pretty straight forward in saying in the post that he isn't a fortran expert and was looking for someone to offer up a better version if he was doing something non-optimal.
That computation is probably something you were already familiar with in disguise: it's just the law of cosines [1][2] along with the fact that a matrix of inner products can be computed with a matrix-matrix multiplication [3].
The first claim is just that for vectors x and y in an inner-product space ||x-y||^2 = <x-y,x-y> = ||x||^2 + ||y||^2 - 2<x,y>, and the second claim is just that if you collect a bunch of vectors into the columns of X and another bunch into the columns of Y, then the (i,j) element of X'Y is <x_i,y_j>, i.e. the inner product of the i'th vector in X with the j'th vector in Y. (Bonus fact: you can use that relationship to translate the statement "this set of pairwise distances can be embedded in a Euclidean space" to an equivalent condition that some simple matrix is negative semidefinite on a subspace orthogonal to the all-ones vector. Euclidean metric embedding is very useful!)
Compiling in gfortran with -O3 sometimes solves that automatically for you, but you must try to use the correct order. And with less optimization the difference in the order of the index can produce very big difference in time dew to the cache problems.
Also, the C/phyton program uses a few trick like += and tmp variables. I don't know how they interact with the optimizations, -O3 does very strange things and perhaps they generate the same code.
The text is definitely clear about it if you read everything, but the chart could easily be misread as saying that Numba and Cython are 2x faster than hand-written Fortran. That's my main objection. I also think that writing a simple C version of this code would serve as a much better baseline than f2py-generated Fortran code.
That was definitely an inappropriate labeling of the Python benchmarks on the Julia home page. We fixed it as soon as it was pointed out and I'm grateful to whoever did point that out. I'm just doing the same thing here.
Dahua's explanation is very cool and he's really created an incredibly high-quality, comprehensive package for distance calculations. It's great work.
I attempted to implement the Julia technique in Cython using BLAS dgemm but was not able to beat Jake's original version. Not sure if I am doing something wrong or linking against a non-optimal BLAS library on my machine. Comments/feedback is welcome:
I don't know anything about Fortran and haven't examined the code, but this comment on the blog to Fernando Perez seems to have a plausible explanation:
> The C/Python-Code has its inner loop cache friendly, while the Fortran code has a cache unfriendly inner loop (because X(i,k) and X(i,k+1) are far away in memory). -- TS
Due to Fortran's memory model, its performance could potentially be drastically improved by altering the arrays such that the program is accessing consecutive columns rather than rows.
from:
r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k))
I've tried using it twice on real code and in both cases the numba code ended up significantly slower than straight python/numpy. Admittedly I didn't bother digging into what was going on under the covers and I'm sure someone who reallyt understood numba could probably get it to work better. But as some simple magic to make your python code faster without rewriting anything, numba hasn't worked for me on real code.
This has been my experience as well for "real world" tests, although I'm still excited about the project. I think once they get a graphical annotate in place akin to `cython -a`, this will help quite a bit in terms of figuring out what is going on.
1. Fortran code being 2x slower than anything for array operations means you're not using Fortran right. So this result says more about f2py generating bad Fortran code than anything else. Labeling that as "Fortran" is disingenuous – it should be labeled "f2py".
2. I'm surprised that SciPy and Scikit aren't faster than this. You can use clever techniques to compute pairwise distances asymptotically faster than the naïve loop approach. For example, see the Julia Distance.jl package [1], which gets a 125x speedup [2] over a naïve loop implementation for computing pairwise Euclidian distances. I haven't done a direct comparison but for simple scalar loops like this, Julia has C-speed, which means its naïve version would be about the same as Numba and Cython and the Distance.jl version would be correspondingly faster. Of course, the amount of speedup depends on the shape of the matrices, so 125x speedup might not happen in this particular case.
[1] https://github.com/lindahua/Distance.jl
[2] https://github.com/lindahua/Distance.jl#pairwise-benchmark