One of the gripes that I have with Julia is that if you write linear algebra code naively, you will have tons of unnecessary temporary allocations, while in Eigen (a C++ library) you can avoid most of these without sacrificing too much readability. (It even optimizes how to run matrix kernels on the fly!) Sure, you can rewrite your Julia code in C-style to remove those temporary allocations, but then the code becomes even less readable than what you can achieve in C++.
The naive Julia version has unnecessary allocations and therefore is 23% slower than the optimized version:
@inbounds for k = 2:60000
    Pp .= Fk_1 * Pu * Fk_1' .+ Q
    K .= Pp * Hk' * pinv(R .+ Hk * Pp * Hk')
    aux1 .= I18 .- K * Hk
    Pu .= aux1 * Pp * aux1' .+ K * R * K'
    result[k] = tr(Pu)
end
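You can verify the hidden temporaries with `@allocated`: the `.=` assignment itself is in place, but each matrix product on the right-hand side still materializes a fresh array first. A minimal sketch (the 4x4 sizes and the helper `g` are made up for illustration, not from the benchmark):

```julia
# Sketch: `C .= A * B` writes into C in place, but `A * B` still
# allocates a temporary matrix on the heap before the broadcast.
A = rand(4, 4); B = rand(4, 4); C = similar(A)

g(C, A, B) = (C .= A * B)

g(C, A, B)                         # warm up so compilation isn't counted
@assert @allocated(g(C, A, B)) > 0 # the product's temporary shows up here
```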
In order for this loop to match the C++ version you need to use C-style functions:
for k = 2:60000
    # Pp = Fk_1 * Pu * Fk_1' + Q
    mul!(aux2, mul!(aux1, Fk_1, Pu), Fk_1')
    @. Pp = aux2 + Q

    # K = Pp * Hk' * pinv(R + Hk * Pp * Hk')
    mul!(aux4, Hk, mul!(aux3, Pp, Hk'))
    mul!(K, aux3, pinv(R + aux4))

    # Pu = (I - K * Hk) * Pp * (I - K * Hk)' + K * R * K'
    mul!(aux1, K, Hk)
    @. aux2 = I18 - aux1
    mul!(aux6, mul!(aux5, aux2, Pp), aux2')
    mul!(aux5, mul!(aux3, K, R), K')
    @. Pu = aux6 + aux5

    result[k] = tr(Pu)
end
... which is quite dirty. But you can write the same thing in C++ like this (and even be a bit faster than Julia!):
for (int k = 2; k <= 60000; k++) {
    Pp = Fk_1*Pu*Fk_1.transpose() + Q;
    aux1 = R + Hk*Pp*Hk.transpose();
    pinv = aux1.completeOrthogonalDecomposition().pseudoInverse();
    K = Pp*Hk.transpose()*pinv;
    aux2 = I18 - K*Hk;
    Pu = aux2*Pp*aux2.transpose() + K*R*K.transpose();
    result[k-1] = Pu.trace();
}
which is much more readable than Julia's optimized version.
If Julia had a linear-algebra-aware optimizing compiler (without the sheer madness of the C++ template metaprogramming that Eigen uses), then Julia's standing in HPC would be much, much better. I admit that it's a hard goal to achieve, since I haven't seen any language attempt it; the closest thing I've seen is LLVM's matrix intrinsics (https://clang.llvm.org/docs/MatrixTypes.html), but those are only a proposal.
Oh, you're right. I naively thought that the Julia code was written with static-size arrays.
But still, wouldn't even StaticArrays.jl create some unnecessary allocations and copies when the code is written in the naive (clean) way? For example, when calculating something like z = A*x + B*y, A*x and B*y are allocated as temporaries in Julia's case, but you can avoid this in Eigen?
StaticArrays.jl will create arrays that are stack allocated and optimized just like in Eigen. There are no temporaries in that example.
Even if you used MArrays, which are small, statically sized mutable arrays, the intermediate temporaries would live on the stack and thus not cause any heap allocations.
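A quick way to check this, assuming StaticArrays.jl is installed (the 3x3 sizes and the helper names `f` and `no_heap_allocs` are mine, for illustration only):

```julia
using StaticArrays

f(A, B, x, y) = A*x + B*y       # the "z = Ax + By" pattern from above

function no_heap_allocs()
    A = @SMatrix rand(3, 3)
    B = @SMatrix rand(3, 3)
    x = @SVector rand(3)
    y = @SVector rand(3)
    f(A, B, x, y)               # warm up (compile)
    @allocated f(A, B, x, y)    # intermediates stay on the stack
end

@assert no_heap_allocs() == 0
```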
I suppose I should caveat my grandparent comment by saying that StaticArrays and MArrays are only good up to about 14 x 14, if memory serves. That rule of thumb might be out of date; the compiler might handle large tuples more efficiently these days.
I believe Eigen has similar problems, but I'm not sure.
However, there is work happening on large stack-allocated arrays in Julia: StrideArrays.jl can create stack-allocated arrays of any size that will physically fit on the stack.
In this code, the main problem, I think, is that intermediate results are being allocated, e.g., Fk_1 * Pu * Fk_1'. I will speculate that you could improve on the baseline code by preallocating these in the same way that Pp, K, aux1, and Pu are initialized outside the loop.
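That preallocation pattern, sketched minimally for just the first line of the loop (the 18x18 size is my assumption, to match `I18`; only the standard-library `LinearAlgebra` is used):

```julia
using LinearAlgebra

# Sketch: compute Pp = Fk_1 * Pu * Fk_1' + Q with no allocations
# inside the loop, by reusing one preallocated intermediate buffer.
n = 18
Fk_1 = rand(n, n); Pu = rand(n, n); Q = rand(n, n)
Pp  = similar(Pu)         # preallocated output
tmp = similar(Pu)         # preallocated buffer for Fk_1 * Pu

for k in 2:5              # a few iterations for illustration
    mul!(tmp, Fk_1, Pu)   # tmp = Fk_1 * Pu    (in place)
    mul!(Pp, tmp, Fk_1')  # Pp  = tmp * Fk_1'  (in place)
    Pp .+= Q              # fused broadcast, no temporary
end

@assert Pp ≈ Fk_1 * Pu * Fk_1' + Q
```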
Are you sure that the difference is due to the allocations? I would expect this to be dominated by the matrix multiplies or SVDs. Are you comparing with the same BLAS/LAPACK?
Edit: OK, I see those are small matrices. Then StaticArrays should be a nice contender here, both for speed and readability.
Here's an example: https://ronanarraes.com/tutorials/julia/my-julia-workflow-re...