I wrote up a small script to test this out in Python (at the bottom of this post). The results on my laptop are about 8.0 seconds for the unsorted version and 7.6 seconds for the sorted one. I'm assuming the discrepancy is much smaller for Python due to the slow nature and high overhead of the language (or at least the way I've used it here), but I would be interested to know: how would one go about finding out what the Python interpreter is doing beneath the surface?
Edit: After running with a wider range of parameters, the difference seems to always be roughly the same order of magnitude. To investigate further, I included the sort in the second timing as a double check, and for 3276800 elements it's still a bit faster overall when you sort the array.
#!/usr/bin/env python
import time

import numpy as np


def main(n=32768):
    # Random integers in [0, 256); values >= 128 get summed, as in the
    # classic branch-prediction example.
    arr = np.random.randint(0, 256, n)

    t0 = time.time()
    sum1 = do_loop(arr)
    t1 = time.time()

    arr = np.sort(arr)

    t2 = time.time()
    sum2 = do_loop(arr)
    t3 = time.time()

    assert sum1 == sum2
    print(" Unsorted execution time: {} seconds".format(t1 - t0))
    print("   Sorted execution time: {} seconds".format(t3 - t2))


def run_many(func):
    # Decorator: call the wrapped function 1000 extra times so the
    # timings are large enough to measure.
    def wrapper(arg):
        for t in range(1000):
            func(arg)
        return func(arg)
    return wrapper


@run_many
def do_loop(arr):
    tot = 0
    for i in arr:
        if i >= 128:
            tot += i
    return tot


if __name__ == '__main__':
    main()
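For reference, the variation mentioned in the edit above (folding the sort into the second timed region) would look something like the following sketch, reusing do_loop and the imports from the script; the function name is just illustrative:

def main_sort_timed(n=3276800):
    # Variation from the edit above: np.sort is included in the second timed
    # region, so the comparison is "loop over unsorted" vs "sort + loop over sorted".
    arr = np.random.randint(0, 256, n)

    t0 = time.time()
    sum1 = do_loop(arr)
    t1 = time.time()

    t2 = time.time()
    arr = np.sort(arr)
    sum2 = do_loop(arr)
    t3 = time.time()

    assert sum1 == sum2
    print("      Unsorted execution time: {} seconds".format(t1 - t0))
    print(" Sort + sorted execution time: {} seconds".format(t3 - t2))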
As you can see, the pure Python version is faster than the Numpy version, and also has a larger margin between unsorted and sorted. PyPy is of course faster than both, and also has an even greater margin between unsorted and sorted (2.63x faster).
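(The parent's code isn't shown here, but a pure-Python variant along those lines might look roughly like the sketch below: the same loop body as do_loop above, run over a plain list of Python ints instead of a NumPy array, since iterating over NumPy scalars carries a lot of per-element overhead.)

import random

def do_loop_list(values):
    # Same loop as do_loop in the script above, but over a plain list of ints.
    tot = 0
    for i in values:
        if i >= 128:
            tot += i
    return tot

values = [random.randrange(256) for _ in range(32768)]
assert do_loop_list(values) == do_loop_list(sorted(values))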
Good call on going pure Python. To take this a bit further, I made your changes and used Numba with @jit(nopython=True, cache=True), with some interesting results. If I do include the sorting in the timing:
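A minimal sketch of that kind of Numba change (the decorator and its flags come from the comment above; the loop body is assumed to be the same as do_loop, and the warm-up call keeps JIT compilation out of any timing):

import numpy as np
from numba import jit

@jit(nopython=True, cache=True)
def do_loop_numba(arr):
    # Same loop body as do_loop above; Numba compiles it to machine code.
    tot = 0
    for i in arr:
        if i >= 128:
            tot += i
    return tot

arr = np.random.randint(0, 256, 32768)
do_loop_numba(arr)  # warm-up call so compilation time isn't measured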
I'm not quite sure what you mean by "using python for micro benchmarks usually doesn't work". To the extent that "microbenchmarking" is useful at all, then sure it works. There are just different considerations from other languages, and it depends on which Python implementation and libraries you're using.
Also, while I'll grant you that using Python implies that convenience, programmer time, and/or development speed is a higher priority than performance, that doesn't at all mean that people who use Python "don't care about performance".
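For what it's worth, the standard library's timeit module is the usual tool for this kind of thing in Python. A rough sketch of timing the loop from the script above with it (the function is a re-creation of do_loop without the decorator, and the repeat/number counts are arbitrary):

import timeit
import numpy as np

def threshold_sum(arr):
    # Same loop as do_loop in the script above, without the run_many decorator.
    tot = 0
    for i in arr:
        if i >= 128:
            tot += i
    return tot

arr = np.random.randint(0, 256, 32768)
sorted_arr = np.sort(arr)

# timeit uses a high-resolution timer; taking the minimum of several
# repeats reduces noise from the scheduler and garbage collector.
print(min(timeit.repeat(lambda: threshold_sum(arr), number=100, repeat=5)))
print(min(timeit.repeat(lambda: threshold_sum(sorted_arr), number=100, repeat=5)))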
I don't think it's related to branch prediction in particular, just spooky action-at-a-distance where a change makes seemingly unrelated code much slower or faster.
What makes you say this? The author makes an extremely persuasive case, with perf timings, and additionally updated it with compiler enhancements from GCC and Intel that remove the branch mispredictions entirely and do perform as predicted.
I think you misunderstood me. I agree that the StackOverflow post has to do with branch prediction. I just meant that I don't think that's why the earlier poster thinks it's parallel to the situation described in the article.
It does make optimization easier, but until MIR landed, many of the best optimizations weren't really possible. The problem is that a lot of type information is lost between Rust and LLVM IR, which is where the compiler does the really serious optimizations. For example, Rust can't tell LLVM about its pointer-aliasing guarantees (immutable and mutable borrows can't exist at the same time without unsafe), so a lot of optimizations, like keeping a heavily used value from the heap in a register, are passed over because of conservative heuristics.
Now that MIR has landed, Rust will eventually get much better optimizations from both rustc and the LLVM optimization passes, but other things, like non-lexical scoping, are a much higher priority.
From what I understand, LLVM still doesn't take as much advantage of that information as it could, given Rust input. It's too geared toward the C family of languages. (But as sibling comment says, the problem was partially the Rust compiler's fault.)
Also: "Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#,[4][5][6] Common Lisp, Crystal, D, Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode, Julia, Lua, Objective-C, Pony,[7] Python, R, Ruby, Rust, CUDA, Scala,[8] and Swift."
Most of these are nothing like C when it comes to pointers and memory layout.
Again, from what I understand, `restrict` is not enough to convey everything Rust knows.
Further, those other languages may be nothing like C, but they're even less like Rust. So because of its heritage, even given the information Rust knows, LLVM simply doesn't have the optimization passes to take full advantage of it.
It's not only a MIR issue; it's also a question of how many optimizations should be allowed when you also need to be able to make some assumptions about your data in unsafe blocks. For example, if you simply added the obvious attributes everywhere, interior mutability (Cell, RefCell, UnsafeCell) would most certainly break. The exact rules around unsafe code are still being discussed (see https://github.com/nikomatsakis/rust-memory-model), so until that is done, some of these optimizations are very unlikely to be implemented, because they can make a lot of unsafe code illegal.
Note that the compiler already adds "obvious" attributes to function calls, being careful not to emit them for things that contain an `UnsafeCell`. This is why that type exists: as the building block for interior mutability that the compiler understands and can optimise around.
I guess having taken a compilers class a decade ago has given me a Dunning-Kruger sense of optimism but the explanation sounds kind of like bullshit to me. So what if LLVM doesn't have the type information? There's no type information at all in the machine code… optimize stuff yourself before you hand it to LLVM?
LLVM is literally designed as a production-quality optimisation/code-generation backend, so it seems perfectly reasonable for projects to rely on LLVM's optimisations without doing their own. There are other things people can work on than the large effort of language-specific optimisations that make some programs slightly faster (and a few programs a lot faster, but those are often microbenchmarks, or tight loops like the OP's that can be diagnosed with a profiler relatively easily and rewritten). It's definitely a good long-term goal, but it requires a lot of infrastructure, and it ends up duplicating a lot of the optimisations and analyses already in LLVM (e.g. inlining is a critical transformation for enabling other optimisations) before the custom optimisations give great results.
In any case, compiler optimisations are usually easiest to implement on a specific style of internal representation, and for Rust that is MIR, which was only introduced recently, after a very large refactoring of the compiler's internal pipeline.
As I understand it, this is partially the plan. Historically it has just been waiting for a compiler refactor to MIR (which is now complete). I'm not sure why more optimization RFCs haven't been created and prioritized.
I think it's kind of interesting what tiny little tweaks will affect code speed. I recently discovered that in Python, if you're only going to use an import for one or two functions, importing it locally shaves off a good bit of time depending on the function; in my case, 0.2 seconds!
I don't think the function itself matters; what's going on is that if you import it locally, the reference you're using is in the local namespace versus being in a module's namespace, which means it takes less time to get to the function object. I'm guessing the function is being called a fair number of times in a loop or some such similar manner?
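A rough way to see the lookup difference being described, using timeit (illustrative only; the absolute numbers will vary):

import timeit

# Attribute lookup through the module object on every call...
print(timeit.timeit("math.sqrt(2.0)", setup="import math", number=1000000))

# ...versus a name bound directly to the function object in the local scope.
print(timeit.timeit("sqrt(2.0)", setup="from math import sqrt", number=1000000))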
Ok. It wasn't clear what "importing it locally" meant. I've worked with some people who would use that phrase to mean "copy paste it into the local file".
Python is notable for how it harasses the filesystem. If your Python program is running off NFS or a parallel filesystem (e.g. GPFS, Panasas, Lustre, etc.), then you may get a lot more performance from avoiding imports, especially if pkg_resources is being used.
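A sketch of the kind of deferred import being discussed (the function and module names are just illustrative, with json standing in for a heavier dependency; the point is that the module, and the filesystem traffic behind it, is only touched when this code path actually runs):

def load_report(path):
    # Deferred import: the module is resolved on first call rather than at
    # program start-up, which avoids the import's filesystem hits entirely
    # if this function is never called.
    import json
    with open(path) as f:
        return json.load(f)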
like all generic interfaces in Rust, printing takes arguments by reference, regardless of whether they are Copy or not. The println! macro hides this from you by implicitly borrowing the arguments,
What's the reason for this? Seems a bit inconsistent that for functions you must explicitly pass with &, but for print it's automatic.
- it'd be annoying to have to write `println!("{}", &s);`,
- this is not an inconsistency one thinks about in practice: ownership is important in Rust, but the compiler does all the annoying checking, so IME one can just let it fade into the background and not think about it until the compiler tells you there's a problem,
- it's an "permissive" inconsistency: writing a & will still compile and give the same output,
- it's a syntax-level inconsistency created by the println! macro packaging up the normal syntax, not some special rules for special functions: there's still a literal & borrow in there (see the macro expansion),
- historical reasons, and no-one was annoyed enough (or even noticed it enough) to change it.
Hmm. If people thought it was annoying to write println!("{}", &s), they wouldn't use Rust in the first place, because you have to do that everywhere else for plain functions.
Being permissive makes it even worse, as there are now two inconsistent ways to do the same thing.
Maybe I need to understand Rust macros better. I guess this all boils down to: macros are a very bad idea, just as in other programming languages.
And if print can already get special treatment, surely it should be the kind where the value is copied if it fits in a basic type's size? That is, at least for int-like and float-like values, it should be automatic pass-by-value rather than automatic pass-by-reference.
The problem with this is that print is a macro, and macros are expanded before the type checker runs. That makes it impossible to know whether something can be copied, because that's type information.
I believe this could be a solvable problem? Imagine that the macro expands every parameter x to some language construct f(x), where f(x) has access to the type of x and the type of its result depends on the type of x: it returns a value for int- and float-like variables, and a reference for everything else.
It depends on the system. First, the int would be by value anyway, so it wouldn't matter.
But assume you handed it a pointer instead.
Some are properly marked readonly in declarations, some are not[1]
But this is a common problem with unmarked logging. The compiler will infer attributes all the way through things it has bitcode for.
(IE it can determine it to be readonly, readnone, whatever, as long as it can see calls to either functions marked readonly/readnone, or determine the called functions to be readonly/readnone)
But, usually, if you don't give it the whole program, it can't infer that your logging function does not modify its arguments.
So sadly, printf is not readonly, because it can do anything it wants in these callbacks.
(nobody uses the printf registration, though, so it may be time to just add a compiler option required to make it work, and not screw everyone else using printf over)
It may not. It's possible that this is happening because rustc is failing to mark the reference as both readonly and unaliased, whereas clang might mark the equivalent as both. You'd need to examine the IR from both front ends to really figure out the source of any differences in the generated ASM.
If you use printf and cout in the same program, then yes, you can have dramatic performance problems with cout unless you use `std::ios_base::sync_with_stdio(false);`.
It's not the same issue, but it's the same area of stupid stuff you have to find out about and roll your eyes at.
This is awful. Was just reading how Rust is all shiny and special and much better than awful, naughty C++, and now I read how the print method is magic goo because "developers don't want to type & in front of ints". Back in my day, bah, grumble...
Why is it faster to process a sorted array than an unsorted array?
https://stackoverflow.com/questions/11227809/why-is-it-faste...