How copying an int made my code 11 times faster

Cyphase · on Feb 20, 2017

This reminds me of this classic StackOverflow question:

Why is it faster to process a sorted array than an unsorted array?

https://stackoverflow.com/questions/11227809/why-is-it-faste...

MooMooMilkParty · on Feb 20, 2017

I wrote up a small script to test this out for python (at bottom of this post). The results on my laptop are about 8.0 seconds for unsorted and 7.6 seconds for the sorted version. I'm assuming that the discrepancy for python is much smaller due to the slow nature and high overhead of the language (or at least the way I've used it here), but I would be interested to know: how would one go about finding out what the python interpreter is doing beneath the surface?

Edit: After running with a wider range of parameters, it seems that the difference is always roughly the same order of magnitude. To investigate further, I included the sort into the second timing to double check and for 3276800 elements it's still a bit faster overall when you sort the array.

  #!/usr/bin/env python
  import time
  import numpy as np
  
  def main(n=32768):
      arr = np.random.randint(0, 256, n)
      t0 = time.time()
      sum1 = do_loop(arr)
      t1 = time.time()
  
      arr = np.sort(arr)
      t2 = time.time()
      sum2 = do_loop(arr)
      t3 = time.time()
  
      assert sum1 == sum2
      print(" Unsorted execution time: {} seconds".format(t1-t0))
      print(" Sorted execution time:   {} seconds".format(t3-t2))
  
  def run_many(func):
      def wrapper(arg):
          for t in range(1000):
             func(arg)
          return func(arg) 
      return wrapper
  
  @run_many
  def do_loop(arr):
      tot = 0
      for i in arr:
          if i >= 128:
              tot += i
      return tot
  
  if __name__ == '__main__':
      main()

Cyphase · on Feb 20, 2017

I tried this on my machine, then tried a pure Python version; I only changed three lines, to:

  import random
  ...
  arr = [random.randint(0, 256) for x in range(n)]
  ...
  arr = sorted(arr)

Here are my times:

  $ time python2.7 hackernews_13682929.py 
   Unsorted execution time: 4.33348608017 seconds
   Sorted execution time:   4.09405398369 seconds
  
  $ time python3.5 hackernews_13682929.py 
   Unsorted execution time: 4.4200146198272705 seconds
   Sorted execution time:   4.188237905502319 seconds
  
  $ time python2.7 hackernews_13682929_purepython.py
   Unsorted execution time: 0.981621026993 seconds
   Sorted execution time:   0.832424879074 seconds
  
  $ time python3.5 hackernews_13682929_purepython.py
   Unsorted execution time: 1.3005650043487549 seconds
   Sorted execution time:   1.157465934753418 seconds
  
  $ time pypy hackernews_13682929_purepython.py
   Unsorted execution time: 0.239459037781 seconds
   Sorted execution time:   0.0910339355469 seconds

As you can see, the pure Python version is faster than the Numpy version, and also has a larger margin between unsorted and sorted. PyPy is of course faster than both, and also has an even greater margin between unsorted and sorted (2.63x faster).

MooMooMilkParty · on Feb 20, 2017

Good call on going pure python. To take this a bit further I made your changes and used numba with @jit(nopython=True, cache=True), for some interesting results. If I do include the sorting into the timing:

    Unsorted execution time: 0.2175428867340088 seconds
    Sorted execution time:   1.133354663848877 seconds

And if I don't:

    Unsorted execution time: 0.21171283721923828 seconds
    Sorted execution time:   0.08376479148864746 seconds

rasz_pl · on Feb 20, 2017

Using python for micro benchmarks usually doesnt work. The mere fact you use python says you dont care about performance, but convenience.

Just for fun add "randomvariable = 1"(you know, 1 one cycle machine op) in a tight loop and watch tenths of a second added to your "benchmark".

Cyphase · on Feb 20, 2017

I'm not quite sure what you mean by "using python for micro benchmarks usually doesn't work". To the extent that "microbenchmarking" is useful at all, then sure it works. There are just different considerations from other languages, and it depends on which Python implementation and libraries you're using.

Also, while I'll grant you that using Python implies that convenience, programmer time, and/or development speed is a higher priority than performance, that doesn't at all mean that people who use Python "don't care about performance".

kccqzy · on Feb 20, 2017

How is this article related to branch prediction? I don't see any connection here.

nitinreddy88 · on Feb 20, 2017

I bet you didnt read the answer for that :)

Retr0spectrum · on Feb 20, 2017

I read it a while ago, and going on what I remember, I still can't see the connection. Would you mind explaining?

umanwizard · on Feb 20, 2017

I don't think it's related to branch prediction in particular, just spooky action-at-a-distance where a change makes seemingly unrelated code much slower or faster.

alexeldeib · on Feb 20, 2017

What makes you say this? The author provides an extremely persuasive case, with perf timings. Additionally updated it with compiler enhancements from GCC and Intel that remove the branch mispredictions entirely and do perform as predicted.

umanwizard · on Feb 20, 2017

I think you misunderstood me. I agree that the StackOverflow post has to do with branch prediction. I just meant that I don't think that's why the earlier poster thinks it's parallel to the situation described in the article.

alexeldeib · on Feb 21, 2017

Ah, my mistake. I thought the OP a bit further up didn't understand the relation of the stack overflow answer to the array question.

jstarks · on Feb 20, 2017

This is really surprising. I thought that part of Rust's pitch was that the explicit ownership tracking made optimizations much easier.

Is there a bug filed to fix this?

akiselev · on Feb 20, 2017

It does make optimization easier but until MIR landed, many of the best optimizations weren't really possible. The problem is that a lot of type information is lost between Rust and LLVM IR, where the compiler does the really serious optimizations. For example, Rust can't tell LLVM about its pointer aliasing guarantees (immutable and mutable borrows can't be done at the same time without unsafe) so a lot of optimizations like storing a highly used value from the heap in a register are passed over because of conservative heuristics.

Now that MIR has landed, Rust will eventually get much better optimizations from both rustc and the llvm optimization passes but other things are a much higher priority like non-lexical scoping.

aleden · on Feb 20, 2017

> For example, Rust can't tell LLVM about its pointer aliasing guarantees

False.

http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...

Rusky · on Feb 20, 2017

From what I understand, LLVM still doesn't take as much advantage of that information as it could, given Rust input. It's too geared toward the C family of languages. (But as sibling comment says, the problem was partially the Rust compiler's fault.)

gens · on Feb 20, 2017

C has a pointer aliasing keyword, "restrict".

Also: "Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#,[4][5][6] Common Lisp, Crystal, D, Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode, Julia, Lua, Objective-C, Pony,[7] Python, R, Ruby, Rust, CUDA, Scala,[8] and Swift."

Most of these are nothing like C when it comes to pointers and memory layout.

Rusky · on Feb 20, 2017

Again, from what I understand, `restrict` is not enough to convey everything Rust knows.

Further, those other languages may be nothing like C, but they're even less like Rust. So because of its heritage, even given the information Rust knows, LLVM simply doesn't have the optimization passes to take full advantage of it.

gens · on Feb 20, 2017

Ok. Like what ?

algesten · on Feb 20, 2017

The point was that Rust _can_ (more easily) tell LLVM now that MIR has landed.

jplatte · on Feb 20, 2017

It's not only a MIR issue, it's also an issue of how many optimizations should be allowed when you also need to be able to make some assumptions about your data in unsafe blocks. For example if you simply added the obvious attributes everywhere, interior mutability (Cell, RefCell, UnsafeCell) would most certainly break. The exact rules around unsafe rules are still being discussed (see https://github.com/nikomatsakis/rust-memory-model), so until that is done some of these optimizations are very unlikely to be implemented because they can make a lot of unsafe code illegal.

dbaupp · on Feb 20, 2017

Note that the compiler already adds "obvious" attributes to function calls, being careful to not emit them for things that contain an `UnsafeCell`. This is why that type exists, as the building block for interior mutability that the compiler understand and can optimise around.

fusiongyro · on Feb 20, 2017

I guess having taken a compilers class a decade ago has given me a Dunning-Kruger sense of optimism but the explanation sounds kind of like bullshit to me. So what if LLVM doesn't have the type information? There's no type information at all in the machine code… optimize stuff yourself before you hand it to LLVM?

dbaupp · on Feb 20, 2017

LLVM is literally designed as an production quality optimisation/code-generation backend, so it's seems perfectly reasonable for projects to rely on LLVM's optimisations without doing their own, especially when there are other things people can work on than the large effort to optimise some programs slightly better (and even fewer programs a lot better, but these are often microbenchmarks, or tight loops like the OP that can be diagnosed with a profiler relatively easily and rewritten) by doing language-specific optimisations. It's definitely a good long-term goal, but it requires a lot of infrastructure, and ends up requiring duplicating a lot of the optimisations and analysis already in LLVM (e.g. inlining is a critical transformation for enabling other optimisations) before one can get great results from the custom optimisations.

In any case, compiler optimisations are usually easy to implement on specific styles of internal representations, and this is MIR for Rust, which was only introduced recently, after requiring a very large refactoring of the compiler's internal pipeline.

yazaddaruvala · on Feb 20, 2017

As I understand it, this is partially the plan. Historically it has just been waiting for a compiler refactor to MIR (which is now complete). I'm not sure why more optimization RFCs haven't been created and prioritized.

Vexs · on Feb 20, 2017

I think it's kinda interesting what tiny little tweaks will affect code speed- I recently discovered that in python, if you're only going to use an import for one or two functions, importing it locally shaves off a good bit of time depending on the function, in my case, .2 seconds!

Cyphase · on Feb 20, 2017

I don't think the function itself matters; what's going on is that if you import it locally, the reference you're using is in the local namespace versus being in a module's namespace, which means it takes less time to get to the function object. I'm guessing the function is being called a fair number of times in a loop or some such similar manner?

fnord123 · on Feb 20, 2017

You could use `from some_module import some_func` to get the same effect as some_func will be in the local namespace.

Cyphase · on Feb 20, 2017

That's what the top-level commenter was doing.

fnord123 · on Feb 20, 2017

Ok. It wasn't clear what "importing it locally" meant. I've worked with some people who would use that phrase to mean "copy paste it into the local file".

Cyphase · on Feb 20, 2017

Fair point :P. I think I might have seen someone mean it like that as well in the past.

masklinn · on Feb 20, 2017

And a common stdlib trick is to bind these via function arguments e.g.

    def foo(a, b, thing=util.thing):
        ...

bbernoulli · on Feb 20, 2017

I agree, seems likely. gp can use dis.dis() to confirm this...

fnord123 · on Feb 20, 2017

Python is notable for how it harasses the filesystem. If your python program is running off NFS or parallel filesystems (e.g. gpfs, panasas, lustre, etc) then you may get a lot more performance from avoiding imports. Especially if pkg_resource is being used.

Too · on Feb 20, 2017

    like all generic interfaces in Rust, printing takes arguments by reference, 
    regardless of whether they are Copy or not. The println! macro hides this from 
    you by implicitly borrowing the arguments,

What's the reason for this? Seems a bit inconsistent that for functions you must explicitly pass with &, but for print it's automatic.

dbaupp · on Feb 20, 2017

The reason I can think of are:

- it'd be annoying to have to write `println!("{}", &s);`,

- this is not an inconsistency one thinks about in practice: ownership is important in Rust, but the compiler does all the annoying checking, so IME one can just let it fade into the background and not think about it until the compiler tells you there's a problem,

- it's an "permissive" inconsistency: writing a & will still compile and give the same output,

- it's a syntax-level inconsistency created by the println! macro packaging up the normal syntax, not some special rules for special functions: there's still a literal & borrow in there (see the macro expansion),

- historical reasons, and no-one was annoyed enough (or even noticed it enough) to change it.

Too · on Feb 20, 2017

Hmm. If people would think it was annoying to println!("{}", &s) they wouldnt use rust in the first place because you have to do that everywhere else for plain functions.

Beeing permissive makes it even worse as there are now 2 inconsistent ways to do the same thing.

Maybe I need to understand rust macros better. I guess this all boils down to: macros are a very bad idea, as in other programming languages.

acqq · on Feb 20, 2017

And if print can already have a special treatment, it should surely be the one where the value is copied if can fit the basic type size, that is, at least for int like and float like stuff not have automatic pass by reference but automatic pass by value to it?

JelteF · on Feb 20, 2017

The problem with this is print is a macro and that macros are expanded before the type checker. This makes it impossible to know if something can be copied or not, because this is type information.

acqq · on Feb 20, 2017

I believe it could be a solvable problem? Imagine that the macro expands every parameter x to some language construct f( x ) where f( x ) has access to the type of x and the type of its result is different depending of the type of x, returning a value for for int and float-like variables, and the reference for others.

domoritz · on Feb 20, 2017

Since LLVM does not have the necessary information to do the optimizations, I wonder whether the same problem occurs in C++ code compiled with clang.

DannyBee · on Feb 20, 2017

It depends on the system. First, the int would be by value anyway, so it wouldn't matter. But assume you handed it a pointer instead.

Some are properly marked readonly in declarations, some are not[1]

But this is a common problem with unmarked logging. The compiler will infer attributes all the way through things it has bitcode for. (IE it can determine it to be readonly, readnone, whatever, as long as it can see calls to either functions marked readonly/readnone, or determine the called functions to be readonly/readnone)

But, usually, if you don't give it the whole program, it can't infer that your logging function does not modify its arguments.

[1] Fun fact: glibc allows printf handler registration: https://www.gnu.org/software/libc/manual/html_node/Customizi...

So sadly, it's printf is not readonly, because it can do anything it wants in these callbacks.

(nobody uses the printf registration, though, so it may be time to just add a compiler option required to make it work, and not screw everyone else using printf over)

sharth · on Feb 20, 2017

In this particular case, you'd likely be using printf or cout, both of which perform a copy of integer parameters.

So you wouldn't see this there.

msbarnett · on Feb 20, 2017

It may not. It's possible that this may be happening because rustc is failing to mark the reference as both readonly and unaliased, whereas clang might mark the equivalent as both. You'd need to examine the IR from both front ends to really figure out the source of any differences in the generated ASM

shepmaster · on Feb 20, 2017

Both Rust playgrounds (official[1], my take[2]) allow viewing the LLVM IR of a Rust program.

The call to the print is

      call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)

The entire chunk of `main` preceding that call:

      %size = alloca i64, align 8
      %_3 = alloca %"core::fmt::Arguments", align 8
      %_8 = alloca [1 x %"core::fmt::ArgumentV1"], align 8
      %0 = bitcast i64* %size to i8*
      call void @llvm.lifetime.start(i64 8, i8* %0)
      store i64 33554432, i64* %size, align 8
      %1 = bitcast %"core::fmt::Arguments"* %_3 to i8*
      call void @llvm.lifetime.start(i64 48, i8* %1)
      %2 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i8*
      call void @llvm.lifetime.start(i64 16, i8* %2)
      %3 = ptrtoint i64* %size to i64
      %4 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i64*
      store i64 %3, i64* %4, align 8
      %5 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0, i32 1
      %6 = bitcast i8 (%"core::fmt::Void"*, %"core::fmt::Formatter"*)** %5 to i64*
      store i64 ptrtoint (i8 (i64*, %"core::fmt::Formatter"*)* @"_ZN4core3fmt3num54_$LT$impl$u20$core..fmt..Display$u20$for$u20$usize$GT$3fmt17hb872170870cc06d9E" to i64), i64* %6, align 8
      %7 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0
      %8 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 0
      store %str_slice* getelementptr inbounds ([2 x %str_slice], [2 x %str_slice]* @ref.8, i64 0, i64 0), %str_slice** %8, align 8, !alias.scope !1, !noalias !4
      %9 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 1
      store i64 2, i64* %9, align 8, !alias.scope !1, !noalias !4
      %_6.sroa.0.0..sroa_idx.i = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 1, i32 0, i32 0
      store %"core::fmt::rt::v1::Argument"* null, %"core::fmt::rt::v1::Argument"** %_6.sroa.0.0..sroa_idx.i, align 8, !alias.scope !1, !noalias !4
      %10 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 0
      store %"core::fmt::ArgumentV1"* %7, %"core::fmt::ArgumentV1"** %10, align 8, !alias.scope !1, !noalias !4
      %11 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 1
      store i64 1, i64* %11, align 8, !alias.scope !1, !noalias !4
      call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)

[1]: https://play.rust-lang.org/

[2]: http://play.integer32.com/

msbarnett · on Feb 20, 2017

Am I reading it right? It looks to me like rustc didn't mark the immutable borrow as readonly, so LLVM went ahead and assumed print could mutate it.

dbaupp · on Feb 20, 2017

I'm not 100% sure, but I believe LLVM doesn't have a way to understand the (im)mutability of pointers rewritten into memory, like the borrow is here.

fnord123 · on Feb 20, 2017

If you use printf and cout in the same program then yes you can have dramatic performance problems with cout unless you use `std::ios_base::sync_with_stdio(false);`

It's not the same issue but it's the same area of stupid stuff you have to find out about and roll your eyes.

lowbloodsugar · on Feb 20, 2017

This is awful. Was just reading how Rust is all shiny and special and much better than awful, naughty C++, and now I read how the print method is magic goo because "developers don't want to type & in front of ints". Back in my day, bah, grumble...

steveklabnik · on Feb 20, 2017

One of the reasons that ! is in macros is to let you know that they can do near-arbitrary things inside of their ()s.