Memory Access Microbenchmark: Clojure, F#, Go, Julia, Lua, Racket, Ruby, Self

lqdc13 · on Feb 22, 2014

The Python code is terrible. That's not how you code in Python. Also, this implementation seems to take more than 32 GB on my machine. Here is a refactor: https://gist.github.com/lqdc/9149772

Anyway, this problem should not be solved this way in Python in the first place. I would have stored everything in a Pandas table and computed stuff from there.

logicchains · on Feb 22, 2014

Thanks! I'm rerunning it and updating the tables progressively every few hours as optimisations are suggested. Feel free to make a pull request. Ideally within a day or so the implementations will have all been optimised enough to make for fairer comparison.

I understand that a Pandas table would be a better idea, but the purpose of this benchmark is comparing the speed of the raw languages.

lqdc13 · on Feb 22, 2014

I updated with a pandas script as well.

I think the strength of a language has in part to do with its libraries. You have to pick the right tool for the task.

The pandas version runs 3x faster than regular python, but 17x slower than C.

logicchains · on Feb 22, 2014

Great, I'll include that in the next run of the benchmark. Is it Python 3 or 2? (Does it work with Pypy?) Also, if you're comparing it to C you should compare it to C3.c, from the final tables, which uses bignums like Python does automatically.

Edit: I tried your faster_py.py, but it didn't seem any faster than Pypy2.py in the repo (running both with Pypy). I haven't yet got the Pandas version to work due to library compatibility issues (I've got about five versions of Python installed; still working on it).

lqdc13 · on Feb 22, 2014

It's about 2x faster in regular CPython 2.7 on my machine. In pypy it is actually way slower than your original one.

In many cases, pypy cannot be used, because it's not compatible with all the libraries.

logicchains · on Feb 22, 2014

Ah, right. I've included your version (converted to Python 3) as Python3_fst; it's around twice as fast as regular CPython 3.3.

lqdc13 · on Feb 22, 2014

Also, there is integer overflow in the C code, and maybe in others, because the sum doesn't fit in a 64-bit integer.

logicchains · on Feb 22, 2014

It mentions that later in the post. The implementations in the final two tables use bignums for fairness and to avoid overflow.
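To make the overflow point concrete, here is a minimal Python sketch. It assumes the usual two's-complement wraparound for a signed 64-bit accumulator (in C, signed overflow is formally undefined behaviour, so this only models what typically happens). The helper names and the sample values are hypothetical and are not taken from the benchmark.

```python
MASK64 = (1 << 64) - 1


def wrap_int64(value):
    """Reduce an integer to what a signed 64-bit two's-complement variable would hold."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def sum_wrapping(values):
    """Sum values while wrapping the running total to 64 bits after every addition."""
    total = 0
    for v in values:
        total = wrap_int64(total + v)
    return total


# Hypothetical inputs chosen so that the true sum exceeds 2**63 - 1.
prices = [2**62, 2**62, 2**62]

exact = sum(prices)             # Python ints are arbitrary precision (bignums)
wrapped = sum_wrapping(prices)  # what a wrapping 64-bit accumulator would report

print(exact)    # 13835058055282163712
print(wrapped)  # -4611686018427387904
```

The two results differ, which is why comparing a fixed-width implementation against one that silently uses bignums is not quite like for like.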

mattchamb · on Feb 22, 2014

I had a play around with your F# implementation so it's a bit cleaner and uses a bit more of an F#-ish style:

https://gist.github.com/mattchamb/20019b22ae841ff5ce1b

I'm not sure if it is any faster though, as each run varies quite widely from the previous one.

logicchains · on Feb 22, 2014

Interesting, thanks. It might be faster than FS.fs but not FS2.fs, as I think:

    tradesArray.[i] <- {
        TradeId = idx;
        ClientId = idx;
        ... }

would lead to the creation of a new object every time whereas:

    trades.[i].TradeId <- (int64 i)
    trades.[i].ClientId <- (int64 1) etc.

doesn't.
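As a rough analogy in Python (not the F# code itself), the two patterns could be sketched like this. The `Trade` class and both function names are hypothetical; the identity checks only show which pattern allocates new objects, and say nothing about how the .NET GC behaves.

```python
class Trade:
    """Hypothetical stand-in for the benchmark's trade record."""
    __slots__ = ("trade_id", "client_id")

    def __init__(self, trade_id=0, client_id=0):
        self.trade_id = trade_id
        self.client_id = client_id


def fill_by_replacing(trades):
    """Store a freshly allocated Trade in every slot (like the record expression)."""
    for i in range(len(trades)):
        trades[i] = Trade(trade_id=i, client_id=1)


def fill_in_place(trades):
    """Overwrite the fields of the existing objects (like the field assignments)."""
    for i, trade in enumerate(trades):
        trade.trade_id = i
        trade.client_id = 1


trades = [Trade() for _ in range(4)]
originals = list(trades)  # keep references so identities can be compared

fill_in_place(trades)
same_after_in_place = all(a is b for a, b in zip(originals, trades))

fill_by_replacing(trades)
same_after_replacing = any(a is b for a, b in zip(originals, trades))

print(same_after_in_place)   # True: no new objects were created
print(same_after_replacing)  # False: every slot now holds a new object
```

The first pattern leaves the old objects as garbage on every pass; the second reuses the same memory throughout.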

taspeotis · on Feb 22, 2014

If you wanted to, you could set GCSettings.LatencyMode [1] to LowLatency [2]

> Enables garbage collection that is more conservative in reclaiming objects. Full collections occur only if the system is under memory pressure, whereas generation 0 and generation 1 collections might occur more frequently ... This mode is not available for the server garbage collector.

Just for fun, though: you wouldn't want to run in LowLatency for too long.

[1] http://msdn.microsoft.com/en-us/library/system.runtime.gcset...

[2] http://msdn.microsoft.com/en-us/library/system.runtime.gclat...

logicchains · on Feb 22, 2014

Interesting. I suspect that for this particular program it'd have to be in low latency mode for the entire run. It doesn't seem to make much of a difference to the C#2 implementation, however (the one using sensible allocation), so I suspect that one isn't generating too much garbage.

mattchamb · on Feb 22, 2014

Yeah, that's where I had some problems understanding the intention of the test.

In my edited one, the trade object is immutable, so you could just reuse the same object without editing it. In FS.fs, a whole new set of objects is created/allocated on the heap for each test iteration; whereas in FS2.fs, pre-allocated memory is just being edited in place. The first approach seems much more likely to create GC pressure than the second approach.

Similarly, there are a few differences between the implementations in the different languages. For example, in CS.cs, the entire trades array is re-allocated in each iteration of the test, whereas in the F# tests it is allocated up front. Changing the C# implementation to allocate the array only once caused a large speedup (from about 5 secs down to about 1.5 secs on my machine).

Speaking of the C# implementation, here is a slightly cleaned up version with the altered array allocation: https://gist.github.com/mattchamb/9152487

As a side note, I changed it from using DateTime.Now to using the Stopwatch class. The reason for this is that DateTime.Now has limited timer resolution (about 10ms), which you can see in the remarks section of the msdn docs (http://msdn.microsoft.com/en-us/library/system.datetime.now(...)
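The same wall-clock versus high-resolution-timer distinction exists in other runtimes. As a Python analogy only (not the .NET API), here is a small sketch that times a call with `time.perf_counter()` and prints the resolution each clock reports on the current platform. The `time_call` helper is a hypothetical name, and the workload is arbitrary.

```python
import time


def time_call(fn, *args):
    """Return (result, elapsed_seconds) measured with a monotonic high-resolution clock."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Arbitrary workload, used only to have something to measure.
result, elapsed = time_call(sum, range(100_000))
print(f"sum = {result}, took {elapsed:.6f} s")

# Compare the wall clock with the performance counter on this machine.
for name in ("time", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name}: resolution={info.resolution}, monotonic={info.monotonic}")
```

The reported resolutions vary by operating system, so the printed values are worth checking on the machine that runs the benchmark.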

mattchamb · on Feb 22, 2014

Blah, I looked at the other C# implementations and saw you had already changed the array initialisation; please ignore that part :D

Edit: yeah, pretty much ignore me. The array initialisation for implementation 1 of F# and C# is almost incomparable to that of the C implementation, which is why the first implementations are so much slower.

If you wanted them to be more directly comparable, then for C you should have an array of pointers which you then fill with pointers to malloc'd addresses.

I should have noticed that sooner... I guess this is why you don't read code at 1am.

logicchains · on Feb 22, 2014

No problem. The style of implementations 1 of F# and C# is meant to mimic the style of Java1, the implementation from the original post that inspired the benchmark. F#2 and C#2 use a faster form of allocation. JavaUnsafe uses an implementation more directly comparable to C, but it's actually slower than the Java2 implementation.

logicchains · on Feb 22, 2014

It ran at about the same speed as my FS.fs, so I've updated FS.fs in the repo to use your code. Interestingly, I had to change "let main() =" to "let main =" to make it run.