I can explain what happened to the Array case. 100000 used to be the threshold at which `new Array(N)` or `arr.length = N` started to return a dictionary-backed array. Not anymore: this was changed by https://codereview.chromium.org/397593008 - now `new Array(100001)` returns a fast-elements array.
I will check out what happened to Buffer/TypedArray. Should not degrade that much unless something really goes south here.
The first major one is related to the mortality of a TypedArray's map (aka hidden class). When the typed array stored in the `Data` variable is GCed and there are no other Uint8Arrays in the heap, its hidden class is GCed too. This also causes the GC to find and discard all optimized code specialized for Uint8Arrays and to clear all type feedback related to Uint8Arrays from inline caches. When we later come back and reoptimize, the optimizing compiler treats cleared type feedback as a signal to emit a generic access through the IC (there is reasoning behind that: cleared feedback usually means the access is potentially going to be polymorphic anyway). I have filed the issue[1] for the root cause (mortality of the typed array's hidden class).
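A minimal sketch of the workaround (a commenter below reports trying exactly this): keep one typed array reachable for the whole run so the Uint8Array hidden class is never collected. The sieve shape and names here are illustrative, not the actual benchmark code:

```javascript
// Keeping one Uint8Array alive for the whole process means the Uint8Array
// hidden class can never be GCed, so optimized code and type feedback
// specialized for it survive garbage collection (illustrative workaround).
const keepAlive = new Uint8Array(1); // never let this become unreachable

function countPrimes(limit) {
  // A fresh typed array per call: without `keepAlive`, this could be the
  // only Uint8Array in the heap, and collecting it between runs would also
  // collect its hidden class and the code specialized for it.
  const data = new Uint8Array(limit);
  let count = 0;
  for (let i = 2; i < limit; i++) {
    if (data[i] === 0) {
      count++;
      for (let j = i * i; j < limit; j += i) data[j] = 1;
    }
  }
  return count;
}

console.log(countPrimes(100)); // 25 primes below 100
```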
Now there is a second, much smaller issue (which also explains the performance of the Buffer case) - apparently there were some changes in the optimization thresholds and OSR heuristics. After these changes we hit OSR at a different moment: e.g. I can see that we hit the inner loop that iterates over `j` instead of hitting the outer loop, which would lead to better code. In V8, OSR is implemented in a way that tries to produce optimized code that is suitable both for OSR and as normal function code - this is done by adding a special OSR entry block that jumps to the preheader of the loop we are targeting with OSR. This allows V8 to reuse the same optimized code for the normal entry without optimizing again - but it also leads to code quality issues if OSR does not hit the outer loop, because the OSR entry block inhibits code motion. This is a known problem and there are plans to fix it. The hit is usually quite small unless you have very tight nested loops (like in this case).
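To make the loop shapes concrete, here is a hedged sketch (not the actual benchmark code) of the nested loops in question. If the OSR entry block lands on the inner `j` loop rather than the outer `i` loop, it sits inside the loop nest and inhibits code motion across it:

```javascript
// Outer `i` loop vs. inner `j` loop: with the newer heuristics the inner
// loop's back-edge counter trips first, so the OSR entry block is placed
// there instead of at the outer loop (illustrative shape only).
function markComposites(limit) {
  const data = new Uint8Array(limit);
  for (let i = 2; i * i < limit; i++) {      // outer loop: the better OSR target
    for (let j = i * i; j < limit; j += i) { // inner loop: tight and hot,
      data[j] = 1;                           // where newer heuristics hit OSR
    }
  }
  return data;
}

const d = markComposites(20);
console.log(d[4], d[15], d[13]); // composites marked 1, primes stay 0
```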
Disabling OSR (--nouse-osr) not only "solves" the second issue but also partially fixes (hides) the first one: 1) we no longer optimize with partial type feedback, so we never emit a generic keyed access but always specialize it for the typed array; 2) we no longer emit an OSR entry, hence no code quality issues related to it.
Very interesting. After reading your comment, I tried allocating another Uint8Array and keeping it allocated throughout the entire test as a workaround for the issue you mentioned. Time for Node.js was unchanged, but io.js was down to about 5.5s now. Almost the same time as Node. Only about 10% slower.
The same happens when I use the --nouse-osr parameter that you mentioned.
Ok, I hadn't tested with both before. Keeping the array alive and using --nouse-osr makes io.js only 2.3% slower than my original measurement for Node 0.10.35. Median of 5058ms.
And Node 0.10.35 shows basically the same results as before. I see less than 1% difference - maybe just random fluctuation. Even if not, 1% is irrelevant.
I posted this reply on your site, but I will duplicate it here for the sake of HN readers:
> BTW --nouse-osr makes all three tests run faster.
As I tried to explain above: OSR as it is implemented now impacts code quality depending on which loop OSR hits, which in turn depends on heuristics that V8 uses. These heuristics are slightly different in newer V8; as a result, V8 hits the inner loop instead of the outer loop, which leads to worse code.
Code that benefits from OSR is code containing a loop which a) can be well optimized, b) runs long, and c) is entered only a few times in total. The Sieve benchmark is the opposite of this, and as a result it doesn't benefit from OSR - you get a bigger penalty from producing worse code and no benefit from optimizing slightly earlier.
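For contrast, a hedged illustration of the shape that does benefit from OSR under criteria a)-c): one long, easily optimized loop entered a single time, so jumping into optimized code mid-loop pays off:

```javascript
// One long loop, entered once: OSR lets V8 jump into optimized code in the
// middle of this single pass instead of waiting for the next invocation
// (which never comes). The Sieve's many entries into short, tight loops
// are the opposite shape.
function sumUpTo(n) {
  let s = 0;
  for (let i = 0; i < n; i++) s += i; // long-running, entered only once
  return s;
}

console.log(sumUpTo(10)); // 45
```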
Not using OSR for Sieve also hides the other issue with the mortality of typed arrays' hidden classes. I say "hides", not "fixes", because one can easily construct a benchmark where the mortality would still be an observable performance issue even when the benchmark itself is run without OSR: https://gist.github.com/mraleph/2942a14ef2a480e2a7a9