How Rust 1.26 more than tripled the speed of my code (troubles.md)
305 points by nonsince on May 14, 2018 | 60 comments



The code in the blog post does not match what is actually benchmarked. The reason `u256_full_mul` is so much faster than the inline assembly version is that it omits the upper part of the result, due to how the benchmark is done (cf. https://github.com/paritytech/bigint/blob/44a3133030306d9fcc...). There should be 16 full 64x64->128 multiplications (unless you're doing Karatsuba, which is not the case here) in a full 256-bit multiplication, but I only count 6 in the benchmark's inner loop (plus 4 64x64->64 multiplications). No wonder it's faster.
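For reference, a full schoolbook multiply looks roughly like this in Rust 1.26 (my own illustrative sketch, not the parity code): every pair of 64-bit limbs contributes one 64x64->128 partial product, which is where the count of 16 comes from.

    // Illustrative schoolbook 256x256 -> 512-bit multiply.
    // Inputs are four 64-bit limbs each, least significant first.
    fn full_mul(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
        let mut out = [0u64; 8];
        for i in 0..4 {
            let mut carry = 0u64;
            for j in 0..4 {
                // One of the 16 required 64x64 -> 128 multiplications.
                let t = a[i] as u128 * b[j] as u128 + out[i + j] as u128 + carry as u128;
                out[i + j] = t as u64;    // low half stays in place
                carry = (t >> 64) as u64; // high half carries into the next limb
            }
            out[i + 4] = carry;
        }
        out
    }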

The disassembly for the Rust output of `u256_full_mul` at the end is also terrible, and it looks like the output of a debug build. There's no reason an optimizing compiler should be outputting `pushfd` and `popfd` in performant code. The benchmarked loop does not have those instructions.


PUSH/POPF* instructions push the flags register to save comparison results when you need to reuse the registers containing the original comparison arguments. They're well-pipelined, integrated with the stack engine and very fast on modern CPUs, and I've definitely seen the compiler emit them (more on i686 than x86_64 to be fair).

I won't speak to the quality of the generated code otherwise, but this bit of evidence by itself doesn't seem persuasive.

(edit: pbsd is right: per Agner Fog, these guys are slow. It's likely that the LAHF/SAHF trick is the one I was remembering.)


Where are you getting that from? `pushfd` is OK-ish, but `popfd` is microcoded, requires 9 uops, and can only be issued once every ~20 cycles. That's not what I would call well-pipelined.

You could probably achieve the same result (assuming you want to avoid adc instructions, for some reason) by using setc r8 plus shr r8, 1 to get the carry back.


If you use `-C target-cpu=native` (Rust's equivalent of `-march=native`) you get code that uses `mulxq` in order to avoid `pushf`/`popf`. https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad...


It's still doing it, but now it uses `lahf` + `sahf` instead to save and restore the flags. These are better than pushf + popf for sure, but they cannot be used in general code because some early x86_64 chips forgot to implement them.


If your general code is only intended to be used on newer versions of Windows, you can. Windows has required it since 8.1.


It's possible that `pushfq`+`popfq` get turned into a `mov` from the flag register into the output register due to microcode-level optimisations.


That is not the case. On a Skylake chip, a pushfd + popfd roundtrip costs 24 cycles.


Just out of curiosity, how do you know this? Is it some kind of measurement you made yourself, written on Intel docs somewhere or something else?

I've been wondering about how much instructions cost lately and would like to be able to know this without the need of asking others.


Agner's instruction listings [1] and InstLatX64 [2] are good sources for a variety of chips. The 24 cycle figure above can be seen at https://github.com/InstLatx64/InstLatx64/blob/ee13abcfb1e2bb...

[1] http://www.agner.org/optimize/#manuals

[2] https://github.com/InstLatx64


There are also the Intel optimization manuals.


The behavior with "m" is quite strange. In theory, it should produce a memory operand pointing directly to where the function argument was passed on the stack, rather than copying it to registers and then back onto the stack. I can reproduce this with Rust nightly:

https://play.rust-lang.org/?version=nightly&mode=release

However, Clang appears to do the right thing with equivalent C code:

https://godbolt.org/g/B1sjfS

This is strange because Rust and Clang both use LLVM for their backend.

Looking at the LLVM IR produced, Rust's IR straightforwardly corresponds to the source code: a load instruction (to evaluate `self_t[0]`) whose result is passed to an asm instruction:

    %0 = getelementptr inbounds %U256, %U256* %self, i64 0, i32 0, i64 0
    %1 = load i64, i64* %0, align 8
    tail call void asm sideeffect "mulq $0", "m,~{dirflag},~{fpsr},~{flags}"(i64 %1) #1, !srcloc !0
But Clang's IR omits the load and transforms the "m" constraint to "*m":

    %2 = getelementptr inbounds %struct.U256, %struct.U256* %0, i64 0, i32 0, i64 0, !dbg !24
    call void asm sideeffect "mulq $0", "*m,~{dirflag},~{fpsr},~{flags}"(i64* nonnull %2) #2, !dbg !26, !srcloc !27
Hmm… seems to be related to this Rust bug: https://github.com/rust-lang/rust/issues/16383


> especially since the original code used highly-optimised assembly incantations from our resident cycle wizard.

He/she might be cool, but I think someone familiar with assembly optimisation would think to look at the generated assembly as well after noticing the speedup isn't what they expected. Don't go hand-writing assembly routines thinking your code is fast without actually checking whether your code is fast. :)


They're certainly familiar with assembly itself, but the `asm!` macro and the related constructs in Clang/GCC are a beast of their own with a lot of footguns (like GCC allocating registers as overlapping by default). I don't begrudge them writing suboptimal assembly, since it already beat our Rust implementation by a significant margin.


And correct.


Are you sure those are missed optimization opportunities where LLVM copied values to registers as temporary storage instead of a direct copy to the destination register? I saw the temporary registers used below the lines in question. Perhaps it was just a cheap copy of the value to a register which will be used later because the original destination register was clobbered.

Also, I would hazard a guess that pushing flag state into a register is a lot faster than pushing it onto the stack.

Finally, it'd be interesting to see the original hand-rolled inline assembly used with "r" instead of "m" to see how LLVM stacks up against the intended version without the redundant memory loads/stores.


Great catch, you're absolutely right. LLVM is smarter than me once again.


"decreased the execution time by a factor of 3"

So the article is about a flaw in previous versions of Rust: not supporting 128-bit numbers on x86_64. But the title makes it sound like "switching to Rust was the breakthrough". Or am I being paranoid about it?


> But the title makes it sound like "switching to Rust was the breakthrough". Or am I being paranoid about it?

Since titles keep changing on HN (for good reasons), I can’t be sure you saw the same title that I did, but if you did I disagree.

The current title is “How Rust 1.26 more than tripled the speed of my code”.

Because that title specifies a specific version number, I expected the blog post to be about exactly the kind of thing it was: some new feature or improvement in that version of Rust which would make their software run faster compared to the previous version of Rust they were using.


They probably wouldn't have mentioned the specific version if they were talking about switching languages instead of switching compiler versions.


The original title is the more precise "How a Rust upgrade more than tripled the speed of my code". Also, I think mentioning the version made it clear that it's about a specific version, not about Rust in general.


Maybe they should change it to: "How an upgrade to Rust 1.26 more than tripled the speed of my code"


I'd say it's more about how LLVM outperformed their previous hand-written assembly (the latter was impeded by crud from the asm! macro).


That would have done nothing if Rust hadn't stabilized u128/i128.


Happy to see alternative language features reducing the need for inline assembly, not just with the 128-bit integer support in 1.26 but also with the stable SIMD support coming in 1.27 (so crates like https://users.rust-lang.org/t/jetscii-now-works-with-future-... can at last work on stable). Sadly the current implementation of inline assembly is so inextricably tied to LLVM that there's resistance to stabilizing it in its current form, though in this case I think the perfect is being allowed to be the enemy of the good^W good-enough^W janky-but-acceptable.


There's a proposal to introduce a double-wide mul, which should make this much more ergonomic to solve.

https://github.com/rust-lang/rfcs/pull/2417
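Until something like that lands, a rough stand-in on stable Rust 1.26+ is to widen through u128 (just a sketch of mine, not what the RFC itself proposes):

    // Hypothetical helper: 64x64 -> (high, low), roughly what a dedicated
    // double-wide multiply would return directly.
    fn widening_mul(a: u64, b: u64) -> (u64, u64) {
        let wide = a as u128 * b as u128;
        ((wide >> 64) as u64, wide as u64)
    }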


What other factors played a role in your decision to use Rust over other languages, when 128-bit arithmetic played such an important role that you wrote inline assembly for it?


I imagine mistakes can be extremely expensive in the cryptocurrency world, so a language that may help prevent them while also being fast would be a solid choice.


Higher-precision arithmetic is important but is not our primary product. The existing code was performant enough for our use case (although the increased speed is helpful). Rust has many benefits, and I find it hard to imagine how anyone manages to get any performance-sensitive work done in the cryptocurrency industry with anything else.


What other languages are sensible options for when you need to work with inline assembly?


From just reading the discussion so far, I'd like to see how a naive implementation in Julia would do (Julia does type inference, and some quick duck-duck-ing indicates that an unsigned 128-bit int is the biggest regular number type, before having to go to BigInt).

[ed: ok, from a skim, I see this is actually about 256-bit multiplication, which makes me curious how just using BigInts and * (the mul operator) would work in Julia.

Also, I don't get this:

> u256_mul multiplies two 256-bit numbers to get a 256-bit result (in Rust, we just create a 512-bit result and then throw away the top half but in assembly we have a separate implementation)

How do they know the result will fit in 256 bits? Sounds like they know more about the arguments than just that both are 256 bits long?

Or do they want multiply and shift?]


Julia has `widemul`, which takes two `Int64` and produces an `Int128`. It is also not too hard to add your own primitive type for `Int256` (with the caveat that it needs support from LLVM, which I haven't checked).


That doesn't say it will fit in 256 bits; it says it is truncated to 256 bits.


C has had strong support for inline assembly for a long time, and in fact for a short period we called out to C in order to use inline assembly on stable (the function call overhead was worthwhile compared to how much faster the assembly implementation of the function was).


Their assembly implementation can be improved, at least when targeting modern hardware.

Here's an article explaining how to use new instructions to implement large-integer multiplication: http://www.intel.com/content/dam/www/public/us/en/documents/...


If you use Rust's equivalent of `-march=native` you get faster code that can target modern hardware. We would have to get another factor-of-two speedup for the maintenance burden of using inline assembly to be worthwhile.

https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad...


>”When Rust does a * b when a and b are both u64 the CPU actually multiplies them to create a 128-bit result and then Rust just throws away the upper 64 bits.”

Is that right? Wouldn't that answer look like nonsense? I would have guessed it tossed the lower 64 bits so it looked like rounding by truncation, at least.


That's how floats work, not ints. Unsigned integers more or less always wrap on overflow, which is what you want, generally speaking. Essentially, the operation is carried out modulo the size of the type.

For signed integers, overflow is undefined in C/C++, but generally results in wrapping as well.


We're talking about Rust, not C/C++.


Just to record the difference.

Unsigned and signed overflow in Rust are both well defined to do one of two things: panic or wrap. If debug assertions are enabled, they are defined to panic. In practice, if debug assertions are not enabled they will wrap, but I believe that is technically subject to change (into a panic). It will realistically only change if the hardware gets much better at catching overflows quickly.

A panic is usually like an exception that you are really not expected to ever catch except at thread boundaries. You can turn them into an immediate process exit with a compiler flag.


> I believe that is technically subject to change

Yes, that's correct.


> except at thread boundaries

Also at FFI boundaries. (i.e. a panicky Rust function, supplied to foreign code as a callback)


Yeah, but I interpreted the question as "is that how it should work?", and then it's useful to compare with different languages. My point was that this is standard behaviour, but there's a slight quirk in C when the ints are signed.


FTR Rust's types (like u64) have several methods for different types of multiplication if you want: checked, saturating, wrapping, overflowing

https://doc.rust-lang.org/std/primitive.u64.html#method.chec...
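For example (a quick illustration of my own, with a value picked to force overflow):

    fn main() {
        let a: u64 = 1 << 63;
        // A plain `a * 2` overflows: it panics with debug assertions enabled, wraps otherwise.
        assert_eq!(a.checked_mul(2), None);                // overflow reported as None
        assert_eq!(a.saturating_mul(2), u64::max_value()); // clamps to the maximum
        assert_eq!(a.wrapping_mul(2), 0);                  // modulo 2^64
        assert_eq!(a.overflowing_mul(2), (0, true));       // wrapped value plus an overflow flag
    }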


Well, the semantics for a number that's too big to fit into the result can be debated, so there's no one right answer and all are nonsense in some way or another.

You talk about rounding, but if the number's too big then you're only ever going to round to the maximum representable number, so we might as well call that saturation.

Discarding the upper bits means you are implementing modular arithmetic.

Both saturation and modular arithmetic are well defined, and can both be reasonable choices. I think most languages implement modular arithmetic, so that's what most programmers expect, and it seems a reasonable choice to me compared to saturation.


If you just keep the higher 64 bits, what kind of sense does that make? We're talking about integers here, not floats.

At least keeping the lower 64 bits is the usual overflow behavior, and if the result of the multiplication fits in 64 bits it just works fine.


There are cases for keeping just the high bits.

For example, many language runtime libraries have pseudo-random number generators (PRNGs) that use a 32-bit linear congruential generator (LCG). Suppose you want to create a rand(n) function which returns an integer between 0 and n-1. One good way of doing this is multiplying the lcg_output (a 32-bit unsigned integer) by n and returning the high 32 bits.

It's better to do this than a modulus operation because the low bits of LCG sequences are highly predictable.

(This is how the Delphi Random(n) function works.)
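In Rust it would look something like this, if I remember the trick correctly (my own sketch, names made up):

    // Map a 32-bit LCG output x to the range 0..n by taking the high half
    // of the 32x32 -> 64 product, rather than `x % n`, which would lean on
    // the predictable low bits.
    fn rand_below(x: u32, n: u32) -> u32 {
        ((x as u64 * n as u64) >> 32) as u32
    }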


By the way, it's only on x86-64 that the CPU's multiplication instruction always returns both the upper and lower 64 bits (in two different registers).

On, say, ARM64, there are two separate instructions: MUL is the normal one that takes two 64-bit inputs and produces one 64-bit output (the lower half of the product), while UMULH ("Unsigned Multiply High") performs the same multiplication but gives you only the upper half. This way, in the common case where you don't care about the upper half, the CPU doesn't have to waste time calculating it.
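In Rust the portable way to ask for just the high half is to widen and shift; as far as I know LLVM lowers this to UMULH on ARM64 and a single widening MUL on x86-64 (a sketch of mine, not from the article):

    // High 64 bits of a 64x64-bit product.
    fn mul_high(a: u64, b: u64) -> u64 {
        ((a as u128 * b as u128) >> 64) as u64
    }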


I wouldn't say that it's only x86. MIPS does the same thing, but doesn't even stick the result in the GPRs: you issue a multiply, and the result ends up in the HI and LO registers, which you have to mfhi and mflo to move back into GPRs.


Oh, good point, I've seen that before but forgot about it. Though to be fair, that's 32x32->64, not 64x64->128 :)


The combined HI and LO are 64x64->128 on MIPS64. You just have to use dmult instead of mult.


Huh? IMUL is perfectly happy to give you just the low-half of the product.


It doesn't make sense to throw away either. Remember, the 64-bit numbers can be small too, so throwing away the lower half would turn 2*2 into 0.


It's right. This is all integer arithmetic, so you wouldn't want to just drop the lower 64 bits.


I think it's correct as stated. Usually machine integer arithmetic is performed modulo 2^bits.


Could this be compared with C/gcc and C++/g++? Because this is what Rust is aiming to displace.


Sure; this is exposing LLVM's support, so it should be the same as clang.


* compared to Rust 1.25


An unnecessary distinction, as it can quite easily be inferred from the current title. Despite this, there seem to be a lot of people nitpicking the semantics of the title in this thread. It's unnecessary noise and detracts (and distracts) from the real conversation.


It's not just this thread. HN is absolutely obsessed with nitpicking titles and yes, it's distracting and detracts from actual discussion of the topics.

The moderation team here tacitly encourages it by catering to it. I've seen some threads go through 2 or 3 title changes as people find something new to complain about.

Really it's almost a form of editorializing in itself - the title of the article is what it is, people here just think they know better than the author.



