The code in the blog post does not match what is actually benchmarked. The reason `u256_full_mul` is so much faster than the inline assembly version is that it omits the upper part of the result, due to how the benchmark is done (cf. https://github.com/paritytech/bigint/blob/44a3133030306d9fcc...). There should be 16 full 64x64->128 multiplications (unless you're doing Karatsuba, which is not the case here) in a full 256-bit multiplication, but I only count 6 in the benchmark's inner loop (plus 4 64x64->64 multiplications). No wonder it's faster.
The disassembly for the Rust output of `u256_full_mul` at the end is also terrible, and it looks like the output of a debug build. There's no reason an optimizing compiler should be outputting `pushfd` and `popfd` in performant code. The benchmarked loop does not have those instructions.
PUSH/POPF* instructions push the flags register to save comparison results when you need to reuse the registers containing the original comparison arguments. They're well-pipelined, integrated with the stack engine and very fast on modern CPUs, and I've definitely seen the compiler emit them (more on i686 than x86_64 to be fair).
I won't speak to the quality of the generated code otherwise, but this bit of evidence by itself doesn't seem persuasive.
(edit: pbsd is right: per Agner Fog, these guys are slow. It's likely that the L/SAHF trick is the one I was remembering.)
Where are you getting that from? `pushfd` is OK-ish, but `popfd` is microcoded, requires 9 uops, and can only be issued once every ~20 cycles. That's not what I would call well-pipelined.
You could probably achieve the same result (assuming you want to avoid adc instructions, for some reason) by using setc r8 plus shr r8, 1 to get the carry back.
It's still doing it, but now it uses `lahf` + `sahf` instead to (re)store the flags. These are better than pushf + popf for sure, but they cannot be used in general code because some early x86_64 chips forgot to implement them.
The behavior with "m" is quite strange. In theory, it should produce a memory operand pointing directly to where the function argument was passed on the stack, rather than copying it to registers and then back onto the stack. I can reproduce this with Rust nightly:
This is strange because Rust and Clang both use LLVM for their backend.
Looking at the LLVM IR produced, Rust's IR straightforwardly corresponds to the source code: a load instruction (to evaluate `self_t[0]`) whose result is passed to an asm instruction:
> especially since the original code used highly-optimised assembly incantations from our resident cycle wizard.
He/she might be cool, but I think someone familiar with assembly optimisation would think to look at the generated assembly as well, after noticing the speedup is not what you expect. Don't go hand-writing assembly routines thinking your code is fast without actually checking whether your code is fast. :)
They're certainly familiar with assembly itself, but the `asm!` macro and the related constructs in Clang/GCC is its own beast and has a lot of footguns (like GCC allocating registers as overlapping by default). I don't begrudge them writing suboptimal assembly, since it already beat our Rust implementation by a significant margin.
Are you sure those are missed optimization opportunities where llvm copied values to registers as temporary storage instead of a direct copy to the destination register? I saw use of the temporary registers below the lines in question. Perhaps it was just a cheap copy of the value to a register which will be used later because the original destination register was clobbered.
Also I would hazard a guess that pushing flag state into a register is a lot faster than the stack.
Finally, it'd be interesting to see the original hand-rolled inline assembly used with "r" instead of "m" to see how llvm stacks up against the intended version without the redundant memory load/stores.
So article is about a flew in previous versions of Rust not supporting 128-bit numbers on x86_64. But title makes it sound like "switching to Rust was the breakthrough". Or am i paranoid about it..
> But title makes it sound like "switching to Rust was the breakthrough". Or am i paranoid about it..
Since titles keep changing on HN (for good reasons), I can’t be sure you saw the same title that I did, but if you did I disagree.
The current title is “How Rust 1.26 more than tripled the speed of my code”.
Because that title specifies a specific version number, I expected the blog post to be about exactly the kind of thing it was; some new feature or improvement in that version of Rust which would make their software run faster compared to the previous version of Rust they were using.
The original title is the more precise "How a Rust upgrade more than tripled the speed of my code". Also I think mentioning the version made it clear that it's about a specific version, not about rust in general.
Happy to see alternative language features reducing the need for inline assembly, not just with the 128-bit integer support in 1.26 but also with the stable SIMD support coming in 1.27 (so crates like https://users.rust-lang.org/t/jetscii-now-works-with-future-... can at last work on stable). Sadly the current implementation of inline assembly is so inextricably tied to LLVM that there's resistance to stabilizing it in its current form, though in this case I think the perfect is being allowed to be the enemy of the good^W good-enough^W janky-but-acceptable.
What other factors played a role in your decision to use rust over other languages when 128bit arithmetic played such an important role you wrote inline assembly for it?
Higher-precision arithmetic is important but is not our primary product. The existing code was performant enough for our usecase (although the increased speed is helpful). Rust has many benefits and I find it hard to imagine how anyone manages to get any performance-sensitive work done in the cryptocurrency industry with anything else.
From just reading the discussion so far, I'd like to see how a naive implementation in Julia would do (Julia does type inference, and some quick duck-duck-ing indicates that unsigned 128 bit int is the biggest regular number type - before having to go to bigint).
[ed: ok, from a skim, I see this is actually about 256bit multiplication - which makes me curious how just using bigints and * (mul operator) would work in Julia.
Also, I don't get this:
> u256_mul multiplies two 256-bit numbers to get a 256-bit result (in Rust, we just create a 512-bit result and then throw away the top half but in assembly we have a seperate implementation)
How do they know the result will fit in 256 bits? Sounds like they know more about the arguments than both having to be 256 bits long?
Julia has `widemul` which takes two `Int64` and produces an `Int128`. It is also not to hard to actually add your own primitive type for `Int256` (with the caveat that it needs support from LLVM, which I haven't checked)
C has had strong support for inline assembly for a long time, and in fact for a short period we called out to C in order to use inline assembly on stable (the function call overhead was worthwhile compared to how much faster the assembly implementation of the function was).
If you use Rust's equivalent of `-march=native` you get faster code that can target modern hardware. We would have to get another factor-of-two speedup for the maintainance burden of using inline assembly to be worthwhile.
>”When Rust does a * b when a and b are both u64 the CPU actually multiplies them to create a 128-bit result and then Rust just throws away the upper 64 bits.”
Is that right? Wouldn’t that answer look like nonsense? I would have guessed it tossed the lower 64bits so it looked like rounding by truncation at least.
That's how floats work, not ints. Unsigned integers more or less always wrap on overflow, which is what you want, generally speaking. Essentially, the operation is carried out modulo the size of the type.
For signed integers, overflow is undefined in C/C++, but generally results in wrapping as well.
Unsigned and signed overflow in rust are both well defined to do one of two things, panic or wrap. If debug assertions are enabled they are defined to panic. In practice if debug assertions are not enabled they will wrap, but I believe that is technically subject to change (into a panic). It will realistically only change if the hardware gets much better at catching overflows quickly.
A panic is usually like a exception that you are really not expected to ever catch except at thread boundaries. You can turn them into a immediate process exit with a compiler flag.
Yeah, but I interpreted the question as "is that how it should work?", and then it's useful to compare with different languages. My point was that this is standard behaviour, but there's a slight quirk in C when the ints are signed.
Well semantics for a number that's too big to fit into the result can be debated, so there's no one right answer and all are nonsense in some way or another.
You talk about rounding but if the number's too big then then you're only ever going to round to the max number you could use, so we might as well call that saturation.
Discarding the upper bits means you are implementing modular arithmetic.
Both saturation and modular arithmetic are well defined, and can both be reasonable choices. I think most languages implement modular arithmetic, so that's what most programmers expect, and it seems a reasonable choice to me compared to saturation.
For example, many language runtime libraries have pseudo random number generators (PRNGs) use a 32-bit linear congruential generator (LCG). Suppose you want to create a rand(n) function which returns an integer between 0 and n-1. One good way of doing this is multiplying the lcg_output (a 32-bit unsigned integer) by n, and returning the high 32 bits.
It's better to do this than a modulus operation because the low bits on LCG sequences are highly predictable.
(This is how the Delphi Random(n) function works.)
By the way, it's only on x86-64 that the CPU's multiplication instruction always returns both the upper and lower 64 bits (in two different registers).
On, say, ARM64, there are two separate instructions: MUL is the normal one that takes two 64-bit inputs and produces one 64-bit output (the lower half of the product), while UMULH ("Unsigned Multiply High") performs the same multiplication but gives you only the upper half. This way, in the common case where you don't care about the upper half, the CPU doesn't have to waste time calculating it.
I wouldn't say that it's only x86. MIPS does the same thing, but doesn't even stick the result in the GPRs, you issue a multiply, and the result ends up in the hi and lo registers that you have to mfhi and mflo to move back into GPRs.
An unnecessary distinction as it can quite be easily inferred from the current title. Despite this there seem to be a lot of people nitpicking on semantics of the title in this thread. It's unnecessary noise and detracts (and distracts) from the real conversation.
It's not just this thread. HN is absolutely obsessed with nitpicking titles and yes, it's distracting and detracts from actual discussion of the topics.
The moderation team here tacitly encourages it by catering to it. I've seen some threads go through 2 or 3 title changes as people find something new to complain about.
Really it's almost a form of editorializing in itself - the title of the article is what it is, people here just think they know better than the author.
The disassembly for the Rust output of `u256_full_mul` at the end is also terrible, and it looks like the output of a debug build. There's no reason an optimizing compiler should be outputting `pushfd` and `popfd` in performant code. The benchmarked loop does not have those instructions.