Ideal divisors: when a division compiles down to just a multiplication (lemire.me)
141 points by chmaynard on April 28, 2021 | hide | past | favorite | 67 comments


If you've never seen it before, this is a beautiful blog post on related matters: "Optimising pointer subtraction with 2-adic integers" http://blog.sigfpe.com/2010/05/optimising-pointer-subtractio...


A few months back I wrote a blog post very much inspired by Dan’s post, though a little more academic: https://kevinventullo.com/2020/12/21/2-adic-logarithms-and-f...

The idea is to exploit deeper structures in p-adic number theory to develop useful and novel algorithms with unsigned integer arithmetic, e.g. fast integer exponentiation. Feedback welcome!

(Warning: math heavy)


This was a nice post, thank you! I don't yet have an intuition for how to think about these 2-adic exponentials and logs (e.g. what does it mean to say that log_2(5) = 6713115954038056572) but clearly the mathematics works out and even leads to faster code, so it's very cool and I hope to read it again and understand it better.

(As a sidenote on the motivation, I do wonder how common/useful it is to compute x^y mod (2^64) for random 64-bit integers x and y (so they'd be close to 2^63 on average)... from this perspective, the promised future post about computing the k-th root of a perfect k-th power integer sounds very interesting!)


Thanks, that serves as good motivation to complete that post!

As far as the statement that

  log_2(5) = 6713115954038056572,
really that number is only an approximation; it is the last 64 bits of the “true” log_2(5), which has infinitely many bits. That is, the above equation in binary would be

  log_2(5) = 101110100101001110001110110010010000000011100100010011001111100.
In the true 2-adic world,

  log_2(5) = ...101110100101001110001110110010010000000011100100010011001111100,
where there are infinitely many leading bits.

(Classical positive integers embed into the 2-adic world as numbers with all but finitely many bits set to 0, and similarly classical negative integers have all but finitely many bits set to 1.)


> I don't know how gcc generates its approximate 2-adic reciprocals. Possibly it uses something based on the Euclidean GCD algorithm. I wasn't able to find the precise line of source in a reasonable time.

It's 0x1FFF...F / 7 = 0x249...249249249.

Just about the oddest thing I've seen; I guess, if I were the author, it'd be my "1 in 10000" day?


Makes more sense in binary:

  0xFFF..F 111111111111111111111111
  0x7      000000000000000000000111
  0x249..9 001001001001001001001001


I think you're trying to explain what's special about the number -1227133513 or how it comes about / how one could arrive at it. This is something the author understands very well; in fact that's what the whole post is about (well, it's about more advanced matters, using this as a "hook"), and many nice methods to arrive at that number are discussed in the post and comments. (And BTW your computation corresponds to 1227133513 or -1/7, not to -1227133513 or 1/7.) The sentence you quoted is merely asking which of the many discussed methods gcc is actually using, and this cannot be answered by mathematics; it needs to be answered by pointing to the relevant source code.


For a total n00b, could you explain why subtracting a char is a division?

“It doesn't seem like there could be much to say about that. The A structure is 7 bytes long so the subtraction implicitly divides by 7. That's about it.”


(Note that the subtraction is between two pointers to "struct A", not two chars. In the snippet "char" is used simply because it's 1 byte; one could also use int8_t, assuming a byte is 8 bits long.)

It's a fact about how the C language does pointer arithmetic.

For instance, suppose you have an array of many 64-bit integers, and a pointer to one of the elements of the array:

    int64_t arr[100];
    int64_t *p;
    p = &arr[50];
Now, the value of p is a certain address in memory. As each element of the array takes 8 bytes (64 bits = 8 bytes), the address of the next element of the array is 8 bytes past the location of p. In some languages (including assembly), you have to refer to this address by manually adding 8 to the address stored in p. In C, to refer to the next element of the array, you write "p + 1", and the compiler translates that into adding 8 bytes.

Conversely, if you give the new pointer a name:

    int64_t *q;
    q = p + 1;
then you expect (q - p) to be 1, not 8 (even though the value stored in q is 8 more than the value stored in p).

That is, when subtracting pointers (of the same type), the result implicitly divides by the size of the type. In the code example in the blog post, as the size of "struct A" is 7 bytes, subtracting two pointers to A will subtract the memory addresses and divide by 7.

If that wasn't clear, see this sentence on Wikipedia: https://en.wikipedia.org/w/index.php?title=Pointer_(computer... or the answers to this question: https://stackoverflow.com/questions/3238482/pointer-subtract... or play with it yourself here: https://godbolt.org/z/M6PEWYeYx


Ok, I’ve looked at this about five separate times and think that I finally get it. Thank you so much.


35 / 7 = 5

35-7 = 28, 1

28-7 = 21, 2

21-7 = 14, 3

14-7 = 07, 4

07-7 = 00, 5

thus we can see that in essence, division is repeated subtraction in some contexts, especially computationally.

An easy way to count the amount of X that fits into Y is to simply count the amount of times you can subtract X from Y before Y is a negative value


See also https://libdivide.com/ , when you need to repeatedly divide by a constant that isn't known until runtime.


Honestly if you're not familiar with the nitty gritty details of division, this sounds similar to those nodejs packages for "left padding a string"


The key here is "at runtime".


Yes. The funny part is that, as I said if you're not familiar with the nitty gritty details of division, you might think the library just exposes a function (a,b)->a/b, which works at runtime.


>I call the corresponding divisors ideal

Not a particularly good name, as “ideal” already has an established meaning in algebra: https://en.m.wikipedia.org/wiki/Ideal_(ring_theory)


Didn’t “ideal” have an established meaning in English before ring theory existed? Curious, why’d you pick that specialization of this word, out of all that are available? https://en.wikipedia.org/wiki/Ideal


When shown on a 7-segment display, the only numbers whose images don't contain a little square are 1, 2, 3, 4, 5, and 7. I call them "square-free numbers".

Oh, you're saying "square-free" has an established meaning in mathematics? https://en.wikipedia.org/wiki/Square-free_integer

Well, you see, "square" already had an established meaning in English before number theory existed. Curious, why’d you pick that specialization of this word, out of all that are available? :)


> “square” already had an established meaning

Wasn’t your straw man example ‘square-free’, not ‘square’...?


It wasn't a straw man :) "Square" and the "-free" (e.g. sugar-free) suffix both have established meanings in English.


Of course it was a straw man, but you know that. ;) So your point now is that you’ve changed your mind and conceded that narrow specific technical terms with uncommon meaning that few people know and contain commonly used words have no bearing on whether people should use those common words elsewhere? It sounds like you’ve withdrawn your objection to using “ideal” in the term “ideal divisors”? Great!


No, I did not change my mind, and it wasn’t a straw man :)

Unfortunately it is you who seem to be lacking the requisite knowledge to evaluate the merit of my argument…

Suffice to say, the various divisibility-related properties of integers are very closely related to ring theory. In the ring of integers (called Z), numbers of the form n*i for all i form an ideal (called nZ). As you can see, these are the numbers divisible by a given n.

Thus in the context of integers and divisibility, the term “ideal” is not some niche jargon, but in fact a crucial concept that anyone with a rudimentary understanding of number theory (which is very closely related to abstract algebra) has encountered.

Of course not all of us software engineers have encountered these concepts as part of our formal education, but I would expect some familiarity from the alumni of any semi-decent software engineering college or university.


It's not an ideal name, even though it's ideal.


This is computer science, not algebra.


algebra plays a huge role in CS, e.g. cryptography


Genuine question: why would any compiler bother to optimise division by such obtuse divisors, other than a purist's satisfaction?


There's no specific optimisation for these specific numbers. There's a more general optimisation to turn division by a fixed number into a multiplication and shift where possible (because division is a lot slower than multiplication), and there is another optimisation that eliminates useless shifts. Combined they yield this result for these specific numbers.


As an example of applicability, I used reciprocal multiplication and shifting, plus knowledge of how many bits of accuracy I didn't need, to get my old Kinect code for my former startup to run near 30fps instead of 2-3fps on hardware with no FPU and IIRC no divide instruction.


Yeah, a more detailed description of this general method seems missing from the article. Anyone care to provide it?



So basically calculating the multiplicative inverse of a number and storing it as something like a float (digits + exponent) with the digits and exponent stored separately to avoid overflow, underflow, accuracy issues.

And it's faster if and only if you're repeatedly dividing by a single number, since you're still doing the equivalent of a divide to get the inverse (but repeated division is actually quite common, so this is useful).


> And it's faster if and only if you're repeatedly dividing by a single number since you're still doing the equivalent of a divide to get the inverse

This is done at compile-time by the compiler. We're talking about division by a constant, not by a variable. So it's always faster.


Doing it at compile time is one way it is useful, and I think that's the most common case - that's why the tutorials show the resulting assembly instructions.

However, there are other ways the trick can be used. If the compiler (or the programmer) determines that the program will be doing repeated division by a single quantity (say, in a loop), you substitute a calculation of the multiplicative inverse of that quantity first, and then substitute multiplication by that inverse for the repeated divisions. So you can extend things beyond just division by a constant.


The book "Hacker's Delight" goes over a lot of these kinds of software algorithms for integers.


Division is orders of magnitude slower than multiplication.


Less than one order of magnitude these days. On Tiger Lake, integer multiply is 3 cycles, while integer division is 12 for 32-bit integers and 15 for 64-bit.

Of course, the multiply is fully pipelined while the division is only partially, but even that leaves only a 10-fold difference.

Division is much faster these days than it used to be. It's still worth it to have compiler optimizations for it, but it's not important to avoid it anymore.


> Less than one order of magnitude these days.

> only a 10-fold difference.

A ten-fold difference is literally one order of magnitude. Which is what the post you're replying to factually stated.


Latency is less than 10-fold. The difference only goes up to one order of magnitude if you pack tons of divisions back to back. And the post I was replying didn't state "an order of magnitude", it stated "orders of magnitude", which are different things.


orders usually means 2 or more


This depends on context. More often (in my experience) people are using base 10, but sometimes also 2 or e.

Even so, people (unfortunately) use "orders of magnitude" like "exponential growth", signifying something like "wow, a lot, but no idea how much".


And for astronomers an order of magnitude can sometimes be 100^(1/5) =~ 2.512.

I hate it, too, when people call a quadratic curve "exponential".


Also the actual values were 4-5x (12 or 15 versus 3 cycles), adjusted up to 10x for pipelining.


10X is one order of magnitude, 100X is two.


Indeed...and 4 to 5x is less than that! My clumsily-made point was that there were two sets of numbers in the GP, and the smaller is unambiguously less than one order of magnitude different.


Cortex-M4 (think $2) has 1 cycle multiplication and 2-12 cycle division (with early termination). If you go down to Cortex-M0 (think $0.50) then you get software division which is indeed orders of magnitude slower. Of course lower down the line there isn't even hardware multiplication.

Interestingly, the ARM11 in original RPi and RPi0 doesn't have hardware integer division (despite being 10x faster than typical M4 and typically having 10000x RAM), but it does have floating point division.


Embedded systems programmers don't always have such luxury!


Why do some people assume decimal ten as the base of the orders of magnitude when discussing binary arithmetic?


Timespan/cycle count is almost always denoted in base 10 in conversation.


They're often special cases of general algorithms. You can do many divisions as combinations of multiplications & shifts; in this case, the resulting code simplifies down to the smallest possible.


For floating point math is this why I see "/2" written as "*0.5" all the time, and is that really necessary and can't compilers figure that out?


They don't necessarily produce the same result and strict IEEE conformance prohibits such transformations. With GCC, --ffast-math enables optimizations like this.


In this specific example, they do produce exactly the same result:

`a / b` is completely equivalent to `a * (1 / b)` when both b and 1/b are perfectly representable in floating point without rounding, which is the case for 2 and 0.5


That makes sense. I typically have to worry about larger relative tolerances anyway so I think I can start using that trick.


In JavaScript (specifically node / V8) if dividing multiple values by the same divisor, is it fastest to calculate the reciprocal and then multiply it?

How about other languages (and compiler systems)?

Are there binary tricks worth knowing about to quickly calculate reciprocals?


It's typically faster to calculate a reciprocal (especially at compile time) and then multiply it, but this is often at the expense of precision. In C++, for example, the -O3 flag on gcc will still emit a division (divss) to preserve compliant floating point precision in this example: https://godbolt.org/z/PY3Wjno4P

However, when we enable `-ffast-math`, we permit the compiler to do a bit more aggressive optimization in this domain (trading off precision), and we see multiplication (mulss) is emitted: https://godbolt.org/z/KEb1nc43z


> Are there binary tricks worth knowing about to quickly calculate reciprocals?

Yes! You can calculate an approximate reciprocal with a single subtraction and some type-casting or bit manipulation.

The accuracy isn’t great (relative error of up to ~5%), but it might be good enough for many applications. The method is famous for being used to normalize 3d vectors in Quake https://en.wikipedia.org/wiki/Fast_inverse_square_root

This same method for calculating 1/x^2 can also be used to calculate 1/x by using a different magic constant and skipping the shift-right by one bit.


I'd like to find these numbers for base9 vs base10. See https://stackoverflow.com/questions/67226727/is-there-a-way-... If you look at that sample golang code, is that the right way to go about this? Or is there a much simpler way already built into a high-level language like Go?


Floating point, while convenient, does have substantial drawbacks in the implementations of circuits that we have today. Efficiency gains like this somewhat expose facets that are normally hidden. It will be interesting to see the transition to either analog NN or quantum computers, which inherently have better representations of continuous values.

Edit: Additionally the serial vs parallel processing paradigm shift is a main advantage of the switch too! Queued systems seem to have severe scheduling limitations, which the real-world does not always inherently have.


The linked article is about integer division and has nothing to do with floating point, and in fact I can't see how this idea could be applied to floating-point arithmetic. Can it?


Not directly, no. The underlying mathematics kind of can, but you'd need dedicated circuitry (or bit-hacking) for it because of the floating point parts, and it would fall over around subnormals.


I’m not certain, but while reading I was feeling like there potentially is a relationship. It was reminding me of the algorithm where you can approximate a reciprocal square root using a single subtraction from a magic number constant. https://en.wikipedia.org/wiki/Fast_inverse_square_root


I have been seeing this Wordpress theme a lot lately and it is so bad the way the content is pushed so far to the right. On my 27" monitor the left margin of the text is essentially the center of the display.


One thing that often works in this situation with chrome is the "Emulate CSS Media Type: Print" option from the rendering tab in Dev Tools. The theme appears to be the wordpress default theme from 2015, so you should see less of it as times goes on.


I have a 27" monitor also, and it looks fine to me.


How about if the browser window is maximized?

What I get in that case is 23 cm for the darkish blue left sidebar and 36 cm for the light blue right area that contains the article. Within that right area, the article is in 19 cm white rectangle with just under 2 cm margins.

So, starting from the left here is what I see:

15 cm empty dark blue background,

6 cm of sidebar text on dark blue background,

2 more cm of empty dark blue background,

2 cm of empty light blue background,

A tad under 2 cm of empty white background,

About 14.5 cm of text on white background,

A tad under 2 cm of empty white background,

15.5 cm of empty light blue background.

It really is quite ugly. It looks like something I would do because I completely suck at CSS.


I almost never¹ run a browser maximized on anything bigger than my 16" laptop screen (and even that not so much). Most of the time, I have browser windows that are 1/2 (on my 24" monitor) or 1/3 (on my ultrawide monitor) of the width and the full height (2/3 width on my laptop).

1 The exception being browser windows to run sites like Netflix where I'm really using it as a video interface.


27" monitor on Chrome here. Looks pretty far to the right.



