Improving Floating Point Accuracy: A Look at Sums

rjmunro · on Oct 19, 2015

More important than just increasing accuracy is knowing how much accuracy you have. With interval artihmetic, you keep an upper and lower bound of every calculation, and use the the appropriate floating point rounding modes when you manipulate the numbers.

You can then try different methods of calculating your result and measure which is the least destructive - compiler optimisations can optimise for width rather than runtime.

Some methods might produce a better upper bound, and others a better lower bound. In those cases you can take the intersection as your best result. See https://pypi.python.org/pypi/pyinterval for an implementation in python.

eru · on Oct 19, 2015

Don't common ways to implement interval arithmetic massively overestimate the inaccuracy?

rjmunro · on Oct 27, 2015

Compared to what?

You can improve accuracy by optimising the way you do your calculations. A really trivial example: 10000 + .00001 - 10000 is less accurate than 10000 - 10000 + .00001. Without interval arithmetic, you can't easily measure this and decide which way round is best.

acqq · on Oct 19, 2015

It just uses:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

And Wikipedia has also a discussion of how some compiler optimizations can optimize the algorithm implementation wrong.

edejong · on Oct 19, 2015

Indeed. My advice is to use pairwise summation [0], which is easier to implement, does not suffer from compiler optimizations and is, for normal means and purposes, sufficiently precise. Previously, I considered a small blog-post, but it would be an (incomplete) copy of a Wikipedia article.

[0] https://en.wikipedia.org/wiki/Pairwise_summation

alexSanchStern · on Oct 19, 2015

Pairwise summation is definitely a useful technique, depending on your situation. However, it does have the key limitation that you need to have all your data beforehand. There are quite a few cases in practice where you have streaming data, and Kahan summation handles this just fine. In addition, if your data is really unbalanced in it's magnitudes, pairwise summation can still lose a fair amount of precision compared to Kahan summation. There are definitely cases where you'd want to use one, and cases where you'd want to use the other.

madez · on Oct 19, 2015

The argumentation in the article is bad at some points. First, it praises the fast execution of pairwise addition with a large base case. Second, it praises pairwise addition as accurate with O(E) in the base case. Following this argumentation, making the base case as big as possible would be optimal. However, that would result in naive summation in practice. Thus, the argumentation is bad.

bazzargh · on Oct 19, 2015

It looks like a bad edit to the wiki.

https://en.wikipedia.org/w/index.php?title=Pairwise_summatio...

^ this user changed the error bounds, incorrectly. The worst case for the naive sum is pretty obviously epsilon*N (and is described that way on the Kahan sum page)

wiml · on Oct 19, 2015

I may be missing something, but since the error accumulation variable "c" is also of finite width, isn't this the same as simply computing your sums in double precision and then truncating? A pessimal input can still cause cancellation in both sum and c and leave you with an arbitrarily large error.

benjaminl · on Oct 19, 2015

That is what I was thinking the whole article. Seem like not only would that be simpler it would also be faster. I assume that there is a reason for employing this, but I couldn't find it. Looking at the paper A Comparison of Method For accurate Summation by McNamee[1], double precision beats Kahan summation, both in speed and reduction in errors.

Moving to double precision of course only holds if your variables were previously only single precision. If your variables are already double precision, I could see the reason for implementing more exotic methods.

[1] http://www.sigsam.org/bulletin/articles/147/sumnums.pdf

alexSanchStern · on Oct 19, 2015

Yes, that's completely true. But while there are still inputs for which terrible accuracy could result, for many of the inputs which would arise in practice, the Kahan summation technique is "good enough" to prevent loss of accuracy. The chances of getting an input for which catastrophic cancellation would result on the doubled precision are far overshadowed by the chances of getting inputs for which the normal summation would have unacceptable loss accuracy, but Kahan summation would not.

raymondh · on Oct 19, 2015

Even higher accuracy is possible by storing all partial sums. Here are several ways to achieve higher accuracy:

http://code.activestate.com/recipes/393090-binary-floating-p...

Also see Shewchuk's "Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates" at http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/r...

SagelyGuru · on Oct 19, 2015

Using rational numbers is another good idea how to combat/reduce floating point errors. For example, you can represent and calculate with 1/3 exactly, whereas in binary (floating point) you can not, no matter how many bits you use.

recursive · on Oct 19, 2015

In a case where it's likely to matter, you probably have a list of many numbers with diverse denominators. Certainly you can achieve mathematical precision with rationals (if you can get them in the first place) but the trade-off is ending up with really huge big-ints as numerator and denominator.

SagelyGuru · on Oct 19, 2015

Good point. To reduce this problem you can (and should?) factorise and cancel out common primes. Just keeping to a pair of 64 bit integers will give you a pretty good accuracy. You can perform all arithmetic directly on the rationals and represent reals by the nearest rational approximation (after all, a float is an approximation too in those cases, just a poorer one).

Stern-Brocot trees are a good systematic way of finding the best rational approximations and also for doing interval rational arithmetic.

mollmerx · on Oct 19, 2015

If the list is known in advance, what about a strategy of sorting the items by absolute value and then summing from smallest to largest?

Would this improve accuracy? If so, by how much?

dbaupp · on Oct 19, 2015

That will ensure that, say, 1e20 + 1 + 1 + ... + 1 (with 1e20 ones) is correctly computed as 2e20 (edit: this isn't actually true: it just gets a bit closer to 2e20, something like 1e20+1e16), but it can still result in catastrophic cancellation, e.g. [1, 1e20, -1e20] should sum to 1, but will give 0, despite already being sorted.

Something like the msum of [1] is needed to get a (provably[2]) exact summation.

[1]: http://code.activestate.com/recipes/393090-binary-floating-p... [2]: http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/r...

Devid2014 · on Oct 19, 2015

Using something else as float or double would decrease performance significantly. Using array sorting would not only be solve too but actually do not improve precision that much at all.

Using double to sum floats has comparable precession as Kahan summation. But of course it is not possible to do the same to summing doubles. Or if there is no proper double precision available at all like on GPU.

__mp · on Oct 19, 2015

In my case I'm more interested in the positions where floating point accuracy is lost: I'm currently working on a huge code base where we moved from double to single precision. If a unittest does not validate in single precision I want to have a tool telling me where and why.

acqq · on Oct 19, 2015

It seems to be a very wrong decision moving a "huge" base from double to single.

The sane approach is to move only that which must be moved, and there must be a valid reason for the change. Moving "huge" seems wrong by default, if the subject is numeric. It's not comparable with renaming your variables or methods or whatever you can do on the huge base and still keep the result the same.

dnautics · on Oct 19, 2015

That's not a trivial question to answer. I suggest reading about unums in John Gustafsson's "the end of error". It includes nice failures of floats.

Disclaimer: I'm working on an implementation

irascible · on Oct 19, 2015

Would doubling the size of the floats used in the computation achieve a similar result for the example presented?

acqq · on Oct 19, 2015

Absolutely, even if you sum floats, if you care about accuracy you should sum them using double. I've read the discussion of Julia implementers that have discussed Kahan summation but strangely didn't pick the simpler approach of not using floats for partial sums of floats. Albeit it was an old discussion and I don't know what's the present state.

alexSanchStern · on Oct 19, 2015

Even when summing using doubles, you can lose precision. In general, Kahan summation allows you to double the intermediary precision of your sums, so if you're losing precision even with 64-bit doubles, Kahan summation can give you 128-bits of intermediary precision, without going to software floating point solutions.

acqq · on Oct 19, 2015

That's true, however it's saner, safer and faster to simply use doubles to sum floats than just use floats and Kahan's summation, and surprisingly some people don't get that.

legulere · on Oct 19, 2015

The probably most easiest way to increase floating point accuracy is to use decimal floating point. The input data usually is in decimal and needs only a small part of the accuracy that a decimal float format offers. Using them later in calculations will make them use more accuracy, but the calculations will be exact for much longer.

When converting decimal to binary floating point numbers you will often use the full accuracy of the float format because the decimal floating point numbers can't be represented exactly in binary.

dbaupp · on Oct 19, 2015

Decimal floating point just changes how cancellation/rounding appears, it doesn't do anything to increase accuracy. E.g. suppose you're using a decimal float with 1 digit of precision (i.e. it can have 1.0, 1.1, ... 9.9 as the mantissa), and compute 100 + 1 - 100 (i.e. 1.0e2 + 1.0e0 + -1.0e2): 100 + 1 is 101, which rounds to 100 (1.0e2), and 100 + -100 is of course 0. Complete catastrophic cancellation.

In fact, binary floats use their storage better than decimal floats, so a binary float of a given size will have slightly higher accuracy than a decimal one of the same size. For example, there's 90 mantissae to encode in the example above, requiring a minimum of 7 bits, with a machine epsilon of 0.01. Using those 7 bits for a binary float gives an smaller epsilon: 1/128 ≈ 0.0078.

Of course, decimal floats do have the advantage of matching a common input/output format, and their rounding artifacts better match the intuition of us base-10 accustomed humans.

legulere · on Oct 19, 2015

> In fact, binary floats use their storage better than decimal floats, so a binary float of a given size will have slightly higher accuracy than a decimal one of the same size.

Not necessarily. The inefficiency is coped with by using less data for the exponent (which thus offers a smaller range). 64 bit decimal floats actually even have more data for the mantissa than their 64 bit binary counterpart.

> Decimal floating point just changes how cancellation/rounding appears, it doesn't do anything to increase accuracy. E.g. suppose you're using a decimal float with 1 digit of precision (i.e. it can have 1.0, 1.1, ... 9.9 as the mantissa), and compute 100 + 1 - 100 (i.e. 1.0e2 + 1.0e0 + -1.0e2): 100 + 1 is 101, which rounds to 100 (1.0e2), and 100 + -100 is of course 0. Complete catastrophic cancellation.

It also changes how often it occurs. With binary floats it already happens when you do 0.1 + 1.0 (I don't mean the rounding error of converting 0.1 to binary).

When you enter 1.0 + 0.1 - 1.0 e.g. in python you will get 0.10000000000000009 as a result which is not equal to what you get when you enter 0.1. With decimal floats this doesn't happen.

With every calculation you do the amount of precision needed can rise. With decimals converted to binary floats you use all of the precision right from the beginning. With decimal floats you can often stay easily within the precision offered by the data type.

k0htlane · on Oct 19, 2015

> The input data usually is in decimal and needs only a small part of the accuracy that a decimal float format offers.

I would actually argue that majority of the time data is not in decimal. For example in computational science, for which the Herbie seems to be aimed at, you very rarely have decimal numbers. While input parameters to computations might be a decimal numbers, everything else apart from initial conditions would be irrational numbers. A good example would be numerical solving of harmonic oscillator equation -- the initial conditions might very well be decimal, but the numerical solution is not (neither would be analytic solution).

> When converting decimal to binary floating point numbers you will often use the full accuracy of the float format because the decimal floating point numbers can't be represented exactly in binary.

Inability to exactly represent decimal numbers isn't really the problem in these cases. Summation of numbers with wildly varying magnitudes would be problematic for decimal floats as well.

legulere · on Oct 19, 2015

> I would actually argue that majority of the time data is not in decimal.

The majority of time data is not even floating point and for most use cases floats don't make sense there (you usually have a set precision you want to have and don't have varying orders of magnitude)

Almost all operating systems and programming languages use metric fractions of the second for time (https://en.wikipedia.org/wiki/System_time#Retrieving_system_...) the only common technology that I know that doesn't is NTP.

> in these cases

I was leaving a more general remark