How to Print Integers Really Fast (pvk.ca)
169 points by signa11 on Dec 24, 2017 | 46 comments



A fairly common special case of printing integers is printing all integers from 0 to N. In that case you can use a trick: Keep a string representation of the current number and increment the string representation.


Anyone who has played the old arcade games with fixed-width score counters will have seen this trick in action --- many of those ran on 8-bit processors without multiply or divide instructions, so it was much easier to keep the score as an array of digits and increment those than to keep a binary variable and run the equivalent of itoa() every time the score needed to be displayed.

More succinctly, in pseudocode:

    while(++digits[i] > '9')
        digits[i--] = '0';
By starting the increment at a digit other than the rightmost, the same loop also allows for easy addition of powers of 10: 10, 100, 1000, etc.
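
A minimal C sketch of the trick, keeping the counter zero-padded (leading-zero suppression left out):

    #include <stdio.h>
    #include <string.h>

    /* Print 0..n_max by incrementing a fixed-width decimal string in place,
       so no division or multiplication is needed per number. */
    static void print_counter(unsigned long n_max)
    {
        char digits[21];                        /* 20 digits covers 64-bit values */
        memset(digits, '0', sizeof digits - 1);
        digits[sizeof digits - 1] = '\0';

        for (unsigned long n = 0; ; n++) {
            puts(digits);                       /* zero-padded output */
            if (n == n_max)
                break;
            int i = (int)sizeof digits - 2;     /* rightmost digit */
            while (++digits[i] > '9')
                digits[i--] = '0';              /* carry into the next column */
        }
    }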


Related to this is binary coded decimal [1]: let each nibble (4 bits) of your integer represent a decimal digit 0-9. Then do your arithmetic in binary, but after each operation run a special CPU instruction to "correct" the result to what it should be in decimal.

[1] https://en.wikipedia.org/wiki/Binary-coded_decimal


There are two BCD formats. Packed (what you describe) and unpacked (where only a single digit is stored per byte). Unpacked seems to have the advantage that converting to ASCII is just a matter of setting two bits, and converting from ASCII is just a matter of masking those out.
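
In C terms, the set-two-bits / mask-them-out conversions are just (a trivial sketch):

    #include <stdint.h>

    /* Unpacked BCD <-> ASCII: each byte holds one digit 0-9, so conversion
       is only a matter of setting or clearing the 0x30 bits. */
    static inline char bcd_to_ascii(uint8_t digit) { return (char)(digit | 0x30); }
    static inline uint8_t ascii_to_bcd(char c)     { return (uint8_t)(c & 0x0F); }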

I found this example very illuminating:

    sub    AH,AH    ; clear AH
    mov    AL,'6'   ; AL := 36H
    add    AL,'7'   ; AL := 36H+37H = 6DH
    aaa             ; AX := 0103H
    or     AL,30H   ; AL := 33H
Because the aaa instruction clears the high nibble of AL, it works with either unpacked BCD or ASCII input (but always outputs BCD).

http://service.scs.carleton.ca/sivarama/asm_book_web/Instruc... (page 8)

But IIUC, aaa didn't make it in the transition to 64-bit processors.


Some CPUs, like the Saturn from Hewlett-Packard [1], are designed for BCD. Though they can perform binary operations, the registers are structured for 64-bit BCD floating-point values (1 sign nibble, 12 mantissa nibbles, 3 exponent nibbles), with a 4-bit-wide bus and nibble-granularity addressing of register nibbles.

[1] https://en.wikipedia.org/wiki/HP_Saturn


The Game Boy CPU kept track of nibble overflows and had an instruction called DAA to adjust the packed BCD value in the A register after an arithmetic operation. If you disable this instruction in an emulator, many games start counting your score in hexadecimal (interestingly, Tetris even has sprites for A to F, so your full score still gets displayed correctly).


The Game Boy "had" this DAA instruction from its CPU, a variant of the Z80. This trick may well have been used also on some Amstrad CPC games also, then.

Indeed, the Z80 itself, which people like us know because it was very common in 8bit home computers, is an enhancement of the Intel 8080, which itself is an ancestor of the current Intel CPUs.

Incidentally, Z80 CPUs are still manufactured 40+ years later ( https://en.wikipedia.org/wiki/Zilog_Z80 ) and benefit from modern C compilers like SDCC and software development environments like z88dk or the one I wrote for the Amstrad CPC: https://github.com/cpcitor/cpc-dev-tool-chain (written for Linux and tools like GNU make, wget, etc).

That said, writing C code that compiles down to packed BCD and DAA at the assembly level does not seem very practical. If few numbers are manipulated, unpacked BCD is simpler (closer to a string, less code to handle), both in assembly and in C.


GNU seq uses this trick:

slow path:

    seq -f '%.1f' inf | pv > /dev/null
     ...[  12MiB/s]
fast path:

    seq inf | pv > /dev/null
     ...[ 491MiB/s]


Can you explain that a bit more? I know "seq", but am unsure what your examples are illustrating. Thanks!


-f specifies a format, so it can't use the specialized itoa implementation, but probably just uses printf.


This is a common case? Besides educational exercises, why? I would have guessed most printed integers are random ID numbers, prices, database foreign keys, etc.


> This is a common case? Besides educational exercises, why?

Sequential IDs permeate everything. Array indices. Memory offsets. Line numbers, track numbers, page numbers, check numbers... the dates of your daily logs, your calendars... heck, enumeration attacks can be as simple as "what happens when I increment this number by exactly 1 when making this GET request?"


Line numbers in less(1) or GitHub. Row numbers in Excel, Pandas, R, etc.

The most common way to label rows is with sequential numbers.


FWIW, in a file with 1,000,000 lines, the best-of-3 time for "less filename > /dev/null" is:

  2.508u 0.152s 0:02.66 99.6%	0+0k 0+0io 0pf+0w
and the best-of-3 time for "less -N filename > /dev/null" is:

  2.568u 0.159s 0:02.73 99.2%	0+0k 0+0io 0pf+0w
That is, it doesn't seem like printing sequential line numbers is the limiting factor in performance.

This is with "less 458", "Copyright (C) 1984-2012". I downloaded and compiled stock 487 and the best-of-3 times went up to 0:02.94 for both cases.

Checking the source code, it does not appear to use knowledge of the previous output index in order to save time. The relevant code is:

  static int
  iprint_linenum(num)
        LINENUM num;
  {
        char buf[INT_STRLEN_BOUND(num)];

        linenumtoa(num, buf);
        putstr(buf);
        return ((int) strlen(buf));
  }
where

  #define TYPE_TO_A_FUNC(funcname, type) \
  void funcname(num, buf) \
          type num; \
          char *buf; \
  { \
          int neg = (num < 0); \
          char tbuf[INT_STRLEN_BOUND(num)+2]; \
          register char *s = tbuf + sizeof(tbuf); \
          if (neg) num = -num; \
          *--s = '\0'; \
          do { \
                  *--s = (num % 10) + '0'; \
          } while ((num /= 10) != 0); \
          if (neg) *--s = '-'; \
          strcpy(buf, s); \
  }
  
  TYPE_TO_A_FUNC(linenumtoa, LINENUM)


hmm, maybe someone can put together a pull request implementing this optimization :)


While I know you meant that comment in jest, it suggests that I wasn't clear enough.

My implicit argument is that there's no reason to believe that line counting adds anything more than trivial overhead to the less output, so there's no reason to even consider this optimization, much less get to the point of making a pull request.


This is so common that there are dedicated ICs to count in base 10.


Not long ago I benchmarked several implementations. The fastest itoa I found was the fmt::FormatInt class from https://github.com/fmtlib/fmt

Based on the article, I think fmtlib will be faster than what they did here.


FormatInt doesn't write out to a buffer and doesn't pad the output with zeros, so it's going to lose out nontrivially in actual use.

I have tested some FormatInt-like code that does do those things; if I recall the results correctly, the article's is slightly faster with unpredictable sizes and significantly faster with predictable sizes.


Division is slow. I recently needed to send 16-bit integers over a UART and came up with something like this:

    uint16_t n, d;

    // 10K digit
    d = ((uint32_t)n * 53687 + 8208) >> 29;
    n -= d*10000;
    // output d+'0'

    // 1K digit
    d = ((uint32_t)n * 2097 + 1368) >> 21;
    n -= d*1000;
    // output d+'0'

    // 100s digit
    d = (n * 41) >> 12;
    n -= d*100;
    // output d+'0'

    // 10s digit
    d = (n * 51 + 40) >> 9;
    n -= d*10;
    // output d+'0'
    // output n+'0'
It does scale to 32 bits but will require 64-bit intermediates.
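
As a sanity-check sketch (not part of the original comment, assuming a hosted C environment), the multiply-shift digits can be compared against ordinary formatting for every 16-bit value:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Brute-force check of the multiply-shift digit extraction above
       against printf over every possible 16-bit input. */
    int main(void)
    {
        for (uint32_t i = 0; i <= 0xFFFF; i++) {
            uint16_t n = (uint16_t)i, d;
            char buf[6], ref[6];

            d = ((uint32_t)n * 53687 + 8208) >> 29; buf[0] = '0' + d; n -= d * 10000;
            d = ((uint32_t)n * 2097 + 1368) >> 21;  buf[1] = '0' + d; n -= d * 1000;
            d = (n * 41) >> 12;                     buf[2] = '0' + d; n -= d * 100;
            d = (n * 51 + 40) >> 9;                 buf[3] = '0' + d; n -= d * 10;
            buf[4] = '0' + n;
            buf[5] = '\0';

            snprintf(ref, sizeof ref, "%05u", (unsigned)i);
            assert(strcmp(buf, ref) == 0);
        }
        puts("all 65536 values match");
        return 0;
    }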


You’re probably using some exotic compiler.

Most C compilers already do something very similar, check this out: https://godbolt.org/g/weicFj

Obviously, "x / 10" is much easier to read and support, compared to these manual optimizations.


I'm not doing mod 10 operations, I'm printing left to right. I also put each digit in a case: and the digit printing is part of a larger state machine transmitting various things. No characters are stored, one of these blocks was run whenever the UART was ready to receive a new character. I didn't want to do the entire conversion to string all at once and this is a very fast way to extract digits in order.


Your “d = ((uint32_t)n * 53687 + 8208) >> 29;” is no faster than “d = n / 10000;”. Modern compilers already apply these multiply-shift tricks when dividing by a constant.
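
For illustration, the strength-reduced form a compiler typically emits for an unsigned 32-bit x / 10, written back out as C (the exact constant and shift depend on the type and target):

    #include <stdint.h>

    /* What "x / 10" usually compiles to for a 32-bit unsigned x:
       multiply by a precomputed reciprocal and shift, no divide issued.
       0xCCCCCCCD is 2^35 / 10, rounded up. */
    static inline uint32_t div10(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }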


I was writing for an ARM based micro controller. The compiler may well have been GCC. I should have timed it vs the obvious use of division. I was not aware that they would do such things for division by arbitrary large numbers. If that's all so, then why is TFA worried about the speed of itoa?


> I was not aware that they would do such things for division by arbitrary large numbers.

They don’t, just 10000 is not large enough :-) But yeah, apparently 32-bit 10k division works on GCC ARM as well: https://godbolt.org/g/ebJqrw

> why is TFA worried about the speed of itoa?

Because itoa is slow.

I once needed to read / write large (100MB+) G-code files. The profiler showed it spent the majority of its time in standard library routines like itoa / atof / etc.

Replacing itoa with http://fmtlib.net led to much faster code.


Would be interesting to see comparisons to the algorithms in that benchmark: https://github.com/miloyip/itoa-benchmark

What I generally wonder is, why ISAs don't have special instructions for a task that is so common.


> What I generally wonder is, why ISAs don't have special instructions for a task that is so common.

ISAs used to have instructions to support binary coded decimals (BCD). BCD encoding stores values as 0x[0-9][0-9], and the instructions provide math operations on that representation. On x86, these still exist, for example AAD: https://c9x.me/x86/html/file_module_x86_id_2.html

Other ISAs (e.g. the 6502) had a processor flag to switch between regular and BCD mode.
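
As a rough C model (for illustration only, assuming both inputs are valid packed BCD), the digit correction that a DAA-type instruction performs after a packed-BCD add looks like this:

    #include <stdint.h>

    /* Add two packed-BCD bytes (two digits each) and "correct" the result,
       roughly what DAA does in hardware after a binary ADD. */
    static uint8_t bcd_add(uint8_t a, uint8_t b, uint8_t *carry)
    {
        unsigned sum = a + b;
        if ((a & 0x0F) + (b & 0x0F) > 9)
            sum += 0x06;            /* fix the low digit */
        if (sum > 0x99)
            sum += 0x60;            /* fix the high digit */
        *carry = sum > 0xFF;        /* decimal carry out */
        return (uint8_t)sum;
    }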


> why ISAs don't have special instructions

IBM System/360 and its descendants actually do have instructions to convert between decimal and binary representations (CVD, CVB), decimal arithmetic, and even instructions for formatting numbers (ED). I should also mention PACK and UNPK, which convert between character and packed decimal number representations.

Today, of course, it is less important to have those instructions because computers are used less to crunch numbers and more to do other stuff.


I think we agree, but I would say computers are used _more_ to crunch numbers. Inside the system, everything is binary; you only need decimal to present data to humans.

With old-time COBOL workloads, a lot of the work was presenting data to humans. The typical banking application, for instance, reads numbers, does a few additions, and shows them on a terminal or prints a report. Using fixed-point arithmetic was normal.

Nowadays, we do way more floating-point arithmetic (for graphics, for machine learning).

Also, I think the reason those complex instructions aren’t created anymore is pipelining. I can’t find timing information for the ED instruction, but I think it is a safe bet it isn’t constant cycle count, let alone single-cycle. It not only writes a character buffer, it also has to read the bytes containing the pattern defining how to format the number.


What you are saying is only partly true. Yes, the packed decimal instructions are horrible today from a performance perspective, because they require direct access to memory.

However, IBM has recently added decimal floating-point arithmetic to z/Architecture CPUs (and also made it part of the updated IEEE 754 standard). These modern instructions do not suffer from the problems of the old instructions, and in fact should be preferred to packed decimal for applications that need decimal arithmetic.

And more broadly, IBM still adds a lot of specialized instructions to mainframe CPUs. For example, calls to cryptographic accelerators.


ISA design is a highly quantitative business. You take a bunch of application benchmarks, compile them with your proposed new instruction, and measure the speedup.

So the answer is one of:

1) people have, and decided against it.

2) people have dismissed it in an earlier phase, by observing that it doesn't make a big enough blip in CPU profiles to warrant further investigation.

Also, it's hard to speed up very much in hardware, because it's a bunch of interdependent divides, which current CPU divide instructions are already good at. If the hardware acceleration interface is vector-style, it's doubtful whether many applications could easily take advantage of it. This article talks about printf, which certainly can't.

Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?


> Also, it's hard to speed up very much in hardware because it's a bunch of interdependent divides

Most things which seem sequential are actually easily parallelized. This is one of them; all of the divides can be done in parallel, and since divisions are by constants it all falls into some fairly simple hardware.
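
A sketch of that data flow: each digit depends only on the input value, so nothing forces a serial chain of divides (and each constant division can itself become a multiply-shift):

    #include <stdint.h>

    /* Each decimal digit of a 32-bit value depends only on n, so all ten
       can be computed independently -- in hardware, in parallel. Output is
       zero-padded, most significant digit first. */
    static void digits_parallel(uint32_t n, char out[10])
    {
        static const uint32_t pow10[10] = {
            1, 10, 100, 1000, 10000, 100000,
            1000000, 10000000, 100000000, 1000000000
        };
        for (int k = 0; k < 10; k++)          /* no loop-carried dependency */
            out[9 - k] = '0' + (char)((n / pow10[k]) % 10);
    }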

The real argument is more that it's a bit pointless to hardware accelerate something that takes 10-20ns in optimised software if it's not used 100 billion times per day per core.


> Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?

I don't know if the situation has changed now, but a lot of the "big data" systems I worked with several years ago liked to use text formats (JSON, CSV, TSV, XML) for their input/output --- and I always thought it was a pretty stupid idea, since the data volumes were in the gigabytes and terabytes range.


They did have special opcodes in x86 but dropped them in x64:

https://en.m.wikipedia.org/wiki/Intel_BCD_opcode


...and more puzzlingly, those opcodes just became invalid, and weren't reused for something more useful in 64-bit mode. It reminds me of LAHF/SAHF and segmentation --- both features that AMD decided to remove initially for AMD64, but were forced to reintroduce after it was discovered that things like virtualisation needed them. Unfortunately, the same is unlikely to be true of the BCD instructions...


The LUT and divide&conquer LUT in this benchmark seem really slow compared to Paul's new itoa. His biggest trick is to precompute the target buffer size and to avoid all this heavy branching.


> It’s a stupid problem to have–just use a binary format

Not really. A safe, efficient and future-proof binary format is not a simple problem to solve. At the very least, you'll need something like base-127 encoded digits.
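
For concreteness, a minimal sketch of the base-128 "varint" style encoding in the spirit of what's described here (7 data bits per byte plus a continuation bit; a 64-bit value takes at most 10 bytes):

    #include <stddef.h>
    #include <stdint.h>

    /* Variable-length integer encoding: low 7 bits per byte, high bit set
       on every byte except the last. Returns the number of bytes written. */
    static size_t encode_varint(uint64_t value, uint8_t *out)
    {
        size_t i = 0;
        while (value >= 0x80) {
            out[i++] = (uint8_t)(value & 0x7F) | 0x80;  /* continuation bit */
            value >>= 7;
        }
        out[i++] = (uint8_t)value;                      /* final byte */
        return i;
    }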

At this point you might as well go with base 10 and have human readability as a side bonus.

P.S. "Just use protobuf" is not a solution either. It's only an option because you're using a pre-built library solution, and there's no apriori reason to pick protobuf over some other pre-built solution.


> P.S. "Just use protobuf" is not a solution either.

It's one solution, which makes it a solution. It also happens to be a good solution for 95%+ of data-interchange use-cases.

JSON is a genuinely terrible approach to interchange for most anything that you expect to grow to be nontrivial. Whatever your thoughts on static vs. dynamic typing for languages, schemaless interchange formats are absolute madness — an analogue to what people are rapidly learning about schemaless databases. "Schemaless" formats don't actually mean there isn't a schema. You have one, it's just informal and you lack any useful tools to manipulate it or make changes in the future.

What you gain in early development speed is lost many times over by not actually stopping to think about how your data will be modeled beforehand, and it will be lost many times over again when you need to change your informal parsing logic (e.g., random hash accesses) that's spread across dozens of unrelated areas of your code and impossible to locate. As an added bonus, you often end up having to indefinitely support every buggy, half-baked version of this format going backward to the beginning of time.


JSON is a terrible format for data storage, but for data exchange it is OK. The problem with schemaless storage is that as code evolves it is way too easy to forget to cover the needs of data that is already stored but not presently accessed. With communication this is much less of a problem, as serialization and parsing must evolve together with the code.


You have this exact same issue with JSON for data interchange, unless there are exactly one producer and one consumer and you can deploy both simultaneously.

If you have more than one producer, more than one consumer, or you don't have the control to update all participants simultaneously, you are setting yourself up for pain.


> At this point you might as well go with base 10 and have human readability as a side bonus.

How often does a human need to "process" the data, vs. a computer? I realise the environment is a little different with today's HLL programmers, but when you do need to, it's not as if reading hexdumps is all that hard either (unless it's something like ASN.1 PER, in which case you'll likely be using a tool to assist anyway.)

I've worked with many systems using custom binary protocols, and everyone on those teams pretty much knew how to read and write them directly from the hexdump. From that perspective, I'd say text-based formats are an unnecessary overhead for all use cases except those where (non-developer) humans are expected to manipulate the format directly and often. Another memorable quote from a coworker when we had a similar discussion (long ago): "text can be ASCII or EBCDIC or whatever other crazy character set someone decides to use. A bit is a bit. 1 and 0 are unambiguous."


ASN.1 was invented for this 30 years ago. Things like protobuffers are just re-inventing it. Arbitrary-sized numbers were in there from the start.


ASN.1 is also rather complex, and the most widely used encoding rules have been shown many times to be rather difficult to implement correctly and securely [1]. X.696 (OER) is said to be better, but it has only been around for a few years (!).

[1] A property shared by many IT industry standards.


> Not really. A safe, efficient and future-proof binary format is not a simple problem to solve.

Neither is a safe, efficient, and future-proof text format. If you use an existing text serialization, you now have effectively two parsers instead of one.


> I find Protobuf does very well as a better JSON

I assume this was for their ad bidding? I recall they sent JSON with a ridiculous number of integers for the arrays representing audience segments.


I've recently seen another version of itoa, called itoa_jeaiii, with good performance claims:

https://twitter.com/Atrix256/status/938994058348847106

Code at https://github.com/jeaiii/itoa/tree/master/itoa



