A fairly common special case of printing integers is printing all integers from 0 to N. In that case you can use a trick: keep a string representation of the current number and increment that string directly, digit by digit.
Anyone who has played the old arcade games with fixed-width score counters will have seen this trick in action --- many of those ran on 8-bit processors without multiply or divide instructions, so it was much easier to keep the score as an array of digits and increment those than to keep a binary variable and run the equivalent of an itoa() every time it needed to be displayed.
More succinctly, in pseudocode:
    while (++digits[i] > '9')   // i starts at the rightmost digit
        digits[i--] = '0';      // wrap to '0' and carry into the digit to the left
By starting the increment at a digit other than the rightmost one, the same loop also gives you easy addition of powers of 10: 10, 100, 1000, etc.
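To make the trick concrete, here is a minimal, self-contained C sketch of such a string counter (the names and the rollover handling are my own, not lifted from any particular game):

    #include <stdio.h>
    #include <string.h>

    /* Increment a fixed-width decimal counter stored as ASCII digits,
       e.g. "00000099" -> "00000100".  `pos` picks the digit to bump:
       the rightmost for +1, one further left for +10, and so on. */
    static void counter_add(char *digits, int pos)
    {
        while (++digits[pos] > '9') {
            digits[pos] = '0';
            if (pos-- == 0)            /* rolled over past the counter's width */
                break;
        }
    }

    int main(void)
    {
        char score[] = "00000000";
        for (int i = 0; i < 1234; i++)
            counter_add(score, (int)strlen(score) - 1);   /* +1 each time */
        counter_add(score, (int)strlen(score) - 3);       /* +100 bonus */
        printf("%s\n", score);                            /* prints 00001334 */
        return 0;
    }

The carry loop touches more than one digit only once every ten increments on average, which is why this beats running a full itoa() on every update on those old CPUs.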
Related to this is binary coded decimal [1]: let each nibble (4 bits) of your integer represent a decimal digit 0-9. Then do your arithmetic in binary, but after each operation run a special CPU instruction to "correct" the result to what it should be in decimal.
There are two BCD formats: packed (what you describe, two digits per byte) and unpacked (where only a single digit is stored per byte).
Unpacked seems to have the advantage that converting to ASCII is just a matter of setting two bits, and converting from ASCII is just a matter of masking those out.
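In C terms (just to illustrate the point, not anyone's production code):

    #include <stdio.h>

    int main(void)
    {
        unsigned char digit = 7;               /* unpacked BCD: one digit per byte */
        unsigned char ch    = digit | 0x30;    /* set the two 0x30 bits -> ASCII '7' */
        unsigned char back  = ch & 0x0F;       /* mask them back out -> 7 */
        printf("%c %u\n", ch, back);           /* prints: 7 7 */
        return 0;
    }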
I found this example very illuminating:
    sub AH,AH    ; clear AH
    mov AL,'6'   ; AL := 36H
    add AL,'7'   ; AL := 36H + 37H = 6DH
    aaa          ; AX := 0103H (unpacked BCD for 13)
    or  AL,30H   ; AL := 33H = '3'
Because the aaa instruction clears the high nibble of AL, it works with either unpacked BCD or ASCII input (but it always outputs unpacked BCD).
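For anyone who hasn't met aaa before, it roughly does the following (a C sketch of the usual description, ignoring the exact flag behaviour):

    #include <stdint.h>

    /* Approximate model of x86 AAA: adjust AL after adding two unpacked
       BCD (or ASCII) digits, propagating the decimal carry into AH. */
    static void aaa_adjust(uint8_t *ah, uint8_t *al, int aux_carry)
    {
        if ((*al & 0x0F) > 9 || aux_carry) {
            *al += 6;      /* push the low nibble back into the 0-9 range */
            *ah += 1;      /* decimal carry into AH */
        }
        *al &= 0x0F;       /* the high nibble is always cleared */
    }

    int main(void)
    {
        uint8_t ah = 0, al = 0x6D;       /* state after add AL,'7' in the example */
        aaa_adjust(&ah, &al, 1);         /* AX becomes 0103H */
        return (ah == 0x01 && al == 0x03) ? 0 : 1;
    }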
Some CPUs, like the Saturn from Hewlett-Packard [1], are designed for BCD. Though they can perform binary operations, the registers are structured for 64-bit floating-point operations, with a 4-bit-wide bus and nibble-granularity addressing of register nibbles: 1 sign nibble, 12 mantissa nibbles, 3 exponent nibbles.
The Game Boy CPU kept track of nibble overflows (the half-carry flag) and had an instruction called DAA to adjust the packed BCD value in the A register after an arithmetic operation. If you disable this instruction in an emulator, many games start counting your score in hexadecimal (interestingly, Tetris even has sprites for A to F, so you actually get your full score displayed correctly).
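For the curious, emulators commonly implement DAA along these lines (a C sketch; the flag names are mine, and the subtraction half varies a bit between write-ups):

    #include <stdint.h>

    /* Fix up register A after an ADD/SUB so it again holds valid packed BCD.
       n, h and c are the subtract, half-carry and carry flags. */
    static uint8_t daa(uint8_t a, int n, int h, int *c)
    {
        if (!n) {                                         /* after an addition */
            if (*c || a > 0x99) { a += 0x60; *c = 1; }
            if (h  || (a & 0x0F) > 0x09) a += 0x06;
        } else {                                          /* after a subtraction */
            if (*c) a -= 0x60;
            if (h)  a -= 0x06;
        }
        return a;
    }

    int main(void)
    {
        int c = 0;
        /* 0x45 + 0x38 = 0x7D in binary (no carry, no half-carry);
           DAA turns it into 0x83, i.e. 45 + 38 = 83 in BCD. */
        return daa(0x45 + 0x38, 0, 0, &c) == 0x83 ? 0 : 1;
    }

Skip the adjustment and A keeps its raw binary value, which is exactly the hexadecimal-score behaviour described above.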
The Game Boy "had" this DAA instruction from its CPU, a variant of the Z80.
This trick may well have been used on some Amstrad CPC games too, then.
Indeed, the Z80 itself, which people like us know because it was very common in 8-bit home computers, is an enhancement of the Intel 8080, which is itself an ancestor of the current Intel CPUs.
Incidentally, Z80 CPUs are still manufactured 40+ years later ( https://en.wikipedia.org/wiki/Zilog_Z80 ) and benefit from modern C compilers like SDCC and software development environments like z88dk or the one I wrote for the Amstrad CPC: https://github.com/cpcitor/cpc-dev-tool-chain (written for Linux and tools like GNU make, wget, etc).
That said, writing C code that compiles down to packed BCD and DAA at the assembly level does not seem very practical. If only a few numbers are manipulated, unpacked BCD is simpler (closer to a string, less code to handle), both in assembly and in C.
This is a common case? Besides educational exercises, why? I would have guessed most printed integers are random ID numbers, prices, database foreign keys, etc.
> This is a common case? Besides educational exercises, why?
Sequential IDs permeate everything. Array indices. Memory offsets. Line numbers, track numbers, page numbers, check numbers... the dates of your daily logs, your calendars... heck, enumeration attacks can be as simple as "what happens when I increment this number by exactly 1 when making this GET request?"
While I know you meant that comment in jest, it suggests that I wasn't clear enough.
My implicit argument is that there's no reason to believe that line counting adds anything more than trivial overhead to the less output, so there's no reason to even consider this optimization, much less to get to the point of making a pull request.
FormatInt doesn't write out to a buffer and doesn't pad the output with zeros, so it's going to lose out nontrivially in actual use.
I have tested some FormatInt-like code that does do those things; if I recall the results correctly, the article's is slightly faster with unpredictable sizes and significantly faster with predictable sizes.
I'm not doing mod-10 operations, I'm printing left to right. I also put each digit in its own case: label, and the digit printing is part of a larger state machine transmitting various things. No characters are stored; one of these blocks was run whenever the UART was ready to receive a new character. I didn't want to do the entire conversion to string all at once, and this is a very fast way to extract digits in order.
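For readers who haven't seen it, left-to-right digit extraction looks roughly like this (a standalone sketch of the general idea, not the poster's state-machine code):

    #include <stdint.h>
    #include <stdio.h>

    /* Emit the decimal digits of n from most significant to least
       significant by dividing by descending powers of 10.  In a
       state-machine version each digit gets its own case and is only
       emitted when the UART is ready for another byte. */
    static void put_digits(uint32_t n)
    {
        uint32_t pow10 = 1000000000u;      /* largest power of 10 that fits in 32 bits */
        int started = 0;
        while (pow10 != 0) {
            uint32_t d = n / pow10;
            n -= d * pow10;
            if (d != 0 || started || pow10 == 1) {
                putchar((int)('0' + d));   /* stand-in for "write byte to UART" */
                started = 1;
            }
            pow10 /= 10;
        }
    }

    int main(void)
    {
        put_digits(1234);                  /* prints 1234 */
        putchar('\n');
        return 0;
    }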
Your “d = ((uint32_t)n * 53687 + 8208) >> 29;” is no faster than “d = n / 10000;” Modern compilers already apply these multiply-shift tricks when dividing by a constant.
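As an aside, that particular constant is only exact over a limited range, because the 32-bit multiply eventually overflows; a quick brute-force check (mine, purely illustrative) finds where:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* ((uint32_t)n * 53687 + 8208) >> 29 agrees with n / 10000 until
           n * 53687 + 8208 no longer fits in 32 bits. */
        for (uint32_t n = 0; n <= 100000; n++) {
            uint32_t fast = ((uint32_t)n * 53687u + 8208u) >> 29;
            if (fast != n / 10000u) {
                printf("first mismatch at n = %u\n", n);   /* 80000 */
                return 0;
            }
        }
        printf("no mismatch up to 100000\n");
        return 0;
    }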
I was writing for an ARM-based microcontroller. The compiler may well have been GCC. I should have timed it vs the obvious use of division. I was not aware that they would do such things for division by arbitrary large numbers. If that's all so, then why is TFA worried about the speed of itoa?
> I was not aware that they would do such things for division by arbitrary large numbers.
They don’t, it’s just that 10000 is not large enough :-) But yeah, apparently GCC for ARM does it for 32-bit division by 10k as well: https://godbolt.org/g/ebJqrw
> why is TFA worried about the speed of itoa?
Because itoa is slow.
I once needed to read / write large (100MB+) G-code files. The profiler showed it spent the majority of its time in standard library routines like itoa / atof / etc.
> What I generally wonder is, why ISAs don't have special instructions for a task that is so common.
ISAs used to have instructions to support binary coded decimals (BCD). BCD encoding stores values as 0x[0-9][0-9], and the instructions provide math operations on that representation. On x86, these still exist, for example AAD: https://c9x.me/x86/html/file_module_x86_id_2.html
Other ISAs (e.g. the 6502) had a processor flag to switch between regular binary and BCD arithmetic.
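To spell out the 0x[0-9][0-9] encoding mentioned above (a couple of throwaway C helpers of my own, nothing standard):

    #include <stdint.h>

    /* Packed BCD for 0..99: the value 42 is stored as the byte 0x42. */
    static uint8_t to_packed_bcd(uint8_t n)   { return (uint8_t)(((n / 10) << 4) | (n % 10)); }
    static uint8_t from_packed_bcd(uint8_t b) { return (uint8_t)((b >> 4) * 10 + (b & 0x0F)); }

    int main(void) { return from_packed_bcd(to_packed_bcd(42)) == 42 ? 0 : 1; }

The BCD adjust instructions then fix up the result of an ordinary binary add or subtract on such bytes so that it stays valid BCD, instead of your code having to unpack to binary and back.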
IBM System/360 and its descendants actually do have instructions to convert between decimal and binary representations (CVD, CVB), decimal arithmetic, and even instructions for formatting numbers (ED). I should also mention PACK and UNPK, which convert between character and packed decimal number representations.
Today, of course, it is less important to have those instructions because computers are used less to crunch numbers and more to do other stuff.
I think we agree, but I would say computers are used _more_ to crunch numbers. Inside the system, everything is binary; you only need decimal to present data to humans.
With old-time COBOL workloads, a lot of the work was presenting data to humans. The typical banking application, for instance, reads numbers, does a few additions, and shows them on a terminal or prints a report. Using fixed-point arithmetic was the norm.
Nowadays, we do way more floating-point arithmetic (for graphics, for machine learning).
Also, I think the reason those complex instructions aren’t created anymore is pipelining. I can’t find timing information for the ED instruction, but I think it is a safe bet that it isn’t a constant cycle count, let alone single-cycle: it not only writes a character buffer, it also has to read the bytes containing the pattern that defines how to format the number.
What you are saying is only partly true. Yes, the packed decimal instructions are horrible today from a performance perspective, because they require direct access to memory.
However, IBM has recently added decimal floating-point arithmetic to z/Architecture CPUs (and also made it part of the updated IEEE 754 standard). These modern instructions do not suffer from the problems of the old ones, and in fact should be preferred over packed decimal for applications that need decimal arithmetic.
And more broadly, IBM still adds a lot of specialized instructions to mainframe CPUs. For example, calls to cryptographic accelerators.
ISA design is a highly quantitative business. You take a bunch of application benchmarks, compile them with your proposed new instruction, and measure the speedup.
So the answer is one of:
1) people have, and decided against it.
2) people have dismissed it in an earlier phase, by observing that it doesn't make a big enough blip in CPU profiles to warrant further investigation.
Also, it's hard to speed up very much in hardware because it's a bunch of interdependent divides which current CPU divide instructions are already good at. If the hardware-acceleration interface is vector-style, it's doubtful that many applications could easily take advantage of it. This article talks about printf, which certainly can't.
Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?
> Also, it's hard to speed up very much in hardware because it's a bunch of interdependent divides
Most things which seem sequential are actually easily parallelized. This is one of them; all of the divides can be done in parallel, and since divisions are by constants it all falls into some fairly simple hardware.
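To see why there is no real dependency chain, note that every digit can be computed directly from n (a C sketch of the idea; actual hardware or SIMD code would look different, of course):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t n = 1234567890u;
        static const uint32_t pow10[10] = {
            1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
            10000u, 1000u, 100u, 10u, 1u
        };
        char buf[11];
        /* Each digit depends only on n, not on the other digits, so all ten
           divisions (really multiply-shifts, since the divisors are constants)
           could be evaluated in parallel. */
        for (int i = 0; i < 10; i++)
            buf[i] = (char)('0' + (n / pow10[i]) % 10);
        buf[10] = '\0';
        printf("%s\n", buf);   /* 1234567890 */
        return 0;
    }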
The real argument is more that it's a bit pointless to hardware-accelerate something that takes 10-20 ns in optimised software if it isn't used 100 billion times per day per core.
> Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?
I don't know if the situation has changed now, but a lot of the "big data" systems I worked with several years ago liked to use text formats (JSON, CSV, TSV, XML) for their input/output --- and I always thought it was a pretty stupid idea, since the data volumes were in the gigabytes and terabytes range.
...and more puzzlingly, those opcodes just became invalid, and weren't reused for something more useful in 64-bit mode. It reminds me of LAHF/SAHF and segmentation --- both features that AMD decided to remove initially for AMD64, but were forced to reintroduce after it was discovered that things like virtualisation needed them. Unfortunately, the same is unlikely to be true of the BCD instructions...
In this benchmark, the LUT and divide-and-conquer LUT versions seem really slow compared to Paul's new itoa.
His biggest trick is to preallocate the target buffer and to avoid all the heavy branching.
> It’s a stupid problem to have–just use a binary format
Not really. A safe, efficient and future-proof binary format is not a simple problem to solve. At the very least, you'll need something like variable-length (varint-style) encoded integers (sketched below).
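For concreteness, the usual variable-length scheme (7 data bits per byte plus a continuation bit, as in protobuf varints / LEB128) looks something like this; it's a generic sketch, not a drop-in encoder for any particular wire format:

    #include <stdint.h>
    #include <stddef.h>

    /* Encode an unsigned integer as a varint: low 7 bits per byte,
       high bit set on every byte except the last (at most 10 bytes
       for a 64-bit value).  Returns the number of bytes written. */
    static size_t varint_encode(uint64_t value, uint8_t *out)
    {
        size_t n = 0;
        while (value >= 0x80) {
            out[n++] = (uint8_t)((value & 0x7F) | 0x80);
            value >>= 7;
        }
        out[n++] = (uint8_t)value;
        return n;
    }

    static size_t varint_decode(const uint8_t *in, uint64_t *value)
    {
        uint64_t v = 0;
        size_t n = 0;
        unsigned shift = 0;
        do {
            v |= (uint64_t)(in[n] & 0x7F) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        *value = v;
        return n;
    }

    int main(void)
    {
        uint8_t buf[10];
        uint64_t back = 0;
        size_t len = varint_encode(300, buf);   /* encodes as AC 02 */
        varint_decode(buf, &back);              /* back == 300 */
        return (len == 2 && back == 300) ? 0 : 1;
    }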
At this point you might as well go with base 10 and have human readability as a side bonus.
P.S. "Just use protobuf" is not a solution either. It's only an option because you're using a pre-built library solution, and there's no apriori reason to pick protobuf over some other pre-built solution.
> P.S. "Just use protobuf" is not a solution either.
It's one solution, which makes it a solution. It also happens to be a good solution for 95%+ of data-interchange use-cases.
JSON is a genuinely terrible approach to interchange for most anything that you expect to grow to be nontrivial. Whatever your thoughts on static vs. dynamic typing for languages, schemaless interchange formats are absolute madness --- an analogue to what people are rapidly learning about schemaless databases. "Schemaless" formats don't actually mean there isn't a schema. You have one; it's just informal, and you lack any useful tools to manipulate it or change it in the future.
What you gain in early development speed is lost many times over by not actually stopping to think about how your data will be modeled beforehand, and it will be lost many times over again when you need to change your informal parsing logic (e.g., random hash accesses) that's spread across dozens of unrelated areas of your code and impossible to locate. As an added bonus, you often end up having to indefinitely support every buggy, half-baked version of the format going back to the beginning of time.
JSON is a terrible format for data storage. But for data exchange it is OK. The problem with schemaless storage is that as code evolves it is way too easy to forget to cover the needs of already-stored but not presently accessed data. With communication this is much less of a problem, as serialization and parsing must evolve together with the code.
You have this exact same issue with JSON for data interchange, unless there are exactly one producer and one consumer and you can deploy both simultaneously.
If you have more than one producer, more than one consumer, or you don't have the control to update all participants simultaneously, you are setting yourself up for pain.
> At this point you might as well go with base 10 and have human readability as a side bonus.
How often does a human need to "process" the data, vs. a computer? I realise the environment is a little different with today's HLL programmers, but when you do need to, it's not as if reading hexdumps is all that hard either (unless it's something like ASN.1 PER, in which case you'll likely be using a tool to assist anyway.)
I've worked with many systems using custom binary protocols, and everyone on those teams pretty much knew how to read and write them directly from the hexdump. From that perspective, I'd say text-based formats are an unnecessary overhead for all use cases except those where (non-developer) humans are expected to manipulate the format directly and often. Another memorable quote I remember from a coworker when we had a similar discussion (long ago): "text can be ASCII or EBCDIC or whatever other crazy character set someone decides to use. A bit is a bit. 1 and 0 are unambiguous."
ASN.1 is also rather complex, and the most widely used encoding rules have been shown many times to be rather difficult to implement correctly and securely [1]. X.696 (OER) is said to be better, but it has only been around for a few years (!).
[1] A property shared by many IT industry standards.
> Not really. A safe, efficient and future-proof binary format is not a simple problem to solve.
Neither is a safe, efficient, and future-proof text format. If you use an existing text serialization, you now have effectively two parsers instead of one.