His point is that instead of byte swapping input, we should always use single-byte load operations because "it works for him."
But Plan9 is not a system known for its graphics, and I think performance would seriously suffer if everyone had to program like that. Being able to load a pixel as an int is the reason 32-bit RGB is used more often as a pixel format than 24-bit.
Of course it might not matter as much these days, since GCC and LLVM can optimize his code sequences into bswap instructions automatically. And SIMD/shader code doesn't have any endian portability problems that I know of, if only because SIMD is already not portable.
GCC doesn't merge the 4 byte-level reads into one 32-bit read. Thus, it does cause some performance penalty. The true impact is probably quite low, but it does exist on x86.
It is true, however, that GCC will take a series of bit ops and produce a 'bswap' instruction on x86, but that requires a full 32-bit word to start with.
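For example, given a whole word to start from, the usual shift-and-mask swap is the sort of pattern recent GCC/Clang will typically fold into a single bswap (a sketch; exact codegen depends on compiler version and flags):

    #include <stdint.h>

    /* Starting from a full 32-bit word, this is the pattern compilers recognize. */
    static uint32_t swap_word(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000ff00u)
             | ((x << 8) & 0x00ff0000u) | (x << 24);
    }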
On modern CPUs, a byte load/store is really an integer (i.e. 32-bit/64-bit depending on arch) load/store that is rigged to only affect the target byte. On IA64 and PPC, it would just SIGBUS out (as it probably should on x86/amd64 too, but they kept it for compat reasons)
AFAIK ARM processors don't support misaligned word access.
AFAIK misaligned word access is twice as slow as aligned word access (it requires 2 reads), so I don't understand "offers it for free". But this is still twice as fast as the example code. Note that endianness and word alignment are two distinct problems.
The point made by the author addresses this issue from a different angle.
As the author says, programmers should always write endianness-neutral code unless it is impossible, which is generally at the interfaces where data is read and written (I/O) by the program. If the code is correctly and intelligently structured so that marshaling is done once, then byte swapping can generally be expected to be a low-frequency operation. In that case the simplest and most portable code should be favored.
Trying to optimize this operation with word reads and byte swapping provides an insignificant speedup at a higher cost in code portability and maintainability. The author is right on this.
Though it is also true that in some cases the operation frequency is very high (e.g. reading millions of pixel values from an image). For those use cases, the programming overhead of using highly optimized code is perfectly justified. But then don't use half-baked optimizations: align data on word boundaries (twice as fast), read by word (four times as fast) and use the byte-swapping machine instructions available on the target CPU instead of the proposed shifts and bit masks.
My opinion is that good languages should provide optimized data marshaling functions in their library so that the code can be optimal and portable at the same time.
ARM supports unaligned memory accesses since v6. In most modern implementations, unaligned accesses falling entirely within a 16-byte aligned block have no penalty at all, while crossing 16-byte boundaries does impose a cost. If the locations of unaligned accesses are randomly distributed, this cost is still cheaper on average than accessing a byte at a time.
I agree with his suggestion that most code manipulating byte order is incorrect or unneeded. Within your application's logic, everything should already be uniform.
I agree slightly less with "computer's byte order doesn't matter", as differentiated from peripheral and data stream byte order... that's the same friggin thing. It matters that you treat your inputs and outputs correctly, and how you do that depends on your computer's byte order, so the computer's byte order does matter. Just not so much during the data processing stage.
But mostly, I'm just saddened that every post about C now has a "only people who do [X thing that requires C] do that, and you're probably not one of them, so you shouldn't do that!" Maybe there's just a huge disconnect between people-who-blog and people-who-write-low-level-code, but most of the software guys I know have worked professionally on microcontrollers, DSPs, operating systems, or compilers within the last 5 years, and I'm working on a compiler for a DSP right now (and I expect byte order to matter).
Also, it's been 5 years since I had to deal with it, but I remember endian mattered for some image file formats, and also for blitting to the screen for Mac vs PC.
No offense brother, but Rob Pike shits all over you and any of your "software guys". The man is a living legend. This is not to say that he can't be wrong, but to call him just a "person who blogs" only shows how little you know.
How does that make it any better? If the name on the blog is what makes you think a post is shit or gold, then you are probably not a very good critical thinker.
I wish there were, in C, some equivalent of "struct" but where you could specify
- The byte order / endianness
- The alignment of variables
- The exact width of variables (32-bits, 64-bits)
"struct" is great for what it was designed for, storing your internal data structures in a way efficient for the machine.
But everyone abuses structs and tries to read external data sources, e.g. files, using them. They might hack it to work on their own machine; then, as soon as a machine of the other endianness comes along, hacks and #ifdefs appear, and then machines with ints of different widths come along....
Of course these people are using structs "wrong", like the author of the article suggests. But nevertheless, the fact that people are using structs "wrong" suggests there is a need for something that provides what people are trying to use structs for.
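To make the abuse concrete, here's a sketch with a hypothetical file header (all names are illustrative); the fread-into-struct version quietly depends on the machine's endianness, padding and integer widths, while the byte-wise version does not:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical file header: 4-byte magic plus a little-endian 32-bit length. */
    struct header {
        uint8_t  magic[4];
        uint32_t length;    /* stored little-endian in the file */
    };

    /* The common "abuse": layout, padding and byte order all depend on the machine. */
    int read_header_unportable(FILE *f, struct header *h)
    {
        return fread(h, sizeof *h, 1, f) == 1;
    }

    /* The portable version: read bytes, then assemble the integer explicitly. */
    int read_header_portable(FILE *f, struct header *h)
    {
        uint8_t buf[8];
        if (fread(buf, sizeof buf, 1, f) != 1)
            return 0;
        memcpy(h->magic, buf, 4);
        h->length = (uint32_t)buf[4] | ((uint32_t)buf[5] << 8)
                  | ((uint32_t)buf[6] << 16) | ((uint32_t)buf[7] << 24);
        return 1;
    }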
> I wish there were, in C, some equivalent of "struct" but where you could specify
I don't think there's a need for a struct-equivalent, I think there's a need for a struct loader with these capabilities, it would only be in charge of packing and unpacking but would shove everything in a struct.
Basically, Erlang's bit syntax for C (though the bit syntax unpacks to locals, not structs):
<< Foo:16/little, Bar:12/signed, Baz:4 >> = Bin.
(default type specifications are integer, unsigned and big-endian, between the : and / is the size of the data in "units", where Unit defaults to 8 bits for the binary type and 1 bit for integers, floats and strings).
Doesn't natively do alignment though, the developer has to pad on his own.
Python's `struct` module is similar[0], although the format string is basically unreadable and I believe it's absolutely terrible at decoding non-standard sizes (e.g. an int stored on 3 bits). There's also libpack[1] for C, and Perl has this[2], which was probably the inspiration for Python's.
[0] http://docs.python.org/library/struct.html
Yeah, C is really missing a way to serialize/deserialize data from raw memory/sockets into structs usable by your code. The least insane way, libpack [1] requires replicating the data format definition three times:
* define the struct with all elements
* define a string for the binary representation
* call fpack/funpack with the string and all the struct elements as parameters...
Unfortunately, fixing this either requires some kind of black X-macro [2] magic or another template language used to write the specification and to generate the three above-mentioned representations from it...
Surely this could be handled via simple syntactic extensions to the struct specification (with everything wrapped into an ungodly macro from hell) in order to define the mapping between the struct itself and libpack's format string, no?
> The problem is that you need to replicate the struct entries in the pack/unpack calls as well
Don't you only need the (generated) format string? Ideally, the macro could generate some wrapper function of some sort as well, which would unpack, fill and return an instance of the struct.
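As a rough sketch of the X-macro direction (the record, field names and format letters here are made up, not libpack's actual codes): list the fields once, then expand that list twice.

    #include <stdint.h>

    /* Hypothetical record: list the fields exactly once... */
    #define MSG_FIELDS(X)        \
        X(uint32_t, id,    "l")  \
        X(uint16_t, flags, "s")  \
        X(uint8_t,  kind,  "b")

    /* ...generate the struct definition from the list... */
    #define AS_MEMBER(type, name, fmt) type name;
    struct msg { MSG_FIELDS(AS_MEMBER) };

    /* ...and generate the matching format string from the same list
       (adjacent string literals concatenate to "lsb"). */
    #define AS_FMT(type, name, fmt) fmt
    static const char msg_fmt[] = MSG_FIELDS(AS_FMT);

A third expansion in the same style could spell out the member list for the pack/unpack call, so the field list stays defined in exactly one place.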
I've been working on some code to parse the shapefile format, which specifies some fields in big-endian and some in little (I have no idea why), but the binary package has made it really easy to deal with.
> You can't tackle the BO issue here, as it is a property of the underlying architecture.
I disagree; if you have "littleendian int32 myfield", for example, then every time you reference myfield on a big-endian architecture the compiler inserts the necessary byte-manipulation code, just like the guy does manually in the original post.
I agree. I think you may have misunderstood my comment. The wire representation should be read into memory via any byte order manipulations that are needed.
Why "every time you read the variable" then? You'd have a packed-binary-struct and a memory-representation-struct and conversions between them to use when reading or writing. I say "would", but I actually already do this. I have typedefs of the kind
    typedef union {
        uint8_t bytes[sizeof(uint64_t)];
        uint64_t native; /* alignment hint */
    } uint64le_t;
which I use in structs of my on-disk data structures.
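For what it's worth, the accessors that go with such a typedef are short; something along these lines (a sketch, assuming <stdint.h>; the conversion itself is just the article's shift-and-or, done once at the boundary):

    #include <stdint.h>

    static inline uint64_t uint64le_get(uint64le_t v)
    {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)
            x |= (uint64_t)v.bytes[i] << (8 * i);  /* little-endian on disk */
        return x;
    }

    static inline void uint64le_set(uint64le_t *v, uint64_t x)
    {
        for (int i = 0; i < 8; i++)
            v->bytes[i] = (uint8_t)(x >> (8 * i));
    }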
It would, in the worst case, not be worse than what the original poster is suggesting.
And at least in that case, the code would be more readable.
But, as the compiler knows more about what's going on (it's not just parsing and compiling a general expression with ORs and shifts) then it could well be faster (e.g. if there were a CPU instruction to do this, then it could be used, etc.).
I'd be curious to know why you want it as a language feature rather than something like google protocol buffers. I don't necessarily disagree, I'm just curious what your reasoning is.
That's a good question. (One I hadn't thought about much before I have to admit.)
I think it would be easier for the programmer to use a language feature than a library; the resulting code would be easier to read. I'm thinking how regular expressions are easier to use in Perl than they are in Java because they're part of the language, or how maps are easier to use in scripting languages than say Map in Java because they're part of the language.
If you had to have some e.g. string or external file describing the syntax of the hardware-independent struct, and then some calls like "read_entity(void *structure, char *fieldname)", it would all get nasty - you're going to have strings in the code which can't be checked at compile-time, you're going to be doing type casting which can't be checked at compile-time, and so on.
But a code generation system could be an option - you define the structure in a file in a certain syntax, and then code is generated with the right types and attribute names being visible to the compiler.
But I still think being part of the language would just be simpler for the user, and I don't consider this to be some obscure feature which would dilute the purity of the language by its introduction; binary file formats and protocols are here to stay.
P.S. Yes Google Protocol Buffers could well be the thing I've been searching for since I first saw the C struct many years ago.
Yeah, protocol buffers are a bit heavyweight, a bit harder to read than a language feature could be, and add an extra dependency. That's why I don't necessarily disagree, though the reason I asked is that it occurred to me that every time I have personally needed to deal with byte ordering has been in code that could easily justify a heavyweight solution (unlike regexes, which end up in tiny scripts that may only be run once), and those have also all been places where being able to easily interact with the same data in other languages would have been super useful, as would being able to easily add to the format without a versioning headache.
... of course, now that I write that I realize a language feature could actually provide all of that, too.
Rob Pike was specifically talking about binary streams. There are many cases where you can make simplifying assumptions in the name of speed; this is quite common in the Linux Kernel, where (a) lots of code uses it, so optimizing for every last bit of CPU efficiency is important, and (b) we need to know a lot about the CPU architecture anyway, so it's anything _but_ portable code. (Of course, we do create abstractions to hide this away from the programmer --- i.e., macros such as le32_to_cpu(x) and cpu_to_le32(x), and we mark structure members with __le32 instead of __u32 where it matters, so we can use static code analysis techniques to catch where we might have missed a le32_to_cpu conversion or vice versa.)
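Concretely, the convention looks something like this sketch (the structure and function names here are hypothetical; __le32, u32 and le32_to_cpu are the real kernel names):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* Hypothetical on-disk structure: members marked __le32, so static
       analysis can flag any place that forgets the conversion. */
    struct example_disk_super {
        __le32 s_magic;
        __le32 s_block_count;
    };

    static inline u32 example_block_count(const struct example_disk_super *sb)
    {
        /* No-op on little-endian CPUs, a single byte swap on big-endian ones. */
        return le32_to_cpu(sb->s_block_count);
    }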
What are some of the assumptions which Linux makes? For one, that there is a 32-bit type available to the compiler. For just about all modern CPU architectures where you might want to run Linux, this is true. This means that we can define a typedef for __u32, and it means that we can declare C structures whose layout represents the on-the-wire or on-disk format, without needing to pull the bytes off the wire one at a time to decode the stream. It also means that the on-the-wire or on-disk structures can be designed so that integers are well aligned, so that on all modern architectures we don't have to worry about unaligned 32-bit or 64-bit accesses.
And it's not just Linux which does this. The TCP/IP headers are designed the same way, and I guarantee you that networking code that might need to work at 10 Gbps isn't pulling the IP headers off one byte at a time and shifting them 8 bits at a time to decode the IP header fields. No, they're dropping the incoming packet into an aligned buffer, and then using direct access to the structures via primitives such as htonl(). (It also means that, at least for the foreseeable future, CPU architectures will be influenced by the implementation and design choices of such minor technologies as TCP/IP and the Linux kernel, so it's a fair bet that no matter what, there will always be a native 32-bit type for which 4-byte-aligned access is fast.)
The original TCP/IP designers and implementors knew what they were doing, and having worked with some of them, I have at least as much respect for them, if not more, than for Rob Pike...
The author claims that byte swapping code
- "depends on integers being 32 bits long, or requires more #ifdefs to pick a 32-bit integer type."
True. But you might consider using inttypes.h, which defines some pretty useful things like uint32_t (an unsigned 32-bit-wide integer, for example).
- "may be a little faster on little-endian machines, but not much, and it's slower on big-endian machines."
In fact swapping the byte order is _one_ CPU instruction.
You can for example use some inline assembly to optimize your code. (If your compiler fails to recognize this pattern.)
The syntax is specific to GCC. "=g"(x) tells it that x is written by the assembly, so the compiler can't propagate an earlier value through the asm. The "g" means that the compiler can store x in a register, or in memory, etc. "%0" is replaced with "x" in the assembly.
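For a 32-bit value, the whole thing is along these lines (a sketch: x86 plus GCC extended asm; note that bswap only accepts a register operand, so "r" is a safer constraint here than the looser "g"):

    #include <stdint.h>

    /* Sketch only: x86 + GCC extended asm. "0" ties the input to operand 0. */
    static inline uint32_t swap32(uint32_t x)
    {
        __asm__("bswap %0" : "=r"(x) : "0"(x));
        return x;
    }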
I noticed ifdefs like this when I inherited some C code back in 2001. I'd always worked on x86 systems, so never really encountered machines with different byte orders. The code was fugly, and I didn't like it, so I studied it some and it hit me that it didn't matter what the byte order was. If I constructed a 32 bit int and assigned it to a 32 bit int, the compiler would take care of the byte order. All I needed to know was the byte order of the network protocol we were using.
Tested new code on my x86 box and it worked. Then just committed to sourceforge CVS and told the rest of the world to test. It worked. My code looked a lot like Rob's.
The smallest questions always cause the most heated debate.
It doesn't really matter much how the possible byte-order swap is done: what matters is that these #ifdefs aren't littered around the code and that byte-order swapping is limited to the lowest level, where data is actually read from an external source.
I would personally go with his byte-array reads as it's less confusing, but I would still wrap the functionality inside inlined functions like these:
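For example, something along these lines (a sketch; the names are just illustrative, assuming <stdint.h>):

    #include <stdint.h>

    /* Decode a little-endian 32-bit value from a byte buffer. */
    static inline uint32_t read_le32(const uint8_t *p)
    {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    /* Decode a big-endian 32-bit value from a byte buffer. */
    static inline uint32_t read_be32(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }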
Nice! I think I'll use this. I was going to suggest also including calls to platform-specific byte swapping functions for compilers other than GCC, but maybe it's better to wait until it's actually demonstrated to be a benefit.
It's by Rob Pike. Not a wise choice of target to nitpick.
When he says "I guarantee that [...]", I'm inclined to take his word for it.
Nice piece, clearing up a cobweb in a poorly lighted corner. And teaches (with code example) what one really needs to know about handling byte order in data streams.
I wish I had time to write a more thorough response right now, but I just did a short test with Debian sid and its GCC 4.6.3 on a modern Xeon machine under Xen (so not the best performance-testing setup; take this with some salt).
At -O9, the compiler optimizes a masks-and-shifts swap of a uint64_t into a bswapq instruction identical to the one emitted by the GCC-specific __builtin_bswap64; this can be coupled with an initial memcpy into a temporary uint64_t. Loading individual bytes and shifting them in emits a pile of instructions that take up 16 times as much code space and ~35% runtime penalty (2.7 s versus 2 s). This is measured in a loop decoding a big-endian integer into a native uint64_t and writing it to a volatile extern uint64_t global, 2^30 iterations, function called through a function pointer.
Aligned versus unaligned pointers seem to make no real difference on this CPU, using a static __attribute__((aligned(8))) uint8_t[16] and offsets of 0 (aligned) and 5 (unaligned) from the start of the array.
I also tried a function with the explicit cast-shift-or that uses an initial memcpy into a local uint8_t[8] in case the compiler was doing something strange with regard to memory read fault ordering as compared to the explicit memcpy in the two bswapq-generating versions. This resulted in some very "interesting" code that shoves the local array into a register and then very roughly masks and shifts all the bits around, at about a 100% penalty from the bswapq functions. :-(
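Roughly, the two shapes of function being compared look like this (a reconstructed sketch, not the exact test code; note the word-based version as written is only correct on a little-endian host, which is what this test ran on):

    #include <stdint.h>
    #include <string.h>

    /* One common masks-and-shifts formulation of a 64-bit byte swap. */
    static uint64_t bswap64_masks(uint64_t x)
    {
        x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
        x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
        return (x << 32) | (x >> 32);
    }

    /* memcpy into a temporary word, then swap: the version GCC folded to bswapq. */
    static uint64_t decode_be64_word(const uint8_t *p)
    {
        uint64_t x;
        memcpy(&x, p, sizeof x);
        return bswap64_masks(x);
    }

    /* Byte-at-a-time version, measured at roughly 35% slower in that test. */
    static uint64_t decode_be64_bytes(const uint8_t *p)
    {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)
            x = (x << 8) | p[i];
        return x;
    }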
If anyone's interested in the details, reply and I'll try to put them somewhere accessible, though it may take a little while.
This isn't surprising. If you set the AC bit on x86, it will disallow unaligned accesses and you'll be operating in an environment more similar to RISC machines. In order for such a thing to succeed, GCC can't produce a 32-bit read from a char* address, since the alignment is only guaranteed to be 1 (i.e. no alignment) and that would trigger SIGBUS. Thus, in order to get a 32-bit read, you must dereference a 32-bit variable, not 4x 8-bit ones. This makes even more sense on RISC systems, where this "optimization" would be a tragic bug you'd want to work around in your compiler. See my post with the x86 assembly output confirming your general results.
> The byte order of the computer doesn't matter much at all except to compiler writers and the like
Binary protocol parsing is one area that relies heavily on byte ordering, struct packing and alignment. Tangentially, binary file parsing that is optimized for speed will have the same dependency. In fact, anything that deals with fast processing of off-the-wire data will want to know about the byte order.
Actually it's kind of funny... I recently wrote a base64 encoder/decoder that makes use of native endian order to build 16-bit unsigned int based lookup tables that map to the correct byte sequence in memory regardless of native endianness.
Looking up a 16-bit int rather than 2 chars, and outputting a 32-bit int rather than 4 chars, yields a nice performance boost at the cost of possibly not being portable to some more esoteric architectures that don't have 16- and 32-bit unsigned int types.
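The table-building part of that idea looks something like this sketch (hypothetical names, standard base64 alphabet): each 12-bit input index maps to a uint16_t whose bytes are already in output order for this host, so the encoder can emit two output characters with one 16-bit store.

    #include <stdint.h>
    #include <string.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static uint16_t pair_table[4096];

    static void build_pair_table(void)
    {
        for (unsigned i = 0; i < 4096; i++) {
            uint8_t bytes[2] = { (uint8_t)b64[i >> 6], (uint8_t)b64[i & 63] };
            /* Copy the two output characters in memory order; the uint16_t
               value differs by host endianness, but the bytes it stores back
               to memory are always in the right order. */
            memcpy(&pair_table[i], bytes, 2);
        }
    }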
So while he's right that 99% of the time you shouldn't be fiddling with byte order, it still pays to know how to wield such a tool, and it's most definitely not just for compiler writers.
CPU byte order definitely matters to device drivers reading/writing across the I/O bus: they must perform wide, aligned reads and writes using single CPU loads and stores.
Rob's approach simply won't work there. Similarly, OS-bypass networking and video, which expose hardware device interfaces in user space, require CPU-endian aware libraries.
That said, use Rob's portable approach anytime you don't have a compelling reason not to, if only to not have to worry about alignment and portability. Doing otherwise is premature optimization and a maintenance headache.
There's one and only one reason we write code the way he says not to: performance. Working with words instead of byte munging makes a huge performance difference. And in game development performance beats most other reasoning, especially when we are talking about loading tens of thousands of these on startup. And besides, all our code is wrapped in calls to inline functions named uint32_t FromBigEndian(...) anyway, so it's actually cleaner than what he proposes.
One fallacy is that you ever need to manually convert byte order yourself in the way the article suggests. Most systems have something that'll do it for you, e.g. htonl, ntohl.
I don't know why you got downvoted. Your idea is valid, and people may not have realized that htonl has friends and relatives that totally make the issue of the article moot.
The main point is that you should have a function that converts an array of bytes to an integer or vice versa, not something like htonl, which either swaps the order of bytes in an integer or does nothing, depending on the endianness of the platform.
I think this is mostly correct. Certainly when dealing with streams, it makes code more straightforward to just deal with a byte at a time. But grepping over some old code to see where I've used WORDS_BIGENDIAN, I see cases where I defined a typedef struct for a memory mapped binary data format. That is one place where you would sacrifice performance and clarity by dealing with bytes.
LSB-MSB enables certain useful addressing modes in the 6502, e.g. fast access to the zero page. Therefore IMHO little-endian is better. I'm not really interested in 16-bit and above :-)
But now he needs 2 variables, i and data, while otherwise I can just read in i and swap it afterward on a big-endian machine (assuming I read in little-endian certainly).
You realize your code is littered with exactly those defines which the article says are not necessary at all? You are exactly proving my point: with a define you don't need a second variable in the case where you don't need to swap bytes. And the trick is to use the define only after you already have the value in the integer. With his solution you would _always_ need the data pointer that you have put into the define.
You need two variables anyway to make any sense into the code.
Where would you get the value of 'i' if you didn't have 'data' and what would you do with the value read from 'data' if you didn't have 'i' or some equivalent?
I don't need a 'data' variable when I read directly from a file. For example: "fread( (void *)&i, sizeof(i), 1, file);". Works also with (packed) structs. For the solution given in the article I now need a second variable, 'data', first to buffer the information I read from the file.
edit: I mention this case as it covers around 90% of the cases I've seen on a quick check over a codebase I'm working with (Irrlicht). Endian swapping is nearly always done after reading in the data from a file stream.
> Let's say your data stream has a little-endian-encoded 32-bit integer. Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
Wait, if byte order doesn't matter, why do I need to do byte-level array lookups when I'm processing a stream of integers? Oh yeah, because byte order does matter. If byte order didn't matter (say, if all computers were 32-bit and had the same byte order), I could just cast the stream to int* and be done with it. I can't, because of byte order. It matters.
Whether you deal with it using byte-array lookups and math or #ifdefs and bitmasks, well, whatever rocks your boat man! Good that you're taking it into account, because byte order matters!
I think that skrebel is trying to say that doing as the article suggests comes with an unnecessary performance penalty if your CPU has the same endianness as the data stream.
Why should there be any performance penalty? A good compiler (and I've worked with at least one that could do this) would know the machine's endianness and could optimize away that sequence of selections, shifts and ORs when it isn't needed.
"...the computer's byte order shouldn't matter to you one bit. Notice the phrase "computer's byte order". What does matter is the byte order of a peripheral or encoded data stream"
The point is that when you use explicit byte shifts you have one piece of code that works regardless of endianness. The fact that the compiler is probably going to generate exactly the same machine code seems to me like a good argument for the more readable choice (i.e. explicit shifts). Also, the variant proposed by the article is actually portable C; anything involving casting an array of one type to an array of another, incompatible type is not.
No modern architecture can access arbitrarily aligned words in memory directly (the presence of caches modifies things slightly, but shifts the problem from data-bus width to cache-line width, as an unaligned word can still span two cache lines). There are generally two solutions to this: disallow it at the CPU level (and handle it by raising SIGBUS), or emulate it in hardware by doing two memory accesses for one load or store (which involves significant additional complexity). Intel invented a third solution in the i386: the OS can select between these two behaviors.
I would argue we, as a community, need to write portable, yet optimised, byte-order converters: htobe, htole, htobel, htolel, htobell, htolell, and vice versa.
Converting the byte order of an integer is a mostly pointless operation (which is what the article tries to say); what is needed is a portable, yet optimized, way to build/parse portable binary structures. In my opinion there are two reasons why too much optimization here is a complete waste of time:
1) Even if compilers are not able to optimize the manual conversion of an integer to/from discrete bytes into the same code as a word-sized access with an optional byte-order swap, it's mostly irrelevant, as there isn't going to be any significant difference in performance between one four-byte access and four one-byte accesses (in both cases you end up with the same number of actual memory transactions, which is the slow part, due to caches).
2) When you are handling a portable binary representation of something, it's always connected to some I/O, which is slow already, so any performance boost you get from a micro-optimization like this is completely negligible.
I tend to just hand-write a few lines of C to pack/unpack integers explicitly when needed, as that seems to me the most productive thing you can do.
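For the pack direction, those few lines look something like this (a sketch with an illustrative name; the unpack direction is the article's own shift-and-or):

    #include <stdint.h>

    /* Write a 32-bit value into a buffer in little-endian order, portably. */
    static inline void put_le32(uint8_t *p, uint32_t x)
    {
        p[0] = (uint8_t)(x);
        p[1] = (uint8_t)(x >> 8);
        p[2] = (uint8_t)(x >> 16);
        p[3] = (uint8_t)(x >> 24);
    }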
By the way, all the big-endian <-> little-endian functions you propose boil down to two implementations for each operand size: a no-op and mirroring all the bytes, both of which are mostly trivial.
What is really missing is a portable and efficient way to encode floating-point numbers, as there is no portable way to find out their endianness, and in the floating-point case it's more complex than just big vs. little endian.
While the article is rant-y and some details are sloppy, I have to admit his core point is pretty good. Rather than writing #ifdef'd code, why not write code that works the same regardless of endianness?
That seems like a good idea. Not for -all- the reasons he mentions, but for the simple reason that it's one code path so easier to code and easier to get right.