His point is that instead of byte swapping input, we should always use single-byte load operations because "it works for him."
But Plan9 is not a system known for its graphics, and I think performance would seriously suffer if everyone had to program like that. Being able to load a pixel as an int is the reason 32-bit RGB is used more often as a pixel format than 24-bit.
Of course it might not matter as much these days, since GCC and LLVM can optimize his code sequences into bswap instructions automatically. And SIMD/shader code doesn't have any endian portability problems that I know of, if only because SIMD is already not portable.
GCC doesn't merge the 4 byte-level reads into one 32-bit read. Thus, it does cause some performance penalty. The true impact is probably quite low, but it does exist on x86.
It is true, however, that GCC will take a series of bit ops and produce a 'bswap' instruction on x86, but that requires a full 32-bit word to start with.
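For example, given a whole word to start from, the usual shift-and-mask swap is the sort of pattern recent GCC/Clang will typically fold into a single bswap (a sketch; exact codegen depends on compiler version and flags):

    #include <stdint.h>

    /* Starting from a full 32-bit word, this is the pattern compilers recognize. */
    static uint32_t swap_word(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000ff00u)
             | ((x << 8) & 0x00ff0000u) | (x << 24);
    }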
On modern CPUs, a byte load/store is really an integer (i.e. 32-bit/64-bit depending on arch) load/store that is rigged to only affect the target byte. On IA64 and PPC, it would just SIGBUS out (as it probably should on x86/amd64 too, but they kept it for compat reasons)
AFAIK ARM processors don't support misaligned word access.
AFAIK misaligned word access is twice as slow as aligned word access (it requires 2 reads), so I don't understand "offers it for free". But this is still twice as fast as the example code. Note that endianness and word alignment are two distinct problems.
The point made by the author addresses this issue from a different angle.
As the author says, programmers should always write endianness-neutral code unless it is impossible, which is generally at the interfaces where data is read and written (I/O) by the program. If the code is correctly and intelligently structured so that marshaling is done once, then byte swapping can generally be expected to be a low-frequency operation. In that case the simplest and most portable code should be favored.
Trying to optimize this operation with word reads and byte swapping provides an insignificant speedup at a higher cost in code portability and maintainability. The author is right on this.
Though it is also true that in some cases the operation frequency is very high (e.g. reading millions of pixel values from an image). For those use cases, the programming overhead of using highly optimized code is perfectly justified. But then don't use half-baked optimizations: align data on word boundaries (twice as fast), read by word (four times as fast) and use the byte-swapping machine instructions available on the target CPU instead of the proposed shifts and bit masks.
My opinion is that good languages should provide optimized data marshaling functions in their library so that the code can be optimal and portable at the same time.
ARM supports unaligned memory accesses since v6. In most modern implementations, unaligned accesses falling entirely within a 16-byte aligned block have no penalty at all, while crossing 16-byte boundaries does impose a cost. If the locations of unaligned accesses are randomly distributed, this cost is still cheaper on average than accessing a byte at a time.
I agree with his suggestion that most code manipulating byte order is incorrect or unneeded. Within your application's logic, everything should already be uniform.
I agree slightly less with "computer's byte order doesn't matter", as differentiated from peripheral and data stream byte order... that's the same friggin thing. It matters that you treat your inputs and outputs correctly, and how you do that depends on your computer's byte order, so the computer's byte order does matter. Just not so much during the data processing stage.
But mostly, I'm just saddened that every post about C now has a "only people who do [X thing that requires C] do that, and you're probably not one of them, so you shouldn't do that!" Maybe there's just a huge disconnect between people-who-blog and people-who-write-low-level-code, but most of the software guys I know have worked professionally on microcontrollers, DSPs, operating systems, or compilers within the last 5 years, and I'm working on a compiler for a DSP right now (and I expect byte order to matter).
Also, it's been 5 years since I had to deal with it, but I remember endian mattered for some image file formats, and also for blitting to the screen for Mac vs PC.
No offense brother, but Rob Pike shits all over you and any of your "software guys". The man is a living legend. This is not to say that he can't be wrong, but to call him just a "person who blogs" only shows how little you know.
How does that make it any better? If the name on the blog is what makes you think a post is shit or gold, then you are probably not a very good critical thinker.
I wish there were, in C, some equivalent of "struct" but where you could specify
- The byte order / endianness
- The alignment of variables
- The exact width of variables (32-bits, 64-bits)
"struct" is great for what it was designed for, storing your internal data structures in a way efficient for the machine.
But everyone abuses structs and tries to read external data sources, e.g. files, using them. They might hack it to work on their own machine; then, as soon as a machine of the other endianness comes along, hacks and #ifdefs appear, and then machines with ints of different widths come along....
Of course these people are using structs "wrong", like the author of the article suggests. But nevertheless, the fact that people are using structs "wrong" suggests there is a need for something that provides what people are trying to use structs for.
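To make the abuse concrete, here's a sketch with a hypothetical file header (all names are illustrative); the fread-into-struct version quietly depends on the machine's endianness, padding and integer widths, while the byte-wise version does not:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical file header: 4-byte magic plus a little-endian 32-bit length. */
    struct header {
        uint8_t  magic[4];
        uint32_t length;    /* stored little-endian in the file */
    };

    /* The common "abuse": layout, padding and byte order all depend on the machine. */
    int read_header_unportable(FILE *f, struct header *h)
    {
        return fread(h, sizeof *h, 1, f) == 1;
    }

    /* The portable version: read bytes, then assemble the integer explicitly. */
    int read_header_portable(FILE *f, struct header *h)
    {
        uint8_t buf[8];
        if (fread(buf, sizeof buf, 1, f) != 1)
            return 0;
        memcpy(h->magic, buf, 4);
        h->length = (uint32_t)buf[4] | ((uint32_t)buf[5] << 8)
                  | ((uint32_t)buf[6] << 16) | ((uint32_t)buf[7] << 24);
        return 1;
    }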
> I wish there were, in C, some equivalent of "struct" but where you could specify
I don't think there's a need for a struct-equivalent, I think there's a need for a struct loader with these capabilities, it would only be in charge of packing and unpacking but would shove everything in a struct.
Basically, Erlang's bit syntax for C (though the bit syntax unpacks to locals, not structs):
<< Foo:16/little, Bar:12/signed, Baz:4 >> = Bin.
(default type specifications are integer, unsigned and big-endian, between the : and / is the size of the data in "units", where Unit defaults to 8 bits for the binary type and 1 bit for integers, floats and strings).
Doesn't natively do alignment though, the developer has to pad on his own.
Python's `struct` module is similar[0], although the format string is basically unreadable and I believe it's absolutely terrible at decoding non-standard sizes (e.g. an int stored on 3 bits). There's also libpack[1] for C, and Perl has this[2], which was probably the inspiration for Python's.
[0] http://docs.python.org/library/struct.html
Yeah, C is really missing a way to serialize/deserialize data from raw memory/sockets into structs usable by your code. The least insane way, libpack [1] requires replicating the data format definition three times:
* define the struct with all elements
* define a string for the binary representation
* call fpack/funpack with the string and all the struct elements as parameters...
Unfortunately, fixing this either requires some kind of black X-macro [2] magic or another template language used to write the specification and to generate the three above-mentioned representations from it...
Surely this could be handled via simple syntactic extensions to the struct specification (with everything wrapped into an ungodly macro from hell) in order to define the mapping between the struct itself and libpack's format string, no?
> The problem is that you need to replicate the struct entries in the pack/unpack calls as well
Don't you only need the (generated) format string? Ideally, the macro could generate some wrapper function of some sort as well, which would unpack, fill and return an instance of the struct.
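As a rough sketch of the X-macro direction (the record, field names and format letters here are made up, not libpack's actual codes): list the fields once, then expand that list twice.

    #include <stdint.h>

    /* Hypothetical record: list the fields exactly once... */
    #define MSG_FIELDS(X)        \
        X(uint32_t, id,    "l")  \
        X(uint16_t, flags, "s")  \
        X(uint8_t,  kind,  "b")

    /* ...generate the struct definition from the list... */
    #define AS_MEMBER(type, name, fmt) type name;
    struct msg { MSG_FIELDS(AS_MEMBER) };

    /* ...and generate the matching format string from the same list
       (adjacent string literals concatenate to "lsb"). */
    #define AS_FMT(type, name, fmt) fmt
    static const char msg_fmt[] = MSG_FIELDS(AS_FMT);

A third expansion in the same style could spell out the member list for the pack/unpack call, so the field list stays defined in exactly one place.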
I've been working on some code to parse the shapefile format, which specifies some fields in big-endian and some in little (I have no idea why), but the binary package has made it really easy to deal with.
> You can't tackle the BO issue here, as it is a property of the underlying architecture.
I disagree; if you have "littleendian int32 myfield", for example, then every time you reference myfield on a big-endian architecture the compiler inserts the necessary byte-manipulation code, just like the guy does manually in the original post.
I agree. I think you may have misunderstood my comment. The wire representation should be read into memory via any byte order manipulations that are needed.
Why "every time you read the variable" then? You'd have a packed-binary-struct and a memory-representation-struct and conversions between them to use when reading or writing. I say "would", but I actually already do this. I have typedefs of the kind
    typedef union {
        uint8_t bytes[sizeof(uint64_t)];
        uint64_t native; /* alignment hint */
    } uint64le_t;
which I use in structs of my on-disk data structures.
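For what it's worth, the accessors that go with such a typedef are short; something along these lines (a sketch, assuming <stdint.h>; the conversion itself is just the article's shift-and-or, done once at the boundary):

    #include <stdint.h>

    static inline uint64_t uint64le_get(uint64le_t v)
    {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)
            x |= (uint64_t)v.bytes[i] << (8 * i);  /* little-endian on disk */
        return x;
    }

    static inline void uint64le_set(uint64le_t *v, uint64_t x)
    {
        for (int i = 0; i < 8; i++)
            v->bytes[i] = (uint8_t)(x >> (8 * i));
    }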
It would, in the worst case, not be worse than what the original poster is suggesting.
And at least in that case, the code would be more readable.
But, as the compiler knows more about what's going on (it's not just parsing and compiling a general expression with ORs and shifts) then it could well be faster (e.g. if there were a CPU instruction to do this, then it could be used, etc.).
I'd be curious to know why you want it as a language feature rather than something like google protocol buffers. I don't necessarily disagree, I'm just curious what your reasoning is.
That's a good question. (One I hadn't thought about much before I have to admit.)
I think it would be easier for the programmer to use a language feature than a library; the resulting code would be easier to read. I'm thinking how regular expressions are easier to use in Perl than they are in Java because they're part of the language, or how maps are easier to use in scripting languages than say Map in Java because they're part of the language.
If you had to have some e.g. string or external file describing the syntax of the hardware-independent struct, and then some calls like "read_entity(void *structure, char *fieldname)", it would all get nasty - you're going to have strings in the code which can't be checked at compile-time, you're going to be doing type casting which can't be checked at compile-time, and so on.
But a code generation system could be an option - you define the structure in a file in a certain syntax, and then code is generated with the right types and attribute names being visible to the compiler.
But I still think being part of the language would just be simpler for the user, and I don't consider this to be some obscure feature which would dilute the purity of the language by its introduction; binary file formats and protocols are here to stay.
P.S. Yes Google Protocol Buffers could well be the thing I've been searching for since I first saw the C struct many years ago.
Yeah, protocol buffers are a bit heavyweight, a bit harder to read than a language feature could be, and add an extra dependency. That's why I don't necessarily disagree, though the reason I asked is that it occurred to me that every time I have personally needed to deal with byte ordering has been in code that could easily justify a heavyweight solution (unlike regexes, which end up in tiny scripts that may only be run once), and those have also all been places where being able to easily interact with the same data in other languages would have been super useful, as would being able to easily add to the format without a versioning headache.
... of course, now that I write that I realize a language feature could actually provide all of that, too.
Rob Pike was specifically talking about binary streams. There are many cases where you can make simplifying assumptions in the name of speed; this is quite common in the Linux Kernel, where (a) lots of code uses it, so optimizing for every last bit of CPU efficiency is important, and (b) we need to know a lot about the CPU architecture anyway, so it's anything _but_ portable code. (Of course, we do create abstractions to hide this away from the programmer --- i.e., macros such as le32_to_cpu(x) and cpu_to_le32(x), and we mark structure members with __le32 instead of __u32 where it matters, so we can use static code analysis techniques to catch where we might have missed a le32_to_cpu conversion or vice versa.)
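Concretely, the convention looks something like this sketch (the structure and function names here are hypothetical; __le32, u32 and le32_to_cpu are the real kernel names):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* Hypothetical on-disk structure: members marked __le32, so static
       analysis can flag any place that forgets the conversion. */
    struct example_disk_super {
        __le32 s_magic;
        __le32 s_block_count;
    };

    static inline u32 example_block_count(const struct example_disk_super *sb)
    {
        /* No-op on little-endian CPUs, a single byte swap on big-endian ones. */
        return le32_to_cpu(sb->s_block_count);
    }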
What are some of the assumptions which Linux makes? For one, that there is a 32-bit type available to the compiler. For just about all modern CPU architectures where you might want to run Linux, this is true. This means that we can define a typedef for __u32, and it means that we can declare C structures whose layout represents the on-the-wire or on-disk format, without needing to pull the bytes off the wire one at a time to decode the stream. It also means that the on-the-wire or on-disk structures can be designed so that integers are well aligned, so that on all modern architectures we don't have to worry about unaligned 32-bit or 64-bit accesses.
And it's not just Linux which does this. The TCP/IP headers are designed the same way, and I guarantee you that networking code that might need to work at 10 Gbps isn't pulling the IP headers off one byte at a time and shifting them 8 bits at a time to decode the IP header fields. No, they're dropping the incoming packet into an aligned buffer, and then using direct access to the structures via primitives such as htonl(). (It also means that, at least for the foreseeable future, CPU architectures will be influenced by the implementation and design choices of such minor technologies as TCP/IP and the Linux kernel, so it's a fair bet that no matter what, there will always be a native 32-bit type for which 4-byte-aligned access is fast.)
The original TCP/IP designers and implementors knew what they were doing, and having worked with some of them, I have at least as much respect for them, if not more, than for Rob Pike...
The author claims that byte swapping code
- "depends on integers being 32 bits long, or requires more #ifdefs to pick a 32-bit integer type."
True. But you might consider using inttypes.h, which defines some pretty useful things like uint32_t (an unsigned 32-bit-wide integer, for example).
- "may be a little faster on little-endian machines, but not much, and it's slower on big-endian machines."
In fact swapping the byte order is _one_ CPU instruction.
You can for example use some inline assembly to optimize your code. (If your compiler fails to recognize this pattern.)
The syntax is specific to GCC. "=g"(x) tells it that x is written by the assembly, so the compiler can't propagate an earlier value through the asm. The "g" means that the compiler can store x in a register, or in memory, etc. "%0" is replaced with "x" in the assembly.
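For a 32-bit value, the whole thing is along these lines (a sketch: x86 plus GCC extended asm; note that bswap only accepts a register operand, so "r" is a safer constraint here than the looser "g"):

    #include <stdint.h>

    /* Sketch only: x86 + GCC extended asm. "0" ties the input to operand 0. */
    static inline uint32_t swap32(uint32_t x)
    {
        __asm__("bswap %0" : "=r"(x) : "0"(x));
        return x;
    }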
I noticed ifdefs like this when I inherited some C code back in 2001. I'd always worked on x86 systems, so never really encountered machines with different byte orders. The code was fugly, and I didn't like it, so I studied it some and it hit me that it didn't matter what the byte order was. If I constructed a 32 bit int and assigned it to a 32 bit int, the compiler would take care of the byte order. All I needed to know was the byte order of the network protocol we were using.
Tested new code on my x86 box and it worked. Then just committed to sourceforge CVS and told the rest of the world to test. It worked. My code looked a lot like Rob's.
The smallest questions always cause the most heated debate.
It doesn't really matter much how the possible byte-order swap is done: what matters is that these #ifdefs aren't littered around the code and that byte-order swapping is limited to the lowest level, where data is actually read from an external source.
I would personally go with his byte-array reads as it's less confusing, but I would still wrap the functionality inside inlined functions like these:
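For example, something along these lines (a sketch; the names are just illustrative, assuming <stdint.h>):

    #include <stdint.h>

    /* Decode a little-endian 32-bit value from a byte buffer. */
    static inline uint32_t read_le32(const uint8_t *p)
    {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    /* Decode a big-endian 32-bit value from a byte buffer. */
    static inline uint32_t read_be32(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }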
Nice! I think I'll use this. I was going to suggest also including calls to platform-specific byte swapping functions for compilers other than GCC, but maybe it's better to wait until it's actually demonstrated to be a benefit.
It's by Rob Pike. Not a wise choice of target to nitpick.
When he says "I guarantee that [...]", I'm inclined to take his word for it.
Nice piece, clearing up a cobweb in a poorly lighted corner. And teaches (with code example) what one really needs to know about handling byte order in data streams.
I wish I had time to write a more thorough response right now, but I just did a short test with Debian sid and its GCC 4.6.3 on a modern Xeon machine under Xen (so not the best performance-testing setup; take this with some salt).
At -O9, the compiler optimizes a masks-and-shifts swap of a uint64_t into a bswapq instruction identical to the one emitted by the GCC-specific __builtin_bswap64; this can be coupled with an initial memcpy into a temporary uint64_t. Loading individual bytes and shifting them in emits a pile of instructions that take up 16 times as much code space and ~35% runtime penalty (2.7 s versus 2 s). This is measured in a loop decoding a big-endian integer into a native uint64_t and writing it to a volatile extern uint64_t global, 2^30 iterations, function called through a function pointer.
Aligned versus unaligned pointers seem to make no real difference on this CPU, using a static __attribute__((aligned(8))) uint8_t[16] and offsets of 0 (aligned) and 5 (unaligned) from the start of the array.
I also tried a function with the explicit cast-shift-or that uses an initial memcpy into a local uint8_t[8] in case the compiler was doing something strange with regard to memory read fault ordering as compared to the explicit memcpy in the two bswapq-generating versions. This resulted in some very "interesting" code that shoves the local array into a register and then very roughly masks and shifts all the bits around, at about a 100% penalty from the bswapq functions. :-(
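Roughly, the two shapes of function being compared look like this (a reconstructed sketch, not the exact test code; note the word-based version as written is only correct on a little-endian host, which is what this test ran on):

    #include <stdint.h>
    #include <string.h>

    /* One common masks-and-shifts formulation of a 64-bit byte swap. */
    static uint64_t bswap64_masks(uint64_t x)
    {
        x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
        x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
        return (x << 32) | (x >> 32);
    }

    /* memcpy into a temporary word, then swap: the version GCC folded to bswapq. */
    static uint64_t decode_be64_word(const uint8_t *p)
    {
        uint64_t x;
        memcpy(&x, p, sizeof x);
        return bswap64_masks(x);
    }

    /* Byte-at-a-time version, measured at roughly 35% slower in that test. */
    static uint64_t decode_be64_bytes(const uint8_t *p)
    {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)
            x = (x << 8) | p[i];
        return x;
    }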
If anyone's interested in the details, reply and I'll try to put them somewhere accessible, though it may take a little while.
This isn't surprising. If you set the AC bit on x86, it will disallow unaligned accesses and you'll be operating in an environment more similar to RISC machines. In order for such a thing to succeed, GCC can't produce a 32-bit read from a char* address, since the alignment is only guaranteed to be 1 (i.e. no alignment) and that would trigger SIGBUS. Thus, in order to get a 32-bit read, you must dereference a 32-bit variable, not 4x 8-bit ones. This makes even more sense on RISC systems, where this "optimization" would be a tragic bug you'd want to work around in your compiler. See my post with the x86 assembly output confirming your general results.
> The byte order of the computer doesn't matter much at all except to compiler writers and the like
Binary protocol parsing is one area that relies heavily on byte ordering, struct packing and alignment. Tangentially, binary file parsing that is optimized for speed will have the same dependency. In fact, anything that deals with fast processing of off-the-wire data will want to know about the byte order.
Actually it's kind of funny... I recently wrote a base64 encoder/decoder that makes use of native endian order to build 16-bit unsigned int based lookup tables that map to the correct byte sequence in memory regardless of native endianness.
Looking up a 16-bit int rather than 2 chars, and outputting a 32-bit int rather than 4 chars, yields a nice performance boost at the cost of possibly not being portable to some more esoteric architectures that don't have 16- and 32-bit unsigned int types.
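The table-building part of that idea looks something like this sketch (hypothetical names, standard base64 alphabet): each 12-bit input index maps to a uint16_t whose bytes are already in output order for this host, so the encoder can emit two output characters with one 16-bit store.

    #include <stdint.h>
    #include <string.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static uint16_t pair_table[4096];

    static void build_pair_table(void)
    {
        for (unsigned i = 0; i < 4096; i++) {
            uint8_t bytes[2] = { (uint8_t)b64[i >> 6], (uint8_t)b64[i & 63] };
            /* Copy the two output characters in memory order; the uint16_t
               value differs by host endianness, but the bytes it stores back
               to memory are always in the right order. */
            memcpy(&pair_table[i], bytes, 2);
        }
    }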
So while he's right that 99% of the time you shouldn't be fiddling with byte order, it still pays to know how to wield such a tool, and it's most definitely not just for compiler writers.
CPU byte order definitely matters to device drivers reading/writing across the I/O bus: they must perform wide, aligned reads and writes using single CPU loads and stores.
Rob's approach simply won't work there. Similarly, OS-bypass networking and video, which expose hardware device interfaces in user space, require CPU-endian aware libraries.
That said, use Rob's portable approach anytime you don't have a compelling reason not to, if only to not have to worry about alignment and portability. Doing otherwise is premature optimization and a maintenance headache.
There's one and only one reason we write code the way he says not to: performance. Working with words instead of byte munging makes a huge performance difference. And in game development performance beats most other reasoning, especially when we are talking about loading tens of thousands of these on startup. And besides, all our code is wrapped in calls to inline functions named uint32_t FromBigEndian(...) anyway, so it's actually cleaner than what he proposes.
One fallacy is that you ever need to manually convert byte order yourself in the way the article suggests. Most systems have something that'll do it for you, e.g. htonl, ntohl.
I don't know why you got downvoted. Your idea is valid, and people may not have realized that htonl has friends and relatives that totally make the issue of the article moot.
The main point is that you should have a function that converts an array of bytes to an integer or vice versa, not something like htonl, which either swaps the order of bytes in an integer or does nothing, depending on the endianness of the platform.
I think this is mostly correct. Certainly when dealing with streams, it makes code more straightforward to just deal with a byte at a time. But grepping over some old code to see where I've used WORDS_BIGENDIAN, I see cases where I defined a typedef struct for a memory mapped binary data format. That is one place where you would sacrifice performance and clarity by dealing with bytes.
LSB-MSB enables certain useful addressing modes in the 6502, e.g. fast access to the zero page. Therefore IMHO little-endian is better. I'm not really interested in 16-bit and above :-)
But now he needs 2 variables, i and data, while otherwise I can just read in i and swap it afterward on a big-endian machine (assuming I read in little-endian certainly).
You realize your code is littered with exactly those defines which the article says are not necessary at all? You are exactly proving my point: with a define you don't need a second variable in the case where you don't need to swap bytes. And the trick is to use the define only after you already have the value in the integer. With his solution you would _always_ need the data pointer that you have put into the define.
You need two variables anyway to make any sense into the code.
Where would you get the value of 'i' if you didn't have 'data' and what would you do with the value read from 'data' if you didn't have 'i' or some equivalent?
I don't need a 'data' variable when I read directly from a file. For example: "fread( (void *)&i, sizeof(i), 1, file);". Works also with (packed) structs. For the solution given in the article I now need a second variable, 'data', first to buffer the information I read from the file.
edit: I mention this case as it covers around 90% of the cases I've seen on a quick check over a codebase I'm working with (Irrlicht). Endian swapping is nearly always done after reading in the data from a file stream.
> Let's say your data stream has a little-endian-encoded 32-bit integer. Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
Wait, if byte order doesn't matter, why do I need to do byte-level array lookups when I'm processing a stream of integers? Oh yeah, because byte order does matter. If byte order didn't matter (say, if all computers were 32-bit and had the same byte order), I could just cast the stream to int* and be done with it. I can't, because of byte order. It matters.
Whether you deal with it using byte-array lookups and math or #ifdefs and bitmasks, well, whatever rocks your boat man! Good that you're taking it into account, because byte order matters!
I think that skrebel is trying to say that doing as the article suggests comes with an unnecessary performance penalty if your CPU has the same endianness as the data stream.
Why should there be any performance penalty? A good compiler (and I've worked with at least one that could do this) would know the machine's endianness and could optimize away that sequence of selections, shifts and ORs when it isn't needed.
"...the computer's byte order shouldn't matter to you one bit. Notice the phrase "computer's byte order". What does matter is the byte order of a peripheral or encoded data stream"
The point is that when you use explicit byte shifts you have one piece of code that works regardless of endianness. The fact that the compiler is probably going to generate exactly the same machine code seems to me like a good argument for the more readable choice (i.e. explicit shifts). Also, the variant proposed by the article is actually portable C; anything involving casting an array of one type to an array of another, incompatible type is not.
No modern architecture can access arbitrarily aligned words in memory directly (the presence of caches modifies things slightly, but shifts the problem from data-bus width to cache-line width, as an unaligned word can still span two cache lines). There are generally two solutions to this: disallow it at the CPU level (and handle it by raising SIGBUS), or emulate it in hardware by doing two memory accesses for one load or store (which involves significant additional complexity). Intel invented a third solution in the i386: the OS can select between these two behaviors.
I would argue we, as a community, need to write portable, yet optimised, byte-order converters: htobe, htole, htobel, htolel, htobell, htolell, and vice versa.
Converting the byte order of an integer is a mostly pointless operation (which is what the article tries to say); what is needed is a portable, yet optimized, way to build/parse portable binary structures. In my opinion there are two reasons why too much optimization here is a complete waste of time:
1) Even if compilers are not able to optimize the manual conversion of an integer to/from discrete bytes into the same code as a word-sized access with an optional byte-order swap, it's mostly irrelevant, as there isn't going to be any significant difference in performance between one four-byte access and four one-byte accesses (in both cases you end up with the same number of actual memory transactions, which is the slow part, due to caches).
2) When you are handling a portable binary representation of something, it's always connected to some I/O, which is slow already, so any performance boost you get from a micro-optimization like this is completely negligible.
I tend to just hand-write a few lines of C to pack/unpack integers explicitly when needed, as that seems to me the most productive thing you can do.
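For the pack direction, those few lines look something like this (a sketch with an illustrative name; the unpack direction is the article's own shift-and-or):

    #include <stdint.h>

    /* Write a 32-bit value into a buffer in little-endian order, portably. */
    static inline void put_le32(uint8_t *p, uint32_t x)
    {
        p[0] = (uint8_t)(x);
        p[1] = (uint8_t)(x >> 8);
        p[2] = (uint8_t)(x >> 16);
        p[3] = (uint8_t)(x >> 24);
    }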
By the way, all the big-endian <-> little-endian functions you propose boil down to two implementations for each operand size: a no-op and mirroring all the bytes, both of which are mostly trivial.
What is really missing is a portable and efficient way to encode floating-point numbers, as there is no portable way to find out their endianness, and in the floating-point case it's more complex than just big vs. little endian.
While the article is rant-y and some details are sloppy, I have to admit his core point is pretty good. Rather than writing #ifdef'd code, why not write code that works the same regardless of endianness?
That seems like a good idea. Not for -all- the reasons he mentions, but for the simple reason that it's one code path so easier to code and easier to get right.