Practical example: objc_msgSend (which is called for every message send in an Obj-C program) is written in assembly so it can jump to the target method implementation without disturbing any caller-save/callee-save register state: http://www.friday.com/bbum/2009/12/18/objc_msgsend-part-1-th... (and also presumably because C won't guarantee that a tail call is actually compiled as a jump).
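
A minimal sketch in C of why that tail-call point matters (this fixes the method signature and fakes the lookup, neither of which the real objc_msgSend can do; `lookup_imp` and `method_body` are made-up stand-ins, not anything from the actual Objective-C runtime):

    #include <stdio.h>

    typedef long (*imp_t)(long receiver, long selector);

    static long method_body(long receiver, long selector) {
        return receiver + selector;
    }

    /* Hypothetical stand-in for objc_msgSend's method-cache lookup. */
    static imp_t lookup_imp(long receiver, long selector) {
        (void)receiver; (void)selector;
        return method_body;
    }

    static long dispatch(long receiver, long selector) {
        imp_t imp = lookup_imp(receiver, selector);
        return imp(receiver, selector);  /* a tail call in C: the compiler may
                                            or may not emit it as a plain jmp */
    }

    int main(void) {
        printf("%ld\n", dispatch(40, 2));  /* prints 42 */
        return 0;
    }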

Assembly is more relevant today than it has been in a while, largely because Intel has gone pretty gung-ho with vector instructions that compilers mostly can't figure out how to emit on their own.

For example, say I want to convert an array of bytes into a bitmask such that every zero-valued byte is converted into a set bit. On a 64-bit system, you can do this 64 bytes at a time:

    /* Set bit i of the mask if bytes[i] is zero. */
    unsigned long bits = 0;
    for(unsigned char i = 0; i < 64; ++i) {
        bits |= ((unsigned long)(bytes[i] == 0) << i);
    }
This isn't a particularly efficient use of the CPU. If you've got a CPU that supports AVX2, you've got two very powerful instructions available to you: VPCMPEQB and VPMOVMSKB. VPCMPEQB compares two YMM registers for equality at byte granularity and, for every byte in which the registers are equal, sets the corresponding byte of the destination register to all ones. VPMOVMSKB takes the high bit of each byte of a YMM register and stores the resulting bitmask in a GPR. Since YMM registers are 256 bits (32 bytes) wide, you can reduce the entire loop above to just a handful of instructions: two vector loads, two VPCMPEQB (against zero), two VPMOVMSKB, a shift, and an OR. Instead of a loop processing data a byte at a time, you have straight-line code processing data 32 bytes at a time.
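
For reference, a sketch of roughly what that looks like using AVX2 compiler intrinsics instead of hand-written assembly (assuming `bytes` points to at least 64 readable bytes; the function name is made up, and you'd build with something like -mavx2):

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch: 64-bit mask with bit i set iff bytes[i] == 0, using AVX2. */
    static uint64_t zero_byte_mask(const unsigned char *bytes) {
        __m256i zero = _mm256_setzero_si256();
        __m256i lo = _mm256_loadu_si256((const __m256i *)bytes);        /* bytes 0..31  */
        __m256i hi = _mm256_loadu_si256((const __m256i *)(bytes + 32)); /* bytes 32..63 */

        /* VPCMPEQB: each byte lane becomes 0xFF where it equals zero, 0x00 otherwise. */
        __m256i eq_lo = _mm256_cmpeq_epi8(lo, zero);
        __m256i eq_hi = _mm256_cmpeq_epi8(hi, zero);

        /* VPMOVMSKB: gather the high bit of each byte lane into a 32-bit value. */
        uint32_t mask_lo = (uint32_t)_mm256_movemask_epi8(eq_lo);
        uint32_t mask_hi = (uint32_t)_mm256_movemask_epi8(eq_hi);

        /* Combine the two 32-byte halves: two loads, two compares,
           two movemasks, a shift, and an OR. */
        return (uint64_t)mask_lo | ((uint64_t)mask_hi << 32);
    }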

A 4th generation Core CPU (Haswell) has a tremendous amount of bandwidth: it can do two 32-byte loads per clock cycle, two 32-wide byte comparisons per clock cycle, and so on. If you're writing regular C code, dealing with 8-byte longs or 4-byte ints, you're leaving much of that bandwidth on the table.




FWIW, those particular instructions have been around since the Pentium 3. Most of the recent advances have been in width (64 -> 128 -> 256 -> ...) and in execution units that are actually that wide.


It's good to know about SIMD instructions, but you can usually get at them with intrinsics (if you're writing in C) without dropping down to pure assembly.


You can, but if you want to optimize things like register usage, it's sometimes easier to just write the whole function in assembly.



