For anyone still wondering, after reading all that, what the actual reason for the extra instruction is: it comes from the calling convention. When calling a variadic function under the SysV AMD64 ABI, AL holds an upper bound on the number of vector registers used for arguments. A function declared as f() has no prototype, so the caller has to assume it might be variadic and set AL anyway. I believe the Microsoft x64 convention doesn't do that.
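To make the AL rule concrete, here is a minimal sketch in C. The names sum_doubles and call_with_two are made up for illustration, and the assembly mentioned in the comments is what I'd expect a SysV AMD64 compiler to emit, not verified output from any particular compiler version.

```c
#include <stdarg.h>

/* Hypothetical variadic helper: sums `count` double arguments. */
double sum_doubles(int count, ...)
{
    va_list args;
    double total = 0.0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, double);
    va_end(args);
    return total;
}

/* Hypothetical caller. Two arguments travel in XMM registers, so under
 * SysV AMD64 the caller is expected to load 2 (or any upper bound) into
 * AL before the call, e.g. `movl $2, %eax`. With no floating-point
 * arguments the same slot becomes `xorl %eax, %eax`, which is the extra
 * instruction discussed above. */
double call_with_two(void)
{
    return sum_doubles(2, 1.5, 2.5);
}
```

Compiling something like this with -O2 -S and looking at the instructions right before the call to sum_doubles should show where AL gets set.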
> So Clang can potentially save you a single instruction (xorl %eax, %eax) whose encoding is only 1B, per function call to functions declared in the style f(), but only IF the definition is in the same translation unit and doesn’t differ from the declaration, and you happen to be targeting x86_64.
Also, xor r32, r32 encodes in 2 bytes (opcode plus ModRM), not 1.
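For the byte count, a rough sketch of the encoding in C. The helper names modrm and encode_xor_r32 are hypothetical, and this only covers the eight legacy registers (eax through edi, numbered 0 to 7); r8d through r15d would need a REX prefix and therefore a third byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Builds a ModRM byte: mod (2 bits) | reg (3 bits) | rm (3 bits). */
static uint8_t modrm(uint8_t mod, uint8_t reg, uint8_t rm)
{
    return (uint8_t)(((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

/* Hypothetical encoder for `xor dst, src` with 32-bit legacy registers.
 * Uses opcode 0x31 (XOR r/m32, r32) followed by a register-direct
 * ModRM byte (mod = 3). Returns the number of bytes written. */
size_t encode_xor_r32(uint8_t dst, uint8_t src, uint8_t out[2])
{
    out[0] = 0x31;
    out[1] = modrm(3, src, dst);
    return 2;
}
```

For eax (register 0) as both operands this yields the bytes 31 C0.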