I don't post numbers anymore because I'd always wind up in arguments with people who simply didn't believe them, or thought I'd unfairly manipulated them, or cherry-picked the test cases, whatever. Hence I encourage you to run your own numbers.
I had some trouble using the Ubuntu 13 packages for gdc, so I downloaded the latest available binaries from the gdc project, as recommended by the readme.
Using that to compile warp with gdc with the flags it suggests (-release is not recognized by gdc, -O3 is), I get a warp that works.
For including every file in /usr/include/boost/*.hpp in one .cc file (which produces roughly 16 megabytes of C++ code), we get:
[dannyb@mainserver 12:40:56] ~ :) $ time gcc -E e.cc >f
In file included from e.cc:101:0:
/usr/include/boost/spirit.hpp:18:4: warning: #warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-Wcpp]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
gcc -E e.cc > f 3.18s user 0.25s system 97% cpu 3.528 total
[dannyb@mainserver 12:40:51] ~ :) $ time clang -E e.cc >f
In file included from e.cc:101:
/usr/include/boost/spirit.hpp:18:4: warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-W#warnings]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
1 warning generated.
clang -E e.cc > f 1.42s user 0.14s system 93% cpu 1.657 total
/usr/include/boost/spirit.hpp(18) : warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
./warp/fwarpdrive_gcc4_8_1 -I/usr/include -I/usr/include/c++/4.8 e.cc 2.88s user 0.06s system 95% cpu 3.080 total
I've repeated these timings 10 times, and they are within 0.5% of these numbers each time.
I've also tried this on a large C++ project I have that generates about 200 MB of preprocessed source (which I can't share, sadly) and got similar relative timings. I also tried it on some smaller projects.
Based on the data I have so far, clang blows warp out of the water by roughly a factor of 2 in most cases I've tried.
The above tests include stdout IO, but the relative numbers are the same without it:
[dannyb@mainserver 12:48:24] ~ :( $ time gcc -E e.cc -o f
In file included from e.cc:101:0:
/usr/include/boost/spirit.hpp:18:4: warning: #warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-Wcpp]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
gcc -E e.cc -o f 3.14s user 0.27s system 99% cpu 3.418 total
[dannyb@mainserver 12:48:33] ~ :) $ time clang -E e.cc -o f
In file included from e.cc:101:
/usr/include/boost/spirit.hpp:18:4: warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-W#warnings]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
1 warning generated.
clang -E e.cc -o f 1.41s user 0.13s system 94% cpu 1.631 total
[dannyb@mainserver 12:48:40] ~ :) $
(I reordered this one to put the timings in the same order as before)
[dannyb@mainserver 12:47:38] ~ :( $ time ./warp/fwarpdrive_gcc4_8_1 -o f -I/usr/include -I/usr/include/c++/4.8 -I/usr/include/x86_64-linux-gnu/c++/4.8 -I/usr/include/x86_64-linux-gnu -I/usr/lib/gcc/x86_64-linux-gnu/4.8/include/ -I/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed/ e.cc
/usr/include/boost/spirit.hpp(18) : warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
./warp/fwarpdrive_gcc4_8_1 -o f -I/usr/include -I/usr/include/c++/4.8 e.c 2.93s user 0.02s system 99% cpu 2.953 total
gcc: 3.14s user 0.27s system 99% cpu 3.418 total
clang: 1.41s user 0.13s system 94% cpu 1.631 total
warp: 2.93s user 0.02s system 99% cpu 2.953 total
2.31s (with recommended build settings)
Because of different #define's, Warp may take a very different path through header files than other preprocessors do. In fact, Warp doesn't have any predefined macros other than the ones required by the Standard. Hence, to use it with cpp or clang's preprocessor, it needs to be driven with a command that -D defines each macro.
There's a command (I forget at the moment what it is) that will tell cpp to list all its predefined macros. It's quite a few. You'll need to do that for clang to get an equivalent list, then drive Warp with that.
You'll be able to tell if it is taking the same path or not by using a diff on the outputs that ignores whitespace differences.
The reason Warp doesn't predefine all that stuff is because every install of gcc has a different list, and it's completely impractical to try and keep up with all that.
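For what it's worth, with gcc-compatible drivers (gcc, clang) the usual way to dump the predefined macros is to run the preprocessor with `-dM -E` on an empty input. A minimal sketch of turning that dump into `-D` arguments might look like the following. The helper names `predefined_macros` and `defines_to_flags` are made up for illustration, and whether Warp accepts function-like macros passed via `-D` is an assumption I have not checked.

```python
import subprocess


def predefined_macros(compiler="gcc", lang="c++"):
    """Ask a gcc-compatible driver to dump its predefined macros.

    Runs e.g. `gcc -dM -E -x c++ /dev/null` and returns the raw text,
    which consists of `#define NAME VALUE` lines.
    """
    result = subprocess.run(
        [compiler, "-dM", "-E", "-x", lang, "/dev/null"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def defines_to_flags(dump):
    """Turn `#define NAME VALUE` lines into `-DNAME=VALUE` arguments."""
    flags = []
    for line in dump.splitlines():
        # Split into '#define', the macro name, and an optional value.
        parts = line.split(None, 2)
        if len(parts) < 2 or parts[0] != "#define":
            continue
        name = parts[1]
        value = parts[2] if len(parts) == 3 else ""
        # `-DNAME=` defines NAME as empty; a bare `-DNAME` would define it as 1.
        flags.append(f"-D{name}={value}")
    return flags


# Hypothetical usage (needs gcc or clang on PATH):
#   flags = defines_to_flags(predefined_macros("clang"))
#   subprocess.run(["./warp", *flags, "-o", "f", "e.cc"])
```

The list differs per compiler, version, and target, so it has to be regenerated on each machine rather than copied around.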
I did, in fact, use warpdrive, which uses those predefines, as you can see in the commands.
I'm also familiar with the inner workings of LLVM and GCC (having hacked a lot on both), and generated the list of include paths I used with warpdrive (emulating gcc 4.8.1) to be exactly the same as the ones GCC 4.8.1 uses on my system.
I also verified the preprocessed output is "sane" in each case, as per diff.
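For anyone who wants to repeat that check, here is a rough sketch of a whitespace-insensitive comparison of two preprocessed outputs. It is only an illustration, not the exact check used above: the function names are made up, and the normalization (collapse whitespace runs, drop blank lines) is one assumed definition of "ignores whitespace differences". Linemarkers such as `# 1 "foo.h"` are left in, so they will show up as differences if the two tools emit them differently.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace on each line and drop blank lines."""
    lines = []
    for line in text.splitlines():
        squeezed = " ".join(line.split())
        if squeezed:
            lines.append(squeezed)
    return lines


def whitespace_insensitive_diff(a, b, name_a="a", name_b="b"):
    """Return unified-diff lines for two texts, ignoring whitespace changes."""
    return list(
        difflib.unified_diff(
            normalize(a),
            normalize(b),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Hypothetical usage with two output files:
#   with open("f_clang") as fa, open("f_warp") as fb:
#       for line in whitespace_insensitive_diff(fa.read(), fb.read(), "clang", "warp"):
#           print(line)
```

An empty result means the two outputs agree line for line once whitespace is normalized, which is a reasonable sign that both preprocessors took the same path through the headers.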
Thanks for doing this. I have read that clang uses some SIMD instructions to speed this up, and I don't know how much that contributes. Warp doesn't use any inline assembler.
And, as your numbers show, suggesting the change in compiler flags was entirely justified.
I didn't check to see if the instructions exist, but possibly :)
You do start to hit two issues, though, as you increase the size of the skipping:
1. Alignment
2. If the average block comment/line is < 64 characters, you may lose more time performing the instruction and then counting the trailing zeros in the result to find the place it ended.
I have no numbers to back up whether this matters, of course :)
AVX-512 does not seem to have PMOVMSKB, which is how I assume it is being done with SSE2. There are other ways to skin that cat, but it's unclear whether they have any advantage over using AVX2 with VPMOVMSKB.
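To make the "compare, take the mask, count trailing zeros" step concrete, here is a scalar model of it in Python. This is not real SIMD and says nothing about how clang actually does it; it only mimics what a byte-compare followed by PMOVMSKB would produce for one 16-byte block (the block size is an assumption matching SSE2), so the fixed per-block work is visible. All function names are made up.

```python
BLOCK = 16  # bytes examined per step, modelling one SSE2 register (assumption)


def movemask_eq(block, needle):
    """Model a byte-wise compare plus PMOVMSKB: bit i is set if block[i] == needle."""
    mask = 0
    for i, byte in enumerate(block):
        if byte == needle:
            mask |= 1 << i
    return mask


def count_trailing_zeros(mask):
    """Index of the lowest set bit; mask must be non-zero."""
    return (mask & -mask).bit_length() - 1


def find_byte(data, needle, start=0):
    """Return the index of the first `needle` at or after `start`, or -1."""
    pos = start
    while pos < len(data):
        # A real implementation would also have to worry about whether
        # `pos` is aligned; this model just slices from wherever it is.
        mask = movemask_eq(data[pos:pos + BLOCK], needle)
        if mask:
            # The trailing-zero count gives the offset of the match in the block.
            return pos + count_trailing_zeros(mask)
        pos += BLOCK
    return -1


# Hypothetical usage: find the end of a line in a source buffer.
#   data = b"/* a short comment */\nint x;\n"
#   end = find_byte(data, ord("\n"))
```

Every block costs one compare, one mask extraction, and (on a hit) one trailing-zero count, regardless of how early in the block the match is. If most lines or comments end within the first few bytes, a wider block does that fixed work without skipping any further, which is the second concern listed above.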