> warp is currently able to preprocess many source files, one after the other, in a single command. [...] warp is set up/torn down only once for the entire set of files, rather than once for each source file.
I'd like to learn more about this. I spend a fair amount of time building on HPC systems. Frustratingly, compiling on a $100M computer is typically 50x slower than compiling on a laptop due to the atrocious metadata performance of the shared file system. Configuration is even worse because there is typically little or no parallelism. Moving my source tree to a fast local disk barely helps so long as system and library headers continue to reside on the slow filesystem. A compiler system that transparently caches file accesses across an entire project build would save computational scientists an enormous amount of time building on these systems.
This is why I asked Walter about use of asynchronous idioms in Warp. It ought to help even if there is no parallelism involved. It is perhaps a TODO. Walter would be able to elaborate.
So your build speed is limited by metadata file accesses to network storage? Your network/storage must be really bad then. How about mirroring all the needed headers locally before building? You could establish that as a makefile rule.
Could you describe in a bit more detail how the ranges-and-algorithms style applies to Warp? What are the major algorithms that you are gluing together? What is the high-level design of Warp?
> how the ranges-and-algorithms style applies to Warp?
First off, a text preprocessor is a classic filter program, and ranges-and-algorithms is a classic filter program solution. Hence, if that didn't work out well for Warp's design, that would have been a massive failure.
The classic preprocessor design, however, is to split the source text up into preprocessing tokens, process the tokens, and then reconstitute output text from the token stream.
Warp doesn't work like that. It deals with everything as text, and desperately tries to avoid tokenizing anything. The ranges used are all ranges of text in various stages of being preprocessed. A major effort is made to minimize any state kept around, and to avoid doing memory allocation as much as possible.
Warp doesn't use many classic algorithms. They're all custom ones built by carefully examining the Standard's description of how it should work.
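(For anyone not familiar with the D idiom: below is a minimal, made-up sketch of the ranges-and-algorithms style, not Warp's actual code. Each stage is a lazy range wrapping the previous one, so the text is only walked once and nothing is materialized until the end.)

    import std.algorithm : filter, map;
    import std.array : array;
    import std.stdio : writeln;
    import std.string : stripRight;

    void main()
    {
        // Hypothetical pipeline: each stage is a lazy range over lines of text.
        auto source = ["int x = 1;   ", "", "   int y = 2;"];

        auto output = source
            .map!(line => line.stripRight)      // one stage: trim trailing blanks
            .filter!(line => line.length != 0)  // another stage: drop empty lines
            .array;                             // only here is memory allocated

        writeln(output);
    }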
So is the main data processing pipeline broken into several decoupled stages ("algorithms") that pass buffer pointer/length pairs between them ("ranges")? Sorry if I'm misunderstanding, I'm not very familiar with these terms as they are used in D.
What are the main stages/algorithms that constitute the processing pipeline?
I'm mainly trying to understand the high-level roadmap/design enough that I can use the source code itself to answer my more detailed questions. :)
The main stages correspond roughly to the "translation phases" described in the Standard. If you're familiar with those, such as \ line splicing, the source code will make more sense. There are also things like "Rescanning and further replacement" mentioned in the spec that are implemented, for example, by macroExpand. I tried to stick with terminology used in the Standard.
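(For readers who haven't looked at the translation phases: here's a rough, hypothetical sketch of what phase 2, the \ line splicing, does. It's written as a plain D function rather than in Warp's range style, and the names are invented for the example.)

    import std.algorithm : endsWith;
    import std.stdio : writeln;

    // Sketch of translation phase 2: a backslash at the end of a physical
    // line splices it onto the next, forming one logical line before any
    // directive or macro processing happens.
    string[] spliceLines(string[] physicalLines)
    {
        string[] logicalLines;
        string pending;
        foreach (line; physicalLines)
        {
            if (line.endsWith("\\"))
                pending ~= line[0 .. $ - 1];   // drop the backslash, keep accumulating
            else
            {
                logicalLines ~= pending ~ line;
                pending = null;
            }
        }
        if (pending.length)
            logicalLines ~= pending;           // file ended mid-splice
        return logicalLines;
    }

    void main()
    {
        auto src = ["#define MAX(a, b) \\",
                    "    ((a) > (b) ? (a) : (b))",
                    "int x;"];
        writeln(spliceLines(src));
    }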
Since Warp is meant to be a drop-in replacement for GCC's cpp, do you plan to include a traditional (pre-standard) preprocessing mode? This has been a source of some agony in FreeBSD, preventing clang's cpp from fully replacing GCC cpp on some ports (mostly X11 related due to imake). So far the partial solution has been ucpp.
It seems that it's not so much about pre-1989 "code," but rather Makefiles. ANSI C preprocessors like yours will convert tabs to spaces, and Make doesn't like that. It seems reasonable to let this use case be handled the way it always has been: by using an old tool that still works.
The other major area of difference seems to be that pre-ANSI people used foo/**/bar to paste tokens (whereas now we use ##). If we're talking about C, that's easy to update; apparently some Haskell folks can't do that for their own reasons. Again, it's a use case which is not preprocessing of C or C++, so it seems OK to ignore it if you're implementing a C preprocessor (as opposed to a generic macro expander usable with Makefiles and Haskell).
I wrote a 'make' program in the 80's (I still use it http://www.digitalmars.com/ctg/make.html) and it has a macro processor in it, but it is not like the old C preprocessors.
Warp does support the obsolete gcc-style varargs, but it discards other obsolete practices.
In any case, I haven't seen any Makefiles big enough to benefit from faster preprocessing.
According to your bio you're a lapsed ME, like myself. Do you think not coming from a CS background gives you an opportunity for novel solutions in compiler development work?
[My first real project used Zortech C++ for OS/2, thanks for the fond memories...]
I think not coming from a CS background has resulted in me reinventing the wheel on several occasions. (I thought I'd had a brilliant insight, only to be told it was well known.)
I'm credited with a few innovations (like NRVO), but for the most part I just put together novel combinations of other people's ideas :-)
On a related topic, could you tell us about coroutines/fibres in D: how are they implemented? Can they be called from C? Are they used in D's standard library? Examples of a few notable use cases (I guess one would be Vibe.d) and of asynchronous idioms in D would be welcome.
Since that's a lot of questions, pointing to documents would be fine too.
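(Not speaking for Walter, but for anyone curious in the meantime, here's a minimal sketch of D's core.thread.Fiber, which I believe is what Vibe.d builds on. Scheduling is purely cooperative: call() switches into the fiber, Fiber.yield() switches back out.)

    import core.thread : Fiber;
    import std.stdio : writeln;

    void main()
    {
        // A fiber owns its own stack; call() resumes it, Fiber.yield()
        // suspends it and hands control back to the caller.
        auto producer = new Fiber({
            foreach (i; 0 .. 3)
            {
                writeln("produced ", i);
                Fiber.yield();
            }
        });

        while (producer.state != Fiber.State.TERM)
            producer.call();
    }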
Do you employ any detection of boost preprocessor library use, so that e.g. BOOST_PP_FOREACH gets processed as a literal loop rather than a large expansion?
Nope. Boost is just another bucket of chum as far as Warp is concerned. I did use Boost extensively to validate correct operation of Warp and profile execution speed.
Hard problems for C++ compilers are keeping track of where a piece of code originated and providing good diagnostics in code expanded from macros. Can warp help with this or was speed such a paramount goal that the rest of the tool-chain cannot provide good diagnostics as soon as macros are involved?
Will Warp's design allow modes to be added, where it can be used as a preprocessor replacement for compilers other than gcc (msvc, clang)? Would you accept such patches? I know mcpp (the only other standalone C preprocessor I've seen) tried to do this.
Other than the processing speed, is there anything about how Warp operates that would make it easier to implement a distributed build system? It seems like the preprocessor can sometimes play a role in whether builds are deterministic enough to be successfully cacheable (__FILE__ macros and such).
Warp only predefines the macros required by the Standard. All the other macros that cpp predefines are passed to Warp on the command line via a script file.
The same should work for any other compiler, provided they don't use preprocessors with custom behavior.
I don't see any reason why Warp cannot be used in a distributed build system.
I find it a bit weird that the top comment in this story fawns over several star programmers, but ignores the guy who actually wrote the tool that the story is about.
Those are the greats today, and with a talent pool like that it's likely that many future greats will be created there. You have to do more than write good code to be legendary.
Maybe Facebook will end up being the Xerox PARC of our modern times.
There was a time at Xerox PARC when you couldn't throw a book without hitting someone who was or would become an important pioneer.
I'm not trying to discredit what Facebook is doing, or the people there, but it'll take time to build up to that level of talent. Having a few remarkable individuals is a great start. Having an entire department filled with them is going to be hard work.
Nothing compares to Bell Labs. Google is an amazing software and networking company; however, the majority of its contributions are in the field of Computer Science, specifically in distributed computing and networking.
Bell Labs made significant contributions to Physics (solid-state and optics), Chemistry, Electronics, Computer Science (operating systems, graphics, speech recognition, networking), Materials Science, Communications (here they pretty much invented an entire field of study) and the design of so many everyday things (they had an entire team that focused on just the design of your telephone wire) that it would be impossible to imagine what the world would be like if they didn't exist.
If Google didn't exist, search would still have been solved, maybe a decade or so later. If Claude Shannon hadn't worked at Bell Labs when he did, it could well be the case that we wouldn't have come up with such an elegant theory of information even today, and therefore the world would look nothing like it does.
EDIT: Unsure why my parent comment was deleted, but I was suggesting that trading cards of famous programmers with their accomplishments on the back would be fun(ny).
But Warp is designed to be a drop-in replacement. It doesn't produce char-by-char exactly the same output as cpp, as the whitespace differs, and the decisions about when/where to produce linemarker records are different.
But the output is functionally identical (any differences are bugs in either Warp or cpp).
That fits well within my use of the term "reimplementation." I appreciate the specificity of your response, though - always best to answer every possible question when you're not sure which is meant.
Top-of-trunk clang from http://llvm.org/apt/ takes ~2.1 seconds on my machine to preprocess the file to /dev/null. warp (compiled with -O4 -frelease -fno-bounds-check) takes ~2.8 seconds.
So this test case is faster with clang even without precompiled headers. It is hard to make a benchmark for clang's precompiled headers because the AST is lazy-loaded from the PCH. You would have to actually have code use values from the header.
EDIT: I forgot another advantage of clang: the preprocessing and compilation and assembly are all done within the same process, eliminating process creation overhead.
When they're usable for C++, there should be little reason to use any special preprocessor, since every header file (not just a static common set) need only be compiled once into a binary format rather than being included into N source files. It can't happen soon enough for me... but as of recently, they're still very broken, so I'm still waiting.
Clang modules for C++ are going nowhere as long as the relevant WG21 working group isn't proposing anything.
While it seems that everybody has almost the same idea of how modules should look as part of the language, almost no one can agree on how they should be specified, or what part of a module should actually be specified at all.
I was under the impression that clang is not waiting for standardization to implement this stuff (indeed, C frameworks on OS X already ship module.map files). Am I wrong?
AFAIK the main reason for implementing modules for C++ was to show that it can be done, which makes standardization of a major feature much more likely.
As long as there is only one compiler and no guarantee that you will not have to change everything once again as standardization is complete no one is going to touch a large cross-platform codebase and add module support.
Well, if the performance gain is enough, I certainly would. (The codebase I'm thinking of is not huge, but large enough that I think modules would make a significant compile speed difference.) I think that with the current implementation, if your header files are sane (no depending on previously included files) you can autogenerate a module map file, one module per header, and have it just work. But I'm not sure if there are any wrinkles, because in the current state all I was able to achieve was clang crashing.
Sure, but is that using PCHs? And how much of that time is preprocessing? I don't think the parent question is whether clang can compile any project in less than 3 hours. I think the interesting question is if warp is more compelling than clang from a preprocessing perspective.
I mean, I do too, but they take even longer with gcc. I can remember when compiling could take a full day. These metrics mean nothing without a comparison. Clang is hands down the fastest C/C++ compiler I've ever used.
The compile speed difference between a modern GCC and Clang/LLVM is nowhere near that dramatic; the comparison you linked is against GCC 4.2.1 and GCC 4.0, which are 7 and 9 years old respectively.
I really only meant to refer to the -fsyntax-only portion, which is all that matters when it comes to C preprocessor performance. I know GCC's codegen has steadily improved, but I'm not aware of any work to speed up the parser.
Which other languages use the C/C++ preprocessor? I've seen it used for generating data or other source code when a preprocessor comes in handy, but never as a full-fledged component of another language.
I also think that a modern language that uses includes instead of modules is just outright insane.
Objective-C introduced `#import`, which guards against double inclusion by default. I would suspect any performance gains are tempered in a "pure" ObjC project.
The Glasgow Haskell Compiler has a language pragma for running CPP before compilation. I've seen it used a few times, for #ifdef compatibility between Windows and Linux for low-level stuff. I don't think anyone would ever want to use #include or macros, so the issue of speed is not very interesting in that context.
Could you substantiate this claim? I don't necessarily not believe you, but it's an easy claim to make without any quantifiable argument. It's also hard to test against, e.g., clang, where preprocessing is performed in the same process as the compiler and assembler, and the performance of the preprocessor alone is really less interesting than total compile time.
Let me clarify: could you substantiate the performance claims? I have no doubt it's a correct preprocessor, but I do have doubts about performance claims in the context of the compilation of a project. cpp (and the rest of the GNU compiler toolchain) itself is a pretty terrible example of a well-written program, but clang is a well-written, holistic, performance-centered program, and it would strike me as difficult for an isolated preprocessor to improve significantly on clang's built-in one.
Or, to put it another way: you speak compellingly about both the algorithmic side and the constant-factor side (no tokenizing), but you don't speak about "real world" performance when interacting with many levels of caches, processes, and filesystems. Could you speak to this at all?
Warp is written in D, by the guy who created the language. It doesn't seem weird that Andrei would want to talk about that (especially considering the fact that I assume he's very much interested in seeing D in use in more places).
I had the same question about how it compares to clang though. I suppose now that it's open source, someone can do some testing.
" I assume he's very much interested in seeing D in use in more places"
Yes, I think you are correct, and this is what I was getting at. It seems like a propaganda exercise to promote D; you only have to look at the end of the article to see this: "And join the D language community for the D Conference 2014 on May 21-23 in Menlo Park, CA."
For the record, I am very much aware of who Andrei is and the influence he has had on certain languages.
I'd be curious to see how clang does, too. What matters to us is that warp is easy to get into, so we can easily adapt it to our build system (in particular, multithreaded preprocessing that saves on opening the same included files multiple times).
I don't post numbers anymore because I'd always wind up in arguments with people who simply didn't believe them, or thought I'd unfairly manipulated them, or cherry-picked the test cases, whatever. Hence I encourage you to run your own numbers.
I had some trouble using the Ubuntu 13 packages for gdc, so I downloaded the latest binaries available from the gdc project, as recommended by the readme.
Using that to compile warp with gdc, with the flags it suggests (-release is not recognized by gdc, -O3 is), I get a warp that works.
For including every file in /usr/include/boost/*.hpp in one .cc file (which produces roughly 16 megabytes of C++ code), we get:
[dannyb@mainserver 12:40:56] ~ :) $ time gcc -E e.cc >f
In file included from e.cc:101:0:
/usr/include/boost/spirit.hpp:18:4: warning: #warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-Wcpp]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
gcc -E e.cc > f 3.18s user 0.25s system 97% cpu 3.528 total
[dannyb@mainserver 12:40:51] ~ :) $ time clang -E e.cc >f
In file included from e.cc:101:
/usr/include/boost/spirit.hpp:18:4: warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-W#warnings]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
1 warning generated.
clang -E e.cc > f 1.42s user 0.14s system 93% cpu 1.657 total
/usr/include/boost/spirit.hpp(18) : warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
./warp/fwarpdrive_gcc4_8_1 -I/usr/include -I/usr/include/c++/4.8 e.cc 2.88s user 0.06s system 95% cpu 3.080 total
I've repeated these timings 10 times, and they are within 0.5% of these numbers each time.
I've also tried this on a large C++ project I have that generates about 200 meg of preprocessed source (which I can't share, sadly) and got similar relative timings. I also tried it on some smaller projects.
Based on the data I have so far, clang blows warp out of the water by a factor of 2 in most cases I've tried.
The above tests include stdout IO, but the relative numbers are the same without it:
[dannyb@mainserver 12:48:24] ~ :( $ time gcc -E e.cc -o f
In file included from e.cc:101:0:
/usr/include/boost/spirit.hpp:18:4: warning: #warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-Wcpp]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
gcc -E e.cc -o f 3.14s user 0.27s system 99% cpu 3.418 total
[dannyb@mainserver 12:48:33] ~ :) $ time clang -E e.cc -o f
In file included from e.cc:101:
/usr/include/boost/spirit.hpp:18:4: warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp" [-W#warnings]
# warning "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
^
1 warning generated.
clang -E e.cc -o f 1.41s user 0.13s system 94% cpu 1.631 total
[dannyb@mainserver 12:48:40] ~ :) $
(I reordered this one to make the timings in the same order as they were before)
[dannyb@mainserver 12:47:38] ~ :( $ time ./warp/fwarpdrive_gcc4_8_1 -o f -I/usr/include -I/usr/include/c++/4.8 -I/usr/include/x86_64-linux-gnu/c++/4.8 -I/usr/include/x86_64-linux-gnu -I/usr/lib/gcc/x86_64-linux-gnu/4.8/include/ -I/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed/ e.cc
/usr/include/boost/spirit.hpp(18) : warning: "This header is deprecated. Please use: boost/spirit/include/classic.hpp"
./warp/fwarpdrive_gcc4_8_1 -o f -I/usr/include -I/usr/include/c++/4.8 e.c 2.93s user 0.02s system 99% cpu 2.953 total
gcc: 3.14s user 0.27s system 99% cpu 3.418 total
clang: 1.41s user 0.13s system 94% cpu 1.631 total
warp: 2.93s user 0.02s system 99% cpu 2.953 total
warp (rebuilt with the recommended build settings): 2.31s
Because of different #define's, Warp may take a very different path through header files than other preprocessors do. In fact, Warp doesn't have any predefined macros other than the ones required by the Standard. Hence, to use it with cpp or clang's preprocessor, it needs to be driven with a command that -D defines each macro.
There's a command (I forget at the moment what it is) that will tell cpp to list all its predefined macros. It's quite a few. You'll need to do that for clang to get an equivalent list, then drive Warp with that.
You'll be able to tell if it is taking the same path or not by using a diff on the outputs that ignores whitespace differences.
The reason Warp doesn't predefine all that stuff is because every install of gcc has a different list, and it's completely impractical to try and keep up with all that.
I did, in fact, use warpdrive, which uses those predefines, as you can see in the commands.
I'm also familiar with the inner workings of LLVM and GCC (having hacked a lot on both), and I generated the list of include paths I used with warpdrive (emulating gcc 4.8.1) to be exactly the ones GCC 4.8.1 on my system uses.
I also verified the preprocessed output is "sane" in each case, as per diff.
Thanks for doing this. I have read that clang uses some SIMD instructions to speed this up, and I don't know how much that contributes. Warp doesn't use any inline assembler.
And, as your numbers show, suggesting the change in compiler flags was entirely justified.
I didn't check to see if the instructions exist, but possibly :)
You do start to hit two issues, though, as you increase the size of the skips:
1. Alignment
2. If the average block comment/line is < 64 characters, you may lose more time performing the instruction and then counting the trailing zeros in the result to find the place it ended.
I have no numbers to back up whether this matters, of course :)
AVX-512 does not seem to have PMOVMSKB, which is how I assume it is being done with SSE2. There are other ways to skin that cat, but it's unclear whether they have any advantage over using AVX2 with VPMOVMSKB.
A comparison against VC++ would also be interesting. I believe VC++ is commonly used on Windows, which I hear is a somewhat popular platform. (I suppose warp might not run on Windows though.)
In fact I developed Warp on Windows. It compiles and works fine. Warp source code is completely portable between Linux and Windows, with the following exceptions:
1. wchar_t is ushort on Windows, uint on Linux.
2. Warp uses a slightly modified file reader from the one in the D standard library, customized to the platform.
You know of a project that takes 3 hours of time to preprocess with clang?
I have serious doubts. Overall compilation time is kind of irrelevant to this discussion, because Warp is just a preprocessor.
While you can get some speedup over gcc by replacing the preprocessor, preprocessing is usually only a 0.2-0.5 fraction of overall compilation time, depending on the size of the file, so even an infinitely fast preprocessor would cap the overall win at roughly 1.25-2x.
I expect the gains warp gets over gcc overall from preprocessing to be similar to those clang gets over gcc overall from preprocessing.
(Though it depends on the size of files being compiled, etc).
Most companies that want actually fast overall compilation and have the resources build caching and distributed compilation infrastructure (Google, Facebook).
As mentioned, if warp were really so much faster than clang's preprocessor that it mattered, clang would be fixed :)
What is this even supposed to mean? Is the majority of that time spent preprocessing? How do other compilers perform? Your anecdote alone is absolutely useless.
Preprocessing with clang would be very tricky in a system that uses gcc for the compilation proper (which we do). Clang adds its own predefined #defines, which may change the resulting code.
Sure, if you are stuck with GCC, and have no plans to move, it makes sense to improve GCC's preprocessor for a number of reasons (warp produces cacheable artifacts, etc).
But it's actually not that tricky, since you can just change the defines it makes (after all, if you are maintaining your own toolchain, you are maintaining your own toolchain).
In fact, you are already doing it with warp to emulate GCC's defines.
Suffice to say, we've done it before to provide clang diagnostics but build with GCC.
Yes, and no offense, but while other large companies have folks working on fixing this in public, I don't see Facebook trying.
This is really not meant as a dig (really!), but more of a "why I figured Facebook was not trying to make a transition". The companies trying to do so are contributing heavily to LLVM to make that transition :)
You can undefine predefined macros in clang using -U. I took a little time to write some commands that should produce something close to the right defines/undefines to let one compiler's preprocessor masquerade as another compiler's preprocessor:
I appreciate this piece of feedback. I've been consciously trying to keep the article interesting technically and to avoid it being construed as an advertisement for D, to the extent that a couple of coworkers and at least one other colleague (http://goo.gl/QZ5ELn) were unclear about warp being written in D at all. I'll do my best to tone things down further in the future. There's plenty of exciting stuff going on, and it's not worth alienating people.