It'd be interesting to see these compared to locking versions. The benchmarks, particularly on Windows, show that the library is young and still needs some algorithmic tweaks, but in any case it's great that there's an open source implementation of algorithms that have a reputation for being difficult to get correct.
Presumably the benchmarks would indicate immaturity by being slow, but it's not obvious what they can be compared with. Are there other implementations of the same algorithms, with identical benchmarks?
It uses either compare_and_swap or load_linked/store_conditional. I thought compare_and_swap had a horrible (i.e. very slow) implementation on x86. Any idea whether load_linked/store_conditional is any better?
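CAS is slow because on x86 it is a locked instruction (LOCK CMPXCHG) that acts as a full memory barrier. If you depend on whether it succeeded or failed, the instructions after it have to wait until it has been resolved. Also, the update has to be pushed out at least to the cache, which means the core needs exclusive ownership of that cache line.
So yes, it's slow, but without those guarantees it would be of no use.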
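To make the dependency on the result concrete, here is a minimal sketch in C++ using std::atomic. The helper name increment_below and its limit parameter are made up for illustration and are not taken from the library under discussion. On x86 a compare_exchange_strong like this typically compiles to LOCK CMPXCHG, and the branch right after it cannot be resolved until the instruction has completed.

```cpp
#include <atomic>

// Hypothetical helper (not from the library): atomically increment
// `counter` unless it has already reached `limit`.
// Returns true if this call performed the increment.
bool increment_below(std::atomic<int>& counter, int limit) {
    int seen = counter.load(std::memory_order_relaxed);
    while (seen < limit) {
        // The branch depends on whether the CAS succeeded, so the code
        // after it has to wait for the result.
        if (counter.compare_exchange_strong(seen, seen + 1)) {
            return true;
        }
        // On failure, `seen` has been updated to the value another
        // thread stored, so the loop re-checks the limit and retries.
    }
    return false;
}
```

A plain fetch_add cannot express the "only if below the limit" condition, which is why this kind of update needs a CAS loop in the first place.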
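load_linked and store_conditional are not much better, and x86 does not offer them at all; they are found on architectures such as ARM, POWER and MIPS. Depending on the implementation, they can fail spuriously because something poked the cache, there was memory traffic, etc.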
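As a rough sketch of how code copes with spurious failure: C++ exposes it through compare_exchange_weak, which is allowed to fail even when the value matches, so it is always called in a retry loop. The helper name atomic_max is made up for illustration. On LL/SC machines the weak form usually maps to a single load_linked/store_conditional pair, but that mapping is up to the compiler and is an assumption here.

```cpp
#include <atomic>

// Hypothetical helper: raise `target` to at least `value`.
void atomic_max(std::atomic<int>& target, int value) {
    int seen = target.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, i.e. even when
    // `target` still equals `seen`. On any failure it reloads `seen`,
    // so the loop re-checks the condition and tries again.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
        // Empty body: all the work happens in the loop condition.
    }
}
```

A spurious failure only costs one extra trip around the loop, so the result is still correct; the price is paid in retries under contention.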