
Contention is not binary, or, rather, its effects aren't binary. A lock that is held for 10 microseconds potentially causes a lot more contention than a CAS that takes a few nanos. OTOH, thousands of short contentions are as bad as, or worse than, a few long contentions (there's a whole great talk by Doug Lea about how Java's fork/join tricks modern CPUs out of crazy power-saving optimizations to reduce thread wakeup time).
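
To make the difference concrete, here's a minimal Java sketch (ContentionShapes and slowRecompute are made-up names, just for illustration). A failed CAS is contended for a few nanos and simply retries; the lock serializes everyone behind the full critical section:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.ReentrantLock;

    class ContentionShapes {
        private final AtomicLong counter = new AtomicLong();
        private final ReentrantLock lock = new ReentrantLock();
        private long state;

        // Contended for a few nanos at a time: a failed CAS just retries.
        long casIncrement() {
            long v;
            do {
                v = counter.get();
            } while (!counter.compareAndSet(v, v + 1));
            return v + 1;
        }

        // Contended for the whole critical section: every other
        // thread waits out the full ~10us.
        long lockedUpdate() {
            lock.lock();
            try {
                state = slowRecompute(state); // imagine ~10us of work here
                return state;
            } finally {
                lock.unlock();
            }
        }

        private long slowRecompute(long s) { return s + 1; } // stand-in
    }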

Intel's TBB was modeled after java.util.concurrent, BTW. Later, the Java memory model inspired the C++11 memory model (and the Java people were called in to help with that, too). When it comes to concurrency, Java is usually at least 5 years ahead of C++ (perhaps with the exception of fork/join, which was inspired by Cilk, but has since come to surpass it).

> I'm not sure what is so hard about Hazard Pointers

Hazard pointers are expensive without RCU (the hazards need to be visible to other threads, which means store-load barriers), and RCU requires kernel cooperation. Besides, this entire class of techniques is either very limited or incredibly complicated (what do you do if you have thousands of threads?). When it comes to lock-free data structures, a good GC vs. hazard pointers (with or without RCU) is like driving down a highway vs. stumbling through a swamp.
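
For reference, the read-side handshake looks something like this (a sketch in Java just to show the protocol; in Java the GC makes the whole exercise unnecessary, which is exactly my point). The expensive part is that publishing the hazard must be ordered before the validating re-read, which is the store-load barrier:

    import java.util.concurrent.atomic.AtomicReference;

    class HazardPointerSketch<T> {
        final AtomicReference<T> shared = new AtomicReference<>();

        // In a real implementation this is one slot per thread, and a
        // reclaiming thread must scan every slot before freeing a node.
        volatile T hazard;

        T acquire() {
            T p;
            do {
                p = shared.get();
                hazard = p;              // publish the hazard; the volatile
                                         // store costs a store-load barrier...
            } while (p != shared.get()); // ...so this re-read can't pass it
            return p;
        }

        void release() {
            hazard = null;
        }
    }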

Here's Herb Sutter on the issue (source: http://herbsutter.com/2011/10/25/garbage-collection-synopsis...):

All existing C++ garbage collectors, notably the Boehm collector, aren’t enough for many interesting uses for several reasons, the main one being that they’re conservative, not accurate. See Hans Boehm’s page Advantages and Disadvantages of Conservative Garbage Collection for more discussion of tradeoffs.

As usual, there’s something to standardize iff we want to enable portability. So in this case the question is whether it’s interesting to enable portable programs that rely on GC — and note that typically if they rely on GC, they rely specifically on accurate GC, not conservative “maybe-GC.” For example, C++11 atomics enable writing many kinds of portable lock-free code, but not all; we need a standard for #2 if we want to enable people to write portable lock-free singly-linked lists, where the only known practical algorithms require automatic (non-refcount) GC and so they cannot be implemented in portable C++ today without some nonportable platform-specific code underneath.

Again, appeals to hazard pointers fail because the obscure convolutions and limitations of hazard pointers only expose the limits of even the most heroic efforts to work around the lack of true automatic GC. The main contribution of hazard pointers, unfortunately, is that they thoroughly motivate why direct support for automatic GC is needed.




> A lock that is held for 10 microseconds potentially causes a lot more contention than a CAS that takes a few nanos.

See, if you're trying to compare those two things, you're comparing a lot more than locking vs lock-free; you're comparing two completely different designs. All you can do with a single CAS is a very primitive operation like queue enqueue/dequeue, stack push/pop, set add/remove, etc. If you're holding a lock for 10us you're doing a whole lot more than that; you're doing something that isn't possible to do lock-free, and your design is inherently less concurrency-friendly.
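
That kind of single-CAS operation looks like this (the classic Treiber stack, sketched in Java; class name mine):

    import java.util.concurrent.atomic.AtomicReference;

    // Treiber stack: each operation is a single CAS on the head pointer.
    class TreiberStack<T> {
        private static final class Node<T> {
            final T item;
            Node<T> next;
            Node(T item) { this.item = item; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void push(T item) {
            Node<T> n = new Node<>(item);
            do {
                n.next = head.get();
            } while (!head.compareAndSet(n.next, n));
        }

        T pop() {
            Node<T> h;
            do {
                h = head.get();
                if (h == null) return null; // empty
                // In C++, reading h.next here is where you'd need hazard
                // pointers or RCU; in Java the GC keeps h alive for us.
            } while (!head.compareAndSet(h, h.next));
            return h.item;
        }
    }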

If your experience comes from taking a coarse-grained mutex-heavy design in C++ and reimplementing it in Java with a design that communicates between threads only with lock-free data structures, then your results aren't indicative of locking vs. lock-free or C++ vs. Java.

If you really want to compare locks vs. lock-free apples-to-apples, then use data structures that have the same API as your lock-free data structures, but are implemented by taking a lock for a very short amount of time -- just enough time for the operation you would have performed lock-free.
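
Concretely: take the Treiber stack above and replace the CAS with a lock held for the same handful of instructions (again a sketch, names mine):

    import java.util.concurrent.locks.ReentrantLock;

    // Same API as the lock-free stack, but the lock is held only for
    // the few instructions the CAS version would have executed.
    class BrieflyLockedStack<T> {
        private static final class Node<T> {
            final T item;
            Node<T> next;
            Node(T item) { this.item = item; }
        }

        private final ReentrantLock lock = new ReentrantLock();
        private Node<T> head;

        void push(T item) {
            Node<T> n = new Node<>(item);
            lock.lock();
            try {
                n.next = head;
                head = n;
            } finally {
                lock.unlock();
            }
        }

        T pop() {
            lock.lock();
            try {
                if (head == null) return null;
                Node<T> h = head;
                head = h.next;
                return h.item;
            } finally {
                lock.unlock();
            }
        }
    }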

And if you want to compare C++ vs. Java, use an equally concurrency-friendly design with both. If you're using high-performance lock-free data structures in Java, use the same in C++, since they are available in both. The memory management complications can be confined to the container library; at the application level exclusive ownership of objects can be passed from thread to thread.


Obviously. But lock-free data structures in C++ are not as good as their Java counterparts (because of GC), and besides, most lock-free data structures are implemented in C++ only several years after they're implemented in Java, if at all.

For example, Java's nonblocking hash map (by Cliff Click) is still not available for C++, 6 years after its release. There are a few very limited/half-baked implementations, but nothing "canonical" (and nothing so thoroughly researched, optimized and used).

BTW, part of the reason is that the world's leading low-level concurrency researchers and implementers (Doug Lea, Nir Shavit and others) do most of their work in Java (most of my own work is designing non-blocking data structures; not that I'm comparing myself to Lea or Shavit in any way).

To get a sense for the quality of Java data structure implementations, I strongly suggest you watch this talk by Doug Lea: https://www.youtube.com/watch?v=sq0MX3fHkro

His work is incredible. Even if you don't use Java, there is so much to learn in the talk. One of the coolest things there (which I already hinted at) is how to spin properly on modern Intel processors.


I appreciate the link to the talk; I learned a lot. It just reinforces my beliefs, though: so much of the talk is about having to work around the idiosyncrasies of the JVM. Quotes from the talk:

4:15 "The main moral of this talk is that things are never what they seem, and they are increasingly not even close to what they seem as we have more and more layers and systems. So there's you on top of core libraries who live on top of JVMs, and the reason I know all the JVM folks so much is because I yell at them pretty much all the time."

10:30 "Oh my god, [because we derive from java.util.Collection] we have put more lines of code into supporting a method that nobody in their right mind would call for a concurrent queue, than almost everything else combined. Between iterators and interior remove, that's probably 80% of our concurrent queue code. The algorithms themselves, they're maybe 20 lines, they're cool but they're not a lot of code."

0:57:10 "How do you get a good CAS loop? CAS until success? Don't do while, do do/while! Why? Safepoints! Remember safepoints I mentioned long ago? Yeah, they're not your friend. [...] You know the great thing about JITs is that they work so well, and the bad thing about JITs is that, you know, sometimes they're idiots! [...] Boxing, oh my god I hate boxing. Classloading appearing at mysterious times."
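
(For reference, the do/while shape he's describing looks something like the following. This is my reading of the quote as a sketch, not code from the talk: the first attempt runs straight-line, before the loop back-edge where HotSpot may place a safepoint poll.)

    import java.util.concurrent.atomic.AtomicLong;

    class CasLoopShape {
        private final AtomicLong value = new AtomicLong();

        // do/while: the first CAS attempt happens before the back-edge.
        long addAndGet(long delta) {
            long v, next;
            do {
                v = value.get();
                next = v + delta;
            } while (!value.compareAndSet(v, next));
            return next;
        }
    }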

Maybe to you the JVM gives enough benefits to be worth giving up all this control, but to me it's just nuts that you'd put yourself on top of a platform that you have to work against so much.

And I still don't buy that GC makes for better lock-free data structures. Lock-free data structures may need to have a GC-like scheme internally to memory-manage their nodes, but I can't see any reason why this requires the entire program to be subject to GC.


> I can't see any reason why this requires the entire program to be subject to GC.

It doesn't. Lots of low-level Java programs (DBs, caches, etc.) manage most of their heap manually. But an "internal" GC would have to be just as good as a "whole program" GC anyhow in order to make the non-blocking data structure robust (I generally don't use the term lock-free, as it's a very specific case of a non-blocking data structure; many interesting non-blocking data structures are only obstruction-free rather than lock-free).

Also, GCs have other advantages, such as a much higher memory allocation-reclamation throughput than any general manual scheme, especially (though not only) in the multi-threaded case. Many GCs may introduce pauses, but all the good ones would give you a much higher throughput than anything you can achieve manually.

> Maybe to you the JVM gives enough benefits to be worth giving up all this control, but to me it it's just nuts that you'd put yourself on top of a platform that you have to work against so much.

Yes, to anyone doing hardcore multicore programming, the JVM is probably the best choice. You are not giving up control, though, because C++ has far worse problems in those areas (forget about safepoints: C++ didn't even have a memory model until C++11, and most C++ programmers don't even know what ordering guarantees their particular compiler provides).

Also, if you think you're not "giving up control" in C++ (or C, or Assembly), then you're sorely mistaken. You have little or no control over ILP, prefetching, branch prediction, outstanding cache misses and all the magic modern CPUs do at runtime. You could think of the JVM as an extension of the CPU, only in software; modern CPUs do so much crazy stuff to your code anyway that you can't possibly know exactly how it will be executed, even if you program directly in machine code. You can no longer get truly close to the metal, so being one step rather than two steps removed is not much of an advantage any more.

> "Oh my god, [because we derive from java.util.Collection] we have put more lines of code into supporting a method that nobody in their right mind would call for a concurrent queue, than almost everything else combined. Between iterators and interior remove, that's probably 80% of our concurrent queue code.

Sure. The code is often complex to support moot use-cases. Still, it's leaps and bounds beyond anything in pretty much any other language. Again, the reason is mostly coincidental. From its beginnings, more or less, Java (or Sun) has attracted the most prominent concurrency researchers. Java 1.5 was the first general-purpose programming environment with a memory model, iron-level access to concurrency primitives (fences, CASs), and some world-class GCs, all requirements for state-of-the-art concurrency research. To this day, concurrency researchers prefer the JVM, despite its annoyances (and what language doesn't have those?), to any other platform out there. That's why most concurrent data structures get implemented in Java first, and their Java implementation is usually the most polished and robust.

But to quickly go back to where we started, a higher-level PL often brings about performance improvements. There are famous examples of sorting algorithms in C++ easily outperforming their C implementations. Why? Because of templates vs. function pointers. More precisely, because of inlining, as inlining is "the mother of all optimizations" (it opens the door to other optimizations). Now, of course you could achieve the same in C by writing the sort function many times, but it's cumbersome and error-prone.

The JVM is "the inlining magic machine". It inlines code that a C++ compiler never could. There are some areas where C++ is more performant, namely computations on arrays of complex numbers, but hopefully that will be addressed in Java 9. These inlining capabilities are why the JVM often outperforms C++ even in single-threaded code. Again, you could achieve the same in C++ by writing the same code several times (or by generating machine code at runtime, which is what the JVM does). Also, this effect is usually perceived in larger applications, because that's where virtual methods get used heavily.
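
To illustrate (a made-up micro-example, not a benchmark): HotSpot profiles the call site below, and if it stays monomorphic it inlines the lambda into the loop, much like a C++ template instantiation, whereas a C qsort-style function pointer stays an indirect call per element:

    import java.util.function.LongUnaryOperator;

    class InliningSketch {
        // An indirect call per element in C; here the JIT can inline
        // op.applyAsLong into the loop once the call site proves monomorphic.
        static long mapSum(long[] xs, LongUnaryOperator op) {
            long sum = 0;
            for (long x : xs) sum += op.applyAsLong(x);
            return sum;
        }

        public static void main(String[] args) {
            long[] xs = new long[1_000_000];
            System.out.println(mapSum(xs, x -> x * 2 + 1));
        }
    }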

Anyway, that's pretty much what I have to say on the matter. I have a lot of respect for C++. I was a C++ programmer for many years. The JVM brings many interesting performance improvements over C++, as well as performance regressions in comparison. All in all, Java usually comes up ahead in large codebases, and falls behind in specific use cases (mostly numeric HPC). In addition, there is no other programming language/platform out there that even comes close to the JDK's power when it comes to hardcore concurrency; some successful implementations find their way to TBB a few years later, and some don't. If you're trying to push the envelope on concurrent software development (my day job), Java is pretty much the only sensible way to go.

Still, there are valid reasons to prefer C++ over Java in many circumstances, but sheer performance is seldom one of them.


> Also, GCs have other advantages, such as a much higher memory allocation-reclamation throughput than any general manual scheme

A benefit of using C/C++ is that you are not limited to general schemes. If you have an allocation/deallocation hotspot, you can use more specialized schemes like arena allocation, which will beat a GC handily.

Also, latency and predictability are often more important than throughput. GCs fare poorly on these counts.

> forget about safepoints - C++ doesn't even have a memory model, or didn't until C++11

Aka it does have a memory model. And one of the points Doug Lea made in the talk you linked is that the Java memory model is actually broken and no one knows how to fix it, whereas C++'s is not (or rather, it's only broken in ways that apparently don't affect anyone).

> Also, if you think you're not "giving up control" in C++ (or C; or Assembly), then you're sorely mistaken. You have little or no control over ILP, prefetching, branch-prediction, outstanding cache misses and all the magic modern CPUs do at runtime.

This is a tired and weak argument.

A mispredicted branch costs 5ns. A main memory reference (cache miss) costs 100ns. And you actually have a lot of control over these things; you can make your branches more predictable and the working set of your tight loops smaller so it will fit in cache. And there are actually instructions that give control over prefetching.

A Java GC pause on the other hand is often on the order of 200ms. That's six orders of magnitude in difference, and can be even longer in degenerate GC situations. There is really no bound to the worst case.

The loss of control also comes from the fact that the effects are highly nonlocal. If your C/C++ code is in a tight loop, the microarchitectural effects are mostly a function of your own code and its behavior. But whether a safepoint triggers a GC pause in Java is a function of the entire program's behavior.

So you really can't equate these two things. Java's loss of control is many orders of magnitude larger and less local.

> These inlining capabilities are why the JVM often outperforms C++ even in single-threaded code.

I don't think you've made this case at all.

> All in all, Java usually comes up ahead in large codebases

Coincidentally it is large codebases where it is most difficult to compare apples to apples, because it is so unlikely that you can really get two large programs that do the same thing with the same design. I'm deeply skeptical of this claim.

> and falls behind in specific use cases (mostly numeric HPC).

Java falls behind in tons of cases. It loses in nearly all of the benchmarks game, only some of which are numeric: http://benchmarksgame.alioth.debian.org/u64q/java.php It also loses very badly in this benchmark published a few years ago by Google, which is not numeric: http://static.googleusercontent.com/external_content/untrust...

Given comparable programs, C++ is nearly always faster. You really haven't offered any evidence that contradicts this basic fact. The only benchmark you cited didn't even have any C++ programs in it.


> A benefit of using C/C++ is that you are not limited to general schemes. If you have an allocation/deallocation hotspot, you can use more specialized schemes like arena allocation, which will beat a GC handily.

Absolutely. And you can do the same in Java with about the same amount of effort (lots of Java projects do that).
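
For example, here's a minimal bump-pointer arena over a plain long[] (a sketch; real projects typically do this over ByteBuffers or off-heap memory, and LongArena is a name I made up):

    // Allocation is a bounds check and a pointer bump; the whole arena
    // is reclaimed at once, like freeing a C++ arena.
    class LongArena {
        private final long[] slab;
        private int top;

        LongArena(int capacityInLongs) {
            slab = new long[capacityInLongs];
        }

        // Returns the offset of 'size' freshly allocated longs.
        int alloc(int size) {
            if (top + size > slab.length) throw new OutOfMemoryError("arena full");
            int offset = top;
            top += size;
            return offset;
        }

        long get(int i) { return slab[i]; }
        void set(int i, long v) { slab[i] = v; }

        // Frees everything at once.
        void reset() { top = 0; }
    }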

> it does have a memory model.

It does, and it probably fixes some mistakes in the JMM, but C++ got it, what, 10 years after Java? That's about how far behind C++ is in terms of concurrency. I know this because that's my business (concurrent data structures and high concurrency on many-core hardware in general): you really can't do C++ well on a many-core system; it doesn't scale unless the workload is embarrassingly parallel.

> A Java GC pause on the other hand is often on the order of 200ms.

It could actually be much worse. Fortunately, you can get a commercial GC that offers exactly 0 pauses (in fact, I suggest my customers do exactly that if they need 0 pauses but fall short of true hard real-time; for hard real-time there's real-time Java, with built-in arena allocation and correctness guarantees of no pointers from the heap into the arena).

> Java falls behind in tons of cases.

I know I won't be able to convince you, but those benchmarks are completely irrelevant. I can easily make any single-threaded algorithm perform better in C/C++. But, as I said at the beginning, Java comes out ahead when you have a large codebase with many team members, which requires software engineering that precludes good optimizations. This realization is the result of many years of experience with both Java and C++. I don't have much hard evidence (some, but it's possibly anecdotal and also proprietary), but in our field, as in medicine, sometimes hard-earned first-hand experience means a lot.



