Pessimism about parallelism (ibiblio.org)
100 points by ingve on Dec 26, 2018 | 118 comments



This pessimism about whether software can effectively use CPU cores beyond two is easily refuted with a simple observation: High-end games are now running pretty close to 100% CPU utilization on octa-core console hardware.

Parallelizing our software is mostly a question of investment at this point. In my experience, there isn't a lot of software that wouldn't benefit from CPU parallelism in some way or another. The bigger issues are (a) that a lot of software simply doesn't prioritize performance, for both good and bad reasons, and (b) that a lot of foundational software has been in maintenance mode for years and there isn't enough interest or funding to improve it.


Yeah, what it takes to really, truly stop parallelism is an unbreakable linear chain of dependencies, like how LZ decompression needs the previous unpacked block to begin work on the next. (Not that LZ unpacking is maxing out people's CPUs these days.) That certainly exists sometimes but isn't nearly common enough to make me doubt the whole enterprise of parallelizing things.

The good news is that using all cores is one of the top things you can do to get work done sooner, and as such we probably will see investment happen. My work is pretty different from AAA games, but parallelizing a big task is often lower-hanging fruit than tuning the code for a single unit of work. We've paid for the whole machine (or cluster), so we might as well use it.

Browsers are an interesting instance because they have a bunch of heterogeneous tasks with weird dependencies between them, and are huge, old codebases where you have to be careful adding threading, but they still manage to make real use of threads (e.g. running expensive compilers, decoding images, rendering, etc.).

Suspect you're saying ~this re: "foundational software", but as languages, tools, and libraries keep adapting to modern hardware, I think using parallelism confidently will become easier and we'll see more of it. Ecosystems change slowly but they do change.

ESR's claim about locality in particular seems iffy. A common task with very poor locality but useful parallel implementations is garbage collection: it thrashes your CPUs' caches, but thrashing all of them at once can still get it done sooner.


I definitely agree that parallelism isn’t held back by fundamentally serial work. Even with LZ decompression, most algorithms entropy encode the LZ sequences, and these sequences can be decoded in parallel with the LZ decoder.


Even if you couldn't parallelize decompression (non-entropy encodings?), there's nothing stopping you from breaking data up into multiple streams and separately (de)compressing those. For games, this might be decompressing different assets simultaneously.


FWIW, my aim was to give an illustrative example of what I meant by a chain of dependencies, and your garden-variety gunzip foo.gz has one. You can certainly come up with compression schemes or situations where not everything has to be serial.


Modern GCs do a lot to improve performance by using temporal locality, not just the memory layout / access pattern locality that ESR was primarily interested in. I am no expert, but we now have multi-generation, multi-phase garbage collectors that can do useful work in parallel and have minimized the need for stop-the-world processing (one of ESR's intrinsically serial processes, at least from the executing program's perspective, as it may be an internally parallelizable process).


Can't tell if you're agreeing or disagreeing with this, but what I was trying to say is that full GCs happen (even when generational lets you delay them), thrash your caches to trace all live pointers in your heap, and still run faster on multiple cores. That seems to be in tension with the blog post's suggestion that you need locality for parallelism.

Many recent GCs don't actually try to use all your cores because they run concurrently with your app and want to leave a lot of capacity for it, but Go 1.4's GC (the last non-concurrent version) used all the cores it could, and V8's Orinoco does in some phases too.


> Parallelizing our software is mostly a question of investment at this point.

This just isn't true - for some algorithms we are simply not aware of any way to effectively parallelise. No amount of investment is going to suddenly come up with a solution to that problem.

We're just lucky that games are easy to parallelise.


>This just isn't true - for some algorithms we are simply not aware of any way to effectively parallelise.

That just means we need new algorithms, different habits, and languages which make concurrency more accessible.

The vast majority of research into algorithms and computer science in general over the past 75 years has focused on single-threaded linear computation.

I think the thing we can do right now is spin up threads as often as possible. It should be your default approach when writing a new function to consider how to make it functionally pure. If you can, spin that out into a new thread (depending on the overhead in your language). Even if you can't make it functionally pure, there's still a good chance you can write it in a way that prevents a potential deadlock or race condition.
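
To make that concrete, here is a minimal Rust sketch (the `checksum` function is a made-up stand-in for any pure function):

  use std::thread;

  // A pure function: its result depends only on its input, and it touches
  // no shared state, so running it on another thread is mechanical.
  fn checksum(data: Vec<u8>) -> u64 {
      data.iter().map(|&b| b as u64).sum()
  }

  fn main() {
      let data = vec![1u8; 1_000_000];
      // Hand the work to a worker thread; `move` transfers ownership of `data`.
      let handle = thread::spawn(move || checksum(data));

      // ... do other work here ...

      let sum = handle.join().expect("worker thread panicked");
      println!("checksum = {sum}");
  }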

My argument is basically this:

1) we will find more algorithms that are better at exploiting concurrent resources.

2) as programmers we need to change our default behavior from writing linear programs and then trying to make them take advantage of concurrent hardware to writing concurrent solutions at every level of application design by default from the ground up.


> as programmers we need to change our default behavior from writing linear programs and then trying to make them take advantage of concurrent hardware to writing concurrent solutions at every level of application design by default from the ground up.

With parallelism in particular, this is quite easy. Haskell's accelerate library allows you to write code using maps/folds which then runs on multiple CPU cores or a GPU (J does something similar, but only on the CPU).

It turns out that maps and folds can already handle quite a lot.
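
As a rough Rust analogue of that map/fold style (using Rayon, so this illustrates the idea rather than accelerate itself):

  use rayon::prelude::*;

  fn main() {
      let xs: Vec<f64> = (0..10_000_000).map(|i| i as f64).collect();

      // A map followed by a fold (sum); Rayon splits the work across cores
      // with no explicit threading code.
      let sum_of_squares: f64 = xs.par_iter()
          .map(|x| x * x) // data-parallel map
          .sum();         // parallel reduction

      println!("{sum_of_squares}");
  }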

With concurrency - well, concurrency is hard, and many basic data structures are in fact quite recent!


A lot of this discussion is recapitulating the discussion around https://en.wikipedia.org/wiki/Gustafson%27s_law , no?


Most software is not one algorithm. Most software is a complex system which can be parallelized in a variety of ways, some of them embarrassingly trivial.


I also think most software doesn't run for example 32 independent algorithms at the same time, so some of them will have to be parallelised to use the large number of cores we're getting.


We had decades of lazy game developers not using more than one core, and CPUs kept up with the demands of those game developers. Consider that games are one of the most performance-sensitive applications an average user is going to come in contact with. How on earth are you going to find a single inherently sequential problem that is enough to saturate an entire CPU core, when even games, which consist of thousands of divisible sub-problems and therefore are easily parallelizable, were fine with just a single core?

I can think of hundreds of inherently sequential algorithms we are using daily but nevertheless use an insignificant amount of CPU time.

Let's take binary tree traversal. Assuming a 100ns access penalty for every pointer dereference, we'd need a binary tree with a depth of 10,000,000 (and hence on the order of 2^10000000 elements) to saturate a modern CPU core with a single query per second. That's quite a large dataset, and I'm sure that even if you converted the entire universe into RAM you could never store that much data.
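
Spelling out that back-of-the-envelope arithmetic (assuming one dependent pointer dereference per tree level, at 100 ns each):

  \frac{1\,\text{s}}{100\,\text{ns per access}} = 10^{7}\ \text{dependent accesses}
  \;\Rightarrow\; \text{depth} \approx 10^{7}
  \;\Rightarrow\; \text{nodes} \approx 2^{10^{7}}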


Yeah, but that just doesn't come up that much as an actual limiter. Say you have a graph search algorithm that is provably non-parallelizable. OK, fine. If you only have 1 graph to search, you are stuck. But likely you have many graphs you can search and spread them around.


Well, if you mean concurrency over parallelism, I think it's quite clear that there are many tasks that benefit from concurrency.


You don't need concurrency to create parallelism, and it's the parallelism that we want, so no we don't mean concurrency.


Utilization is a bad metric; e.g. you may spin cores to acquire a lock.

How much of that utilization is doing useful work is important.


I mean, for many games you have the rendering engine, network code, audio, hard drive I/O, and game logic all running async to each other. A lot of game logic can then be further parallelized.


Agreed. Maxed out cores doesn’t mean progress is being made, literally or figuratively.

A breakdown of how that budget is spent would be a productive addition to this discussion.


I'll add that it can help to use languages that make expressing parallelism both easier and the default. Cray's Chapel is a good example with lots of versatility:

https://chapel-lang.org/


Personally I'm a huge fan of J for this purpose.

Haskell also has some nice qualities, though one has to use something like accelerate or repa.


What kind of foundational software are you referring to? Honest question.


libpng, to name one example off the top of my head. It doesn't even have support for SIMD on x86 yet, in 2018.


libpng should be theoretically capable of using one and a half cores per image decode. But unfortunately one of the compression techniques in the spec introduces a hazard between each line of the image, because the pixels at x-1 and y-1 are used to predict the value of the next pixel. If that algorithm is used, it would take some very interesting math to decode multiple lines in parallel.
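
As a sketch of that hazard, here is roughly what a PNG-style "Average" filter does for a hypothetical 1-byte-per-pixel image (an illustration of the data dependency, not libpng's actual code):

  // Each reconstructed byte depends on the byte to its left (x-1) and the
  // byte above it (y-1), so row N can't start until row N-1 is finished.
  fn unfilter_average_row(prev_row: &[u8], cur_row: &mut [u8]) {
      for x in 0..cur_row.len() {
          let left = if x > 0 { cur_row[x - 1] } else { 0 };
          let up = prev_row[x];
          let predictor = ((left as u16 + up as u16) / 2) as u8;
          cur_row[x] = cur_row[x].wrapping_add(predictor);
      }
  }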


I believe games were one of ESR's special cases, being highly parallelizable. On the other hand, most of the software in daily use isn't CPU-bound on one core.


games is a special case, video editing is a special case, compiling code is a special case, running a development environment with a database, web application, and browser is a special case,

rendering a web page is a special case

lot of special cases!

I know that for my day-job software development machine the point of diminishing returns is at 4 cores or more (haven't tried more yet!). 2 cores is terrible.

However, when I scale down and am NOT running two databases and a web server along with Visual Studio, then 2 cores is fine. Also, when I do that I am making a game that makes effective use of all the cores, and it wasn't hard (granted, it's another special case).


I don’t see what this proves or disproves. The article points out that problems with clearly differentiated local data are amenable to parallelization, and potentially only such problems. Sounds like a lot of game visuals.


> High-end games are now running pretty close to 100% CPU utilization on octa-core console hardware

That's concurrency, no?


No, that's parallelism, because the cores are running in parallel.

The software is written using concurrency techniques such as threads, that work on both multicore machines and machines with a single core.


100% CPU on an octacore? See, ESR was right, it's bottlenecked by serial performance. Now 800% CPU utilization would be impressive!


A multicore CPU comprises several cores. I don't think fully utilizing all cores means "800% CPU utilization"


Oddly enough the load average, as reported by uptime and xload, uses just this metric: a machine with one CPU pegged has a load of 1.0. A fully utilized octa-core machine has a load of 8.0.


> Your odds of being able to apply parallel-computing techniques to a problem are inversely proportional to the degree of irreducible semantic nonlocality in your input data.

That's not right. GPUs do pretty well at tasks that have poor locality compared to CPUs, as long as there's a lot of parallelism to be exploited -- see for example ray tracing.

The reason for this is that CPUs depend on their large caches, which are useless when there is little locality, while GPUs simply switch to another warp while data is being loaded from memory, and they have much higher memory bandwidth anyway.

Also, GPUs are distinctly not systolic arrays. Hennessy's "Computer Architecture: A Quantitative Approach" has a good primer on Single Instruction, Multiple Threads (SIMT) architectures typically seen in GPUs.


SIMT architecture is why GPUs are actually not that great for ray tracing. In a nontrivial scene, some rays will bounce more times than others, and that means that the whole parallel execution unit is tied up until every single ray terminates. (The other "threads" essentially execute NOP instructions in the meantime.)

Despite what Nvidia would have you believe, a lot of VFX rendering is done on CPUs, not GPUs, and this is one of the reasons why. (For example, last I heard, Pixar does all their render work on Intel Xeons.)


> Also, GPUs are distinctly not systolic arrays. Hennessy's "Computer Architecture: A Quantitative Approach" has a good primer on Single Instruction, Multiple Threads (SIMT) architectures typically seen in GPUs.

Good catch, as pointed out by wikipedia:

"They are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories: SISD, SIMD, MISD, MIMD"

https://en.wikipedia.org/wiki/Systolic_array


I've become a parallelism optimist thanks to Rust.

The borrow checker basically doesn't even have a concept of a single-threaded program. Everything you write in Rust is required to be thread-safe. This makes Rust appear more difficult, but it's just front-loading the effort required to make the whole program thread-safe (which to me felt nearly impossible for non-trivial C programs). From there you can sprinkle Rayon all over the place, send async work over channels, etc.
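
For instance, a small sketch of the "send async work over channels" part (the numbers and names are made up; the point is that the channel payload's Send bound is checked by the compiler):

  use std::sync::mpsc;
  use std::thread;

  fn main() {
      let (tx, rx) = mpsc::channel();

      // Ship a chunk of work to a worker thread. The channel's payload must
      // be Send, which the compiler verifies; non-thread-safe types are
      // rejected at compile time rather than at 3am in production.
      thread::spawn(move || {
          let result: u64 = (1..=1_000_000u64).sum();
          tx.send(result).unwrap();
      });

      println!("worker says: {}", rx.recv().unwrap());
  }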

The piece I'm missing is an "Amdahl's-law-aware" profiler that would show not just where any code spends time, but only where the unparallelized bottlenecks spend time.


Re: showing bottlenecks, the Go latency tracer may be a neat reference point; it doesn't find serial parts automatically, but it shows the execution timeline so that you can see them. Besides serial bottlenecks, it can be useful for seeing where you blocked waiting on I/O and so on:

https://blog.gopheracademy.com/advent-2017/go-execution-trac...

There might already be similar tools for Rust; I know less about what's available there.

(Go folks also added some tools for tracing across RPCs, which is less relevant to what you said but just kinda independently neat: https://medium.com/observability/debugging-latency-in-go-1-1...)


Some of this functionality just doesn’t exist (no way to track gc when you have no gc), some of it uses C/C++ tooling (syscalls can be straced), some of it is just different (suspend points aren’t implicit), and some of it is on the way (tokio-trace is basically the real equivalent here but either doesn’t quite exist yet or is very new, I forget which to be honest).


Rust is nice, but it's missing the ease of parallelism you get from things like Haskell's accelerate or repa.

I prefer it for concurrency, but it's not a panacea. In particular, laziness is necessary for efficient immutable data structures, and immutable structures are a nice way to avoid some of the problems inherent in writing concurrent programs.


Is this actually true? Clojure is eager and has efficient immutable data structures.


Some immutable data structures must be implemented strictly, others lazily.

There are known algorithms/data structures where you either have to use immutability or laziness in order to get an efficient implementation.


Note that lazy/eager is just a default: it just takes more lines of code to write lazy data structures (such as Okasaki's).


Check out the im crate.


There's nothing lazy in there, right?


I don’t 100% remember, but it’s efficient immutable data structures, written by Bodil, who I assume you know.


> Everything you write in Rust is required to be thread-safe

Either you have a poor understanding of thread safety or Rust has solved the halting problem.


Could you explain why solving thread safety is equivalent to solving the halting problem? I suppose you also include the absence of deadlocks into thread-safety? Often it means just freedom from data races.


Freedom from data races is far too weak a condition to be useful.

Consider the following example: you have a single global mutex, and every read/write to memory is bracketed by a lock/unlock of that mutex. The resulting code, while horribly inefficient, is technically free of data races, but that does not really give you anything interesting in terms of thread safety.

Any useful definition of thread safety requires that you can reason about program behavior in the presence of concurrency.

One such approach is the idea of interference-freedom introduced by Owicki and Gries [1]. Somewhat simplified, assuming that you have a thread A with precondition P and postcondition Q, i.e. the Hoare triple { P } A { Q }, a thread B does not interfere with thread A if, for the parallel execution of A and B given the same precondition P, we can still prove Q ({ P } A || B { Q }) for any interleaving of A and B.

Obviously, proving such a property is undecidable in the general case. In practice, we therefore limit ourselves to simpler models that are easier to reason about by constraining how threads can interact with one another (e.g. transactions, actor model), but even then, you can still end up with situations that are undecidable in the general case.

[1] https://dl.acm.org/citation.cfm?id=2697004


>Any useful definition of thread safety requires that you can reason about program behavior in the presence of concurrency.

Theoretically yeah, but there's a lot of practical gain to be had even with that example. Other languages won't tell you if you forgot to use the mutex somewhere.


Okay.

First, I was talking about useful definitions of thread safety. That was language-agnostic, not related to Rust. And if absence of data races were all that Rust guaranteed, then Rust indeed wouldn't be much of a help. What Rust actually does is require you to be (fairly) explicit about threads interacting with each other, which is a much stronger type of guarantee [1]. Rust comes with a default setting derived from the general shared-xor-mutable principle that threads don't mess with each other's data without permission, which makes it easier to reason about thread interaction.

Second, the OP was comparing Rust to actor languages. Actor languages also don't have data races and don't require mutexes (because the memory spaces of actors are disjoint and actors themselves are sequential). In fact, actor languages provide even stronger guarantees, as actors can only interact with each other at well-defined points. "Fearless concurrency" hails back to the 1970s. It is not a new invention, nor is it unique to Rust. It is great that Rust supports it, but I really wish people would stop talking about it being novel and unique.

[1] Which is not to say that Rust's choices don't come with tradeoffs, but then, in the area of concurrency, you cannot realistically avoid tradeoffs. In Rust's case, as with actors, the primary tradeoff is performance for safety.


> It is great that Rust supports it, but I really wish people would stop talking about it being novel and unique.

What I really wish people would learn from Actors is that even if your programming language prevents trivial race conditions over shared memory or message passing primitives, none of that guarantees anything about the behavior of the system you are building once you start to compose those primitives. Gul Agha's dissertation was really eye-opening.


> Other languages won't tell you if you forgot to use the mutex somewhere.

Languages like C++ and D treat this as a library design challenge, not a core language feature. Look at std::lock_guard, for example.


The Rust language doesn't know anything about locking, or even threads.

The thread safety is built using Send/Sync traits, which libraries for threading and synchronization primitives, atomics, etc. implement accordingly.


The way you say "useful" brings to mind Inigo Montoya in Princess Bride: "you keep using that word. I do not think it means what you think it means."

Are there any practical / commonly used programming languages in which you can prove thread safety with that approach for significant programs? You mention it can be undecidable even with constraints like the actor model (Erlang). Maybe you just mean edge cases not encountered in real systems, but I'm guessing it's not practical. Why do you say it's useful?

I see accidental data races a lot more than any other kind of race condition; I'd guesstimate that avoiding them gets you 90+% of the way to thread safety. Yes, you can release and reacquire the mutex in inappropriate spots (i.e., ones during which invariants are broken), but in a typical program there are a lot fewer mutexes acquisitions/releases than data accesses in general, so it's a lot more practical to audit for this. I feel a lot better with a program that has no data races than one which has no guarantees at all (most languages).

Likewise, deadlocks are also a concern, but I'd say there are relatively few spots in your code where you have to be concerned about them. (For starters, they only occur with multiple contended resources (usually mutexes), whereas data race problems can and do happen with zero or one mutexes all the time) And the behavior when they happen is more consistent and far less subtle. The latter is similar reasoning to how people say that a panic is less worrisome than undefined behavior.


> The way you say "useful" brings to mind Inigo Montoya in Princess Bride: "you keep using that word. I do not think it means what you think it means."

Concurrency and parallelism has been what I've been working on for several decades. I have, inter alia, implemented concurrency mechanisms with strong safety guarantees for more than one language (among them two relying on shared memory [1, 2]), so yes, I think I have a pretty good idea what I'm talking about.

> I see accidental data races a lot more than any other kind of race condition;

And I didn't say otherwise. Freedom from data races (with the caveat that there are also benign data races that you sometimes want to exploit and that you want language mechanisms for) is more or less a necessary byproduct of most schemes to reason about concurrency; but on its own, it's woefully inadequate for the reasons that I talked about.

> I'd guesstimate that avoiding them gets you 90+% of the way to thread safety.

If this were the case, then you'd practically never have any problem with thread safety in actor languages; that is not the case.

A simple and extremely common pattern where freedom from data races falls short is the following:

  if (obj->condition())
    obj->action();

If `obj` were just a monitor (i.e. an object that guarantees atomicity for individual operations), the code still wouldn't be thread-safe. What you generally want is thread safety at the transactional level, not at the level of individual operations. Any genuinely thread-safe language requires mechanisms that operate at the transactional level (though, obviously, not necessarily literal transactions in the DB sense), i.e. sequences of operations that you want to be atomic or non-atomic only in a controllable fashion. (Though, at the same time, it is also necessary that any such mechanism provides escape hatches for when it would impede concurrency [3].)
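
A small Rust illustration of that gap (a made-up account example): every individual access is lock-protected, so there is no data race, yet the sequence is unsafe unless one lock is held across the whole check-then-act transaction.

  use std::sync::{Arc, Mutex};
  use std::thread;

  fn main() {
      let balance = Arc::new(Mutex::new(100_i64));

      let handles: Vec<_> = (0..4).map(|_| {
          let balance = Arc::clone(&balance);
          thread::spawn(move || {
              // Data-race free, but NOT transactionally safe: another thread
              // can withdraw between the check and the action, so the
              // balance can go negative.
              if *balance.lock().unwrap() >= 100 { // check (lock released here)
                  *balance.lock().unwrap() -= 100; // act (taken under a new lock)
              }

              // Transactionally safe version: hold one lock across check + act.
              // let mut b = balance.lock().unwrap();
              // if *b >= 100 { *b -= 100; }
          })
      }).collect();

      for h in handles { h.join().unwrap(); }
      println!("final balance: {}", *balance.lock().unwrap());
  }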

> Are there any practical / commonly used programming languages in which you can prove thread safety with that approach for significant programs?

First, I am not saying that being able to formally prove properties is a necessary element of concurrency (though there are systems that do exactly that, but that's beyond the scope of what I mentioned). But even when you're reasoning about program behavior informally, that relies on the same principles. Your code assumes some precondition and wants to arrive at a postcondition and you have to be sure that other threads don't muck up what your code is trying to accomplish.

The larger point is that with just the absence of data races, we still have a combinatorial explosion of ways that threads could interact with one another: most useful concurrency schemes (including Rust's) function by cutting down on that combinatorial explosion and limiting ways in which threads can interact with one another (and thus, possibly, in undesirable ways).

[1] https://link.springer.com/chapter/10.1007/11424529_17

[2] https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.3746

[3] E.g. https://dl.acm.org/citation.cfm?doid=1297027.1297042; while the paper is about software transactional memory, the mechanism extends to critical regions in general.


I suspect the answer is, Rust ignores the general case, and simply doesn't let you do anything that's not provably correct.


You can't use that argument and have Turing completeness at the same time.


Yes, you can. You can emulate a Turing machine (except with a finite memory, of course) in safe Rust, but given that the Turing machine is entirely serial by design, I don't see how is that even relevant.

In practice, Rust has an `unsafe {}` escape hatch that allows building abstractions (like mutexes) for the safe majority of Rust code, using contained bits of as-safe-as-C code where necessary.


> Everything you write in Rust is required to be thread-safe.

I wouldn't put it like that. You can (and almost certainly do) have data structures on the stack or heap that are not thread-safe (by which I mean they suffer from data races if accessed from multiple threads).

I'd say that Rust enforces that stuff that's available to multiple threads be thread-safe. That covers global variables (which are always assumed to be available to multiple threads; maybe this is what you mean by "doesn't even have a concept of a single-threaded program"). It also covers stuff passed to another thread in other ways (passed to the thread at construction time, sent over a channel, etc).

Rust describes thread safety with two traits: "Sync" (a type for which it is safe to share references between threads) and "Send" (a type that can safely be transferred across thread boundaries). The book talks more about these here: https://doc.rust-lang.org/book/ch16-04-extensible-concurrenc...

The neat thing is these are automatically implemented when doing so is safe[1], and the compiler requires them at the key spots. You don't need to rely on some half-trusted comment or other manual annotation declaring something as thread-safe or audit all the using code if you want to make something no longer thread-safe; code just won't compile if it's expecting something to be thread-safe that isn't.

The compiler tells you what the problem is, and in most cases, fixing it is simple: wrap something in a Mutex, replace a Rc with Arc (non-atomic reference count with an atomic one), replace a RefCell with a Mutex or RwLock. If it compiles and doesn't say "unsafe", there's no data race.
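
A tiny example of that workflow (not from the parent comment, just illustrating the Rc-to-Arc fix):

  use std::sync::{Arc, Mutex};
  use std::thread;

  fn main() {
      // let counter = Rc::new(RefCell::new(0)); // rejected: Rc is !Send
      let counter = Arc::new(Mutex::new(0));     // compiles: Arc<Mutex<T>> is Send + Sync

      let handles: Vec<_> = (0..8).map(|_| {
          let counter = Arc::clone(&counter);
          thread::spawn(move || {
              *counter.lock().unwrap() += 1;
          })
      }).collect();

      for h in handles { h.join().unwrap(); }
      assert_eq!(*counter.lock().unwrap(), 8);
  }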

This is what people mean when they say Rust has "fearless concurrency". I'm not aware of any other language that guarantees lack of data races without giving up mutability or shared state.

(There are limits to the fearlessness, of course. You can still have data races by doing something dumb in code marked "unsafe". And Rust doesn't protect you against deadlocks at all, so you need to consider lock ordering and such as you would in most other languages. You can still have contention problems because your locking is too coarse. etc.)

[1] when you're doing something unsafe you might need to manually declare something as Send and/or Sync because the compiler can't figure that on its own that doing so is safe. I've encountered this when doing FFI; another case would be when implementing synchronization primitives. Most code shouldn't have to deal with this.


Assertions about the difficulty of implementing effective parallelism reflect assumptions about software architecture that are common but not universal. The tools most people reach for tend to be either multithreaded locking of some type or a functional "everything is immutable" style, neither of which scales well on real hardware. Complex applications that do use large numbers of cores efficiently use a software architecture that was originally developed on supercomputers but is equally suited to ordinary machines with many cores: fine-grained latency hiding. In practice, this means extremely cheap logical concurrency, like proper coroutines, combined with a latency-aware scheduler (the harder part). This may not be "parallelism" in some strict sense but it has the same effect in terms of utilization and is sometimes used on supercomputers to maximize throughput.

Programming languages often have some level of support for the necessary primitives, such as goroutines, but then hide them behind an unsuitable scheduler. On the machines where latency hiding was originally developed as an architecture (notably Tera's MTA), optimal scheduling was done by the hardware itself, which had perfect visibility into relative latency with no overhead. Things are not that simple on x86/ARM. If you want to implement effective latency hiding, you'll be implementing an equivalent scheduler in software, and often by inference, because sometimes the only way to know the true latency of something is by doing it. I don't think it is that difficult, but scheduler design is not something most programmers know how to do well -- it is magic buried in operating systems and libraries. Nonetheless, you see more examples of this kind of software design for performance reasons.

There are very few real-world applications that are so serial that latency-hiding is not effective at using all available cores, particularly in the server realm where the workload is inherently concurrent. The original target application for latency-hiding architectures was scaling graph analytics, which are often not amenable to more classical embarrassing parallelism.


My understanding of AAA game engine coding concurs with this analysis. It's become increasingly common to get more frames of graphics rendered in the same time span by kicking off the next frame before the first has finished, which can be extrapolated to a preconfigured buffer of many frames latency, similar to what is seen today with digital audio. Since game programmers already have to breathe concurrency problems just to achieve real-time stateful systems, and even more so to give an online multiplayer experience, this isn't a huge leap: many issues of this type have been resolved from the beginning with a double buffer or an intentional frame of latency.

That said, the benefits of scaling up runtime game performance are diminishing as asset costs increasingly dominate the discussion: the extra scene detail comes with a price to the production cycle, and that may be what breaks the games-graphics-tech alliance before any fundamental limits to the technology are reached.


See Halide, a latency-hiding scheduler/compiler for GPUs and CPUs...

Halide: a language for fast, portable computation on images and tensors http://halide-lang.org/


> 3. Consequently, most of the processing units now deployed in 4-core-and-up machines are doing nothing most of the time but generating waste heat.

While it's true that most of the time a 4-core or more machine has idle cores, I'm not convinced this is a problem.

Yes, most of the time spent on a modern computer consists of doing things that aren't particularly computationally expensive. But our perception of the speed of a personal computer isn't based on some kind of steady state performance on batch jobs, I think that worst-case latency is a lot closer to how we perceive the speed of a computer. Of course, in specialized domains like HPC the steady state performance on batch jobs is important, but this article seems to be discussing personal computers.

Even if they come up rarely, you still hit computational bottlenecks at times when latency is important. Whether it's some bloated web page, some big Excel computation, a compilation for us programmers, or the like, there are plenty of times where you hit some kind of bottleneck in which you can make effective use of all of your cores, so even if 99% of the time you have idle cores, that 1% of the time can shape your perception of how fast your computer is.

And the fact that most of your CPUs are idle most of the time doesn't mean you are wasting power. Modern CPU cores conserve power when idle, via P-States (frequency scaling) and C-States (turning off idle cores).

It is probably true that there is a lot of software out there which could take better advantage of parallelism but doesn't, in some cases due to performance not being prioritized, in some cases due to some essential complexity of parallelizing the code, but also in many cases due to accidental complexity of parallelizing code due to unsafe languages like C and C++ that can make it quite difficult to refactor serial code into parallel without worrying about introducing significant bugs.


I'm more optimistic about parallelism, and multi cores in general.

Multi cores could have other advantages than parallelism. On mobile, if there's 8 cores, and if the device/os is able to figure out the current load, and based on that turn off up to 7 of those 8 cores, the energy savings could be significant on the CPU side. (Depending on how much energy CPU uses compared to the radio or screen on a smartphone)

Another obvious advantage, as pointed out at the end of the article, is individual processes running independent work next to each other. Such embarrassingly parallel work can keep all the CPUs busy pretty well.


>>We can sum this up with a heuristic: Your odds of being able to apply parallel-computing techniques to a problem are inversely proportional to the degree of irreducible semantic nonlocality in your input data.

That's only true at the very high end, where you have so much data that it's not all physically close and transport time is relevant.

With the reality of stalled clock speeds and multi-core systems, people are finding the parallelism in many applications. Scaling really appears to be coming to an end - if not at 7nm then at 5 (or what they'll call 5). I don't think we're going to meaningfully exhaust our ability to exploit parallelism before then. They're not used in every application but they are used by enough things that we should have them.


The problem is the inverse: there is not enough parallelism in hardware. Single-digit core counts really are not that significantly parallel, especially when main memory bandwidth is shared. An AVX2 code transform will give you more performance than an 8-core hyperthreaded code transform, without all the latency and coordination.


On the other hand, Intel tried lots of (slightly slower) cores with the Phi series, and part of the reason for dropping/sidelining it was that although they are an interesting platform for experiments, nobody could work out how to effectively use that many parallel cores with all of the other constraints, like memory bandwidth.


I highly doubt that the scientific community "could not work out how to effectively use that many parallel cores", given that the gaming community effectively used the SPE cores of the PS3. Those cores are considerably more difficult to use, as they only have a small local memory and have strong alignment constraints. If the Phi does not have enough bandwidth to service its cores, that would simply mean it is not a parallel architecture.


> 3. Consequently, most of the processing units now deployed in 4-core-and-up machines are doing nothing most of the time but generating waste heat.

The cores are not being wasted because we don't know how to parallelize computations, as is evident from the scientific software run for high-performance computing applications, which scales up to millions of cores.

Business decisions are responsible for this waste. For example, a lot of engineering software is licensed per core, so if you are using it on 16 cores, it costs A LOT more than using it on 4 cores. In some cases the licensing costs of the engineering software are higher than the cost of the hardware. So no one cares if a few cores are wasted.


This would make it a much better idea to run on a machine with only the cores you are using, though, yes?


At some point you want those 16 cores so MS Word will have an acceptable performance, even if your auto-routing electrical CAD uses just 1 of them to solve its own NP-hard problem.


I'm less pessimistic. IMHO an increasingly big factor is the amount of single-core hardware out there that people have tended to optimize for, at the cost of not properly utilizing multi-core machines for the past decade or so. As quad-core CPUs are becoming more or less standard, this is changing already. With 16-32 core CPUs becoming more common, the pressure is on to utilize those as well.

The use cases are much broader than gaming. I'd argue any kind of creative work involving graphics, audio, or video already utilizes any amount of CPU you can throw at it. Also, software like Firefox is leveraging multiple cores with its ongoing Rust rewrite. I tend to use multiple CPUs when running software builds. There is a lot of UI code that these days is written asynchronously from the ground up, meaning utilizing multiple CPUs is less of a big deal than it used to be.

But yes, you need languages, frameworks, drivers, operating systems, etc. that can facilitate this. I'd say there is quite a bit of room for improvement here still.


Most desktop software does not use more than 1 core because most desktop software does not need anything near 1 full core. How much processing can it take to show you a button?

That does not change the fact that some software does need more, and people are wise to dimension their computers taking the largest usual load into account, not the average one.

That said, the point about memory locality is good. It is just not as important as the author makes it out to be; data tends to have locality, even physics pushes it there. But the point about non-parallelizable algorithms isn't good; who cares about the properties of your algorithm? If the problem allows for parallelism (with local data), dump the algorithm and get one that scales.

Also interestingly, the author does bother to link the limits to the von Neumann architecture, but didn't think about using the extra cores to JIT-compile the binaries into some other architecture...


> Most desktop software does not use more than 1 core because most desktop software does not need anything near 1 full core. How much processing can it take to show you a button?

It seems like this ought to be true but in practice it's not, though the problem appears to be more due to developers not paying enough attention to performance than to fundamental limitations of the hardware. I can think of very few applications that I use on a day to day basis that do not have multiple areas where I wish they performed better.


I'm not sure what the takeaway here is. If desktop users don't need (or can't use) more cores, then don't get them - and call an end to the periodic performance improvements, because CPUs won't get faster without more cores. Maybe that's fine.

On the other side, high performance computing demands more computational power, and the only promising path forward (at this time) is parallelism. I'm not motivated by optimism, but by pragmatism.

As a sidenote, I don't understand this section:

> We look at computing for graphics, neural nets, signal processing, and Bitcoin mining, and we see a pattern: parallelizing algorithms work best on hardware that is (a) specifically designed to execute them, and (b) can’t do anything else!

The same hardware is used for all of these things, and it was only (arguably) designed specifically for graphics (and even there "specific" is not certain).


"and the only promising path forward (at this time) is parallelism."

Don't forget FPGAs, especially if bundled with a CPU. They still partly work by parallelism, but I think reconfigurable hardware deserves its own mention.


> We know this because there is plenty of real-world evidence that debugging implementations of parallelizing code is worse than merely _difficult_ for humans. Race conditions, deadlocks, livelocks, and insidious data corruption due to subtly unsafe orders of operation plague all such attempts.

None of these problems have anything to do with parallelism, but everything to do with "shared memory multithreading". There are plenty of sane concurrency model implementations out there. Which brings me to this part:

> But the truth is, we don’t really know what the “right” language primitives are for parallelism on von-Neuman-architecture computers.

We do know what the right language primitives are. Well, at least if you follow concurrency and distributed systems, where it's not even an important question anymore: you just stick to the actor model as a default choice and stop caring about these things.


>Race conditions, deadlocks, livelocks

These have nothing to do with shared memory multithreading. You can easily have all of these with shared nothing distributed systems.

>and insidious data corruption due to subtly unsafe orders of operation

I'll give you this though.


That's stretching it quite a bit. You can implement shared memory multithreading on top of actor model and have all of the same issues. If you don't do that though, you won't have those issues. Same way that you can implement memory unsafe primitives on top of memory safe ones.


I have seen deadlocks and race conditions in non shared memory concurrent systems many times. They are no way exclusive of shared memory systems or even more common. They have nothing to do with shared memory and everything to do with message ordering.


It doesn't have to be shared memory specifically, but it has to be concurrent use of something shared even if that something is completely virtual and only exists in your mind. Which then forces you to synchronize access to that something and deal with all those problems. It's not about message ordering though, not unless synchronization is involved.


Depth First Search is not parallelizable at all? Provably so? Just divvy up subtrees.


I'm assuming they mean actually Depth First, where splitting up a tree and traversing branches in parallel violates the "First" part.

It's a bit contrived, because obviously the name implies that it's sequential. But yeah, sometimes that's precisely what you need.


Yes. I guess the divvying up is more like breadth first. In most cases where I would use DFS, this would work.


Nothing that processes nodes in parallel can guarantee either correct depth first or breadth first ordering.

That said, it's easy enough to get effectively correct ordering by reassembling the results in the correct order as the workers finish.
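
A sketch of that reassembly idea with Rayon, assuming an in-memory tree (`Node` is a made-up type):

  use rayon::prelude::*;

  struct Node {
      value: u32,
      children: Vec<Node>,
  }

  // Explore the subtrees on different cores, then stitch the per-subtree
  // results back together in child order, reproducing the sequential
  // depth-first preorder.
  fn preorder(node: &Node) -> Vec<u32> {
      let per_child: Vec<Vec<u32>> = node.children
          .par_iter()    // subtrees handled in parallel
          .map(preorder) // each subtree is still visited depth-first
          .collect();    // indexed collect keeps the children's order

      let mut out = vec![node.value];
      for mut chunk in per_child {
          out.append(&mut chunk);
      }
      out
  }

  fn main() {
      let tree = Node {
          value: 1,
          children: vec![
              Node { value: 2, children: vec![Node { value: 3, children: vec![] }] },
              Node { value: 4, children: vec![] },
          ],
      };
      assert_eq!(preorder(&tree), vec![1, 2, 3, 4]);
  }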


First thing I do when someone says something is impossible is DuckDuckGo it to see if it's been done. The second I read those claims, I DDG'd each one with "parallel" in quotes to narrow it down. Already submitted one to Lobsters. The DFS paper I haven't read yet. There are definitely opportunities in prior work in some of them, though. Y'all try it to see what you find.

http://iss.ices.utexas.edu/Publications/Papers/ICS2016.pdf

https://repository.upenn.edu/cgi/viewcontent.cgi?article=144...


What he is presumably talking about here is lexicographic DFS (as opposed to an unordered DFS), i.e. a DFS where the traversal order matters. To be precise: given a rooted digraph G with a set of vertices, an ordered adjacency list for each vertex, and two vertices a and b in G, does a DFS traversal of the graph that follows each adjacency list in order, visit a before b or b before a?

In practice, what we are interested in in particular is both a reproducible preorder and postorder ordering based on the DFS traversal, which is of use in other algorithms (such as Tarjan's algorithm for strongly connected components [1]).

This problem is P-complete and, under the assumption that P != NC and the parallelization assumptions underlying NC [3], is difficult to parallelize.

So far, this is not very controversial, but in practice, NC does not properly reflect what we think of as "inherently sequential"; especially the assumption of a polynomial number of processors is a bit odd when you're not dealing with hardware circuits.

On top of that, we often aren't interested in all possible graphs, but only specific types of graphs.

As a consequence, yes, lexicographic DFS can be parallelized for many use cases or in a fashion that we care about. As a simple example, lexicographic DFS for trees can be parallelized efficiently.

More generally, lexicographic DFS is already extremely fast if we have an existing in-memory representation of the graph (linear in the size of the representation with a small constant factor). It may take longer to construct the graph than to calculate a depth-first ordering of the vertices. In this case, we may not want to parallelize the DFS algorithm as such, but organize the construction of the graph so that we can run DFS in parallel with the construction.

This is of particular interest if we can't fit the graph in memory or if computing the edges of the graph is expensive. For example, we may not be able to parallelize the search as such, but we can instead parallelize the neighbor expansion.

The biggest issue here is that lexicographic DFS is a rather specialized problem in the family of graph search problems, and many, many interesting types of graph searches can be parallelized just fine. For example, we can easily construct a parallel version of the A* algorithm; it will not explore the graph in precisely the same order, but as we are dealing with an algorithm guided by a heuristic, this is rarely a problem in practice.

[1] Though we have, contra ESR, parallel algorithms for strongly connected components in directed graphs that do not require a DFS ordering for the entire graph [2].

[2] See the previous work section of this thesis, for example: https://www.cs.vu.nl/~wanf/theses/matei.pdf

[3] https://en.wikipedia.org/wiki/NC_(complexity)#The_NC_hierarc...


Well maybe not in desktops, but don't you think that mobile devices (even micro controllers, like ESP8266) have a lot to benefit from multi-core, even for just regular apps that have nothing to do with number crunching/gaming etc.?


There's probably still plenty of scope for increasing single-core performance on devices like that.


Well, here’s the thing though: I don’t feel like parallelism is just about speed. I for one like the idea of being able to do 10 things at once, with all tasks running at the same speed. It’s hard to do with current software, yes. But there’s a lot of room for improvement, no?


If you make the processor 10x faster, then you can effectively do 10 things at once without having multiple cores.


This isn't quite true. There's a lot more to CPU performance than how fast it can process instructions. Someone configured Intel's 9900K as 8 cores at 500 MHz and compared it to 1 core at 4 GHz. The actual performance characteristics are vastly different. If I had to choose between 10x more cores or 10x faster cores, I'd choose more cores.


I'm having trouble imagining a performance benefit to two (otherwise identical) 1ghz cores over one 2ghz core, unless:

1. The two processes cause frequent evictions of the CPU cache in some way that two cores can mitigate by having separate caches.

2. You are able to pin a process to a CPU core, and eliminate all the overhead of running the process scheduler.

I would be curious to see the benchmark you're referring to. My gut tells me that the performance differences come from comparing different processor families.


The faster your core, the more sensitive you are to latency, even if you have infinite throughput. Say your DRAM has a latency of 65ns - on a single 2ghz core, an L3 cache miss is going to be "twice" as expensive in terms of clock cycles (130) as it would be on a 1ghz core (65).

So, to take a super contrived example, say you have a parallel program that needs to run 2B cycles worth of instructions with 1M L3 cache misses. On a single 2ghz core that might take:

(2B cycles + 130 cycles / cache miss * 1M cache misses) = 2130 M cycles (1.065 seconds @ 2ghz)

On a dual 1ghz core, both sharing a single L3 cache of the same size (so same number of L3 cache misses), that might take:

(2B cycles + 65 cycles / cache miss * 1M cache misses) = 2065 M cycles (1.0325 seconds @ 2x1ghz)

Which is slightly faster. As long as you're bound by latency rather than throughput, and can perfectly parallelize the problem, more cores instead of more raw clock speed will win out, even with identical hardware and the two cores actually sharing some stuff (the L3 cache) and ignoring the bonuses of fewer L1/L2 cache misses (because they aren't shared, "doubling" your L1/L2 cache.) The machine I'm typing this from has 4MB of L3 cache and I frequently deal with hundreds of gigs of I/O. Suffice it to say, there are a lot of L3 cache misses.

Moving away from identical hardware - slower cores can get away with less speculative execution / shorter instruction pipelines, which make things like branch mispredictions cheaper in terms of cycle counts as well. This is why modern GPUs end up with hundreds of cores getting up into the 1Ghz range or so. They deal with embarrassingly parallel workloads, and can get better performance by upping core counts than they can by upping clock speeds.


It seems to me that the speed increase in your example comes from having twice the cache and therefore half the misses. You can do that without adding cores. (Barring edge cases where two threads use different pages that happen to share the same cache line.)

As for reducing the need for speculative execution: You're just replacing implicit parallelism with explicit parallelism. Whether or not that is a win depends on the workload and the competency of your developers. :-)


> It seems to me that the speed increase in your example comes from having twice the cache and therefore half the misses.

My math assumed a shared L3 cache of the same size: no "twice the cache", no "half the misses". Same number of misses, but each miss is effectively half as costly, because it's only stalling half your processing power (one of your two 1 Ghz cores) for X number of nanoseconds instead of all of it (your single 2 Ghz core).

> As for reducing the need for speculative execution: You're just replacing implicit parallelism with explicit parallelism.

Yes and no. If you're already explicitly parallel (more and more common), you're just eliminating a redundant implicit mechanism that's based on frequently wrong branch prediction heuristics. One can optimize for those, but then one can argue just how "implicit" it really is...

> Whether or not that is a win depends on the workload and the competency of your developers. :-)

Of course. But there's a lot of embarrassingly parallel work out there, and improving building blocks, that don't take a genius to operate. Pixel shaders, farming video frames out to render farms, compiling separate C++ TUs... these are all things already made explicitly parallel on my behalf. Really maxing out the performance of a single modern ~4 GHz core is no simple feat either.


>Same number of misses, but each miss is effectively half as costly

Wow, I'm still trying to wrap my head around that.

I was going to argue that in cases where there's no way to split an algorithm across multiple threads without extremely frequent IPC, pipelining can probably get better performance than multiple cores, but then I realized that I have absolutely nothing to base that claim on.

I still wouldn't opt for more cores rather than higher clock speed without first trying to benchmark my expected workloads, but you have given me some very interesting things to think about.


I think this is the source: https://www.youtube.com/watch?v=feO54CBUJCk The whole point of this test is that it's the same processor just configured differently, so there's no architectural or even silicon differences.


For 99% of problems, a 10x faster core will beat 10x the cores. The trade-off is of course different: for the same transistor count/power envelope you can get more than 10x the cores.


In that case, an important question might be whether we can actually make it 10x faster.

Given a point on the Moore's law curve, can we actually make a single core that's as fast as 10 separate cores? If yes, will that be more power, space and cost effective?


Seems like the solution is "the Unix way": use a lot of little, single-purpose applications for completing larger tasks. That way you can let the OS worry about the parallelization.

Even if you're making a large, multi-purpose app like, say, a spreadsheet, you could still subdivide zillions of little tasks into their own applications. On an OS where spawning new processes is expensive this makes little sense, but on Linux it'd be great.


> you could still subdivide zillions of little tasks into their own applications.

This is the approach Rust+Rayon are taking. Rust is in a good position to enable this because of language-level guarantees. Using "a lot of little, single-purpose" tasks means that each task has to own its memory; Unix of course enforces this separation, but this comes with some overhead. Language-level discipline can achieve much the same thing, more efficiently.


I've used satisfiability solvers, but I don't know enough about their internals to evaluate ESR's claim that satisfiability is intrinsically serial. ESR's argument seems to hinge on this idea, that regardless of how well you handle parallelism, it's not going to help with certain tasks.


If I had to write a SAT solver to use lots of cores on an arbitrary boolean formula, I would farm out the work like this:

First, walk the tree to find the variable, X, that occurs most often. There is a weighting issue: occurrences higher up the tree count more.

Then I would create two expressions by substituting X = true and X = false.

Now two cores can race to see which can find a satisfying assignment first. Clearly this division can be applied recursively until all the cores are busy.
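
A sketch of that recursive case split using Rayon (the `Formula` trait is hypothetical, just enough surface for the scheme described above; a real solver would also share learned clauses, pick variables more cleverly, etc.):

  use rayon::join;

  // Hypothetical formula interface with the operations the scheme needs.
  trait Formula: Sized + Sync {
      fn pick_branch_var(&self) -> Option<u32>;        // e.g. the most frequent variable
      fn assign(&self, var: u32, value: bool) -> Self; // substitute and simplify
      fn is_satisfied(&self) -> bool;
      fn is_conflicting(&self) -> bool;
  }

  // Substitute X = true and X = false and let two workers race; recursion
  // keeps splitting until the thread pool is saturated (Rayon's work
  // stealing takes care of the load balancing).
  fn solve<F: Formula + Send>(f: &F) -> bool {
      if f.is_satisfied() {
          return true;
      }
      if f.is_conflicting() {
          return false;
      }
      let x = match f.pick_branch_var() {
          Some(x) => x,
          None => return false,
      };
      let (sat_true, sat_false) = join(
          || solve(&f.assign(x, true)),
          || solve(&f.assign(x, false)),
      );
      sat_true || sat_false
  }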


This is appealing, but I think you're only going to have full utilization for a short time? Many of your partitions of the problem are going to come back unsatisfiable, and then those cores go idle.


I have recently parallelized a 3D flood fill algorithm; it's essentially a graph search on a 3D grid of nodes. More precisely, the problem is known as connected-component labeling: https://en.wikipedia.org/wiki/Connected-component_labeling

The solution is to keep partitioning until you saturate all cores. As soon as all saturated, stop partitioning and compute. As soon as any task finished, resume partitioning.

On the lower level, I used a thread pool. BTW, the infrastructure for that is built into Windows; see the SubmitThreadpoolWork API: https://docs.microsoft.com/en-us/windows/desktop/api/threadp...


All interesting, but if we're talking about typical laptop/desktop users, beside the point, I think. The reality is that the entire computer's performance is not the bottleneck for most of what a typical user is doing, most of the time. The limitation on performance is either:

1) in the user's head (e.g. a word processor and an author)

2) in the network the computer is attached to (much of what is done on the web)

Getting faster/better performance out of your laptop/desktop won't help any, for the reason that your computer's performance is already not the limiting factor for most of what a typical user is doing. Although, this does reinforce the point that adding even more cores to a typical user's laptop/desktop isn't really accomplishing much, but for much simpler reasons than the issues discussed in the article.


3) in the storage

4) in the antivirus insisting on chewing each piece of IO very slowly in case it might have a virus


This is not the case in practice for most of what I'm doing, most of the time. I'm more of a power user than a typical user, perhaps, but I also have much higher-end hardware than a typical user, and I'm continually frustrated by performance issues that are primarily CPU, GPU or local storage related, although the network being a limiting factor is also common.


> I coined the term “SICK algorithm”, with the SICK expanded to “Serial, Intrinscally – Cope, Kiddo!”

Not only is that awful, but something is only "coined" if other people actually use it. You can't just say "I will coin this!".


"My regulars include a lot of people who are likely to be able to comment intelligently on this suspicion. It will be interesting to see what they have to say."


ESR doesn’t know of the Actor programming model. Interesting. In all honesty, neither did I until 16 months ago. Now I’m transitioning our parallel code to an Actor model.


Check out the Pony language. It's like Rust for Actor-based programming. Wallaroo uses it in their production database, so it's probably decent quality for a new language.

https://www.ponylang.io/discover/#what-is-pony


No silver bullets.


What does he mean by "bad locality"?

Why can't depth-first search be parallelized?


DFS is not parallel? Why not fork at every recursive call?


One thing that is not talked about a lot is parallelism in terms of the kernel and I/O.

There are the application-level aspects, but what about system code?

If you read the FlexSC paper, you can see that by using multiple cores some pretty incredible improvements are possible.

https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...

We should also see some of this with Zircon as it gets further along and looks to leverage cores in a type of pipelining.



