How GPUs Work (virginia.edu)
200 points by luu on Dec 15, 2014 | hide | past | favorite | 53 comments



This overview, while a great start, doesn't really dive into the details of how modern GPUs work. Since 2007, many of the limitations that held GPUs back from being general-purpose computers have been removed (through the relentless efforts of NVIDIA and, to a lesser extent, ATI/AMD, spurred in large part by NVIDIA's traction in the supercomputing space, for example http://en.wikipedia.org/wiki/Titan_%28supercomputer%29). My go-to source for a lot of these developments is AnandTech (http://www.anandtech.com/tag/cuda), but I'm sure there are plenty of other resources others can point to.

Another fascinating bit is that NVIDIA and ATI/AMD have developed what are now the largest general-purpose processors in the world (over 5 billion transistors per chip and counting - available in consumer GPUs for under $300, as opposed to Intel's largest Xeons, which top out at 4 billion and cost $2000+) but are being held back at the 28nm process because their fab partner (TSMC) is oversubscribed by smaller, higher-demand ARM chips that go into phones.


> but are being held back at the 28nm process because their fab partner (TSMC) is oversubscribed by smaller, higher-demand ARM chips that go into phones.

TSMC plans to begin 16nm FinFET production in early 2015, although they're doing it so they can supply Apple and keep up with Samsung (who also supply Apple and have plans for 14nm/16nm).

NVIDIA's Parker is supposed to use FinFET, and they're a customer of TSMC, but Parker will be a 64-bit ARM CPU for servers/mobile devices.


This, from January of last year, suggested 16nm at the end of this year, vs. Intel doing 14nm: http://www.dailytech.com/TSMC+Were+Far+Superior+to+Intel+and...


What do you mean by general-purpose here? Do you not have to use a different programming model anymore?


The parent is referring to how CUDA's introduction enabled developers to write and compile C-ish code to run on a GPU, while previously programmers could take advantage of the power of GPUs for non-render computations by hacking the pixel shaders and such, bending the graphics hardware to do something it was not designed for.

You still have to write your program in a very different way in order to run efficiently on GPUs as opposed to CPUs.


The comparison to a Xeon is what confused me. I didn't think such a development had taken place.


I'd sum it up like this: GPGPU can be made to run any computational code - this doesn't mean that it's necessarily faster, though (otherwise we could just forget about CPUs, couldn't we?). A few things need to be true in order for a GPU to execute something with a speedup compared to a CPU (see the sketch after this list):

1) The code needs to have at least 1k, better still 10k+, parallel 'threads'.

2) These threads should be largely data parallel (branching is possible but hurts performance more significantly than on CPU).

3) Registers and memory per thread are limited; around 30-60 registers and 400-800k of memory are the limits for achieving reasonable saturation. If you disregard this, spilling to memory will occur (or the memory will just run out; there's no swapping, so it will just crash).

4) Because of (1) and (3), GPUs like so-called 'tight loops', i.e. many parallel but smallish kernels.
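
To make (1), (2) and (4) concrete, here's roughly what such a kernel looks like in CUDA (a minimal sketch with made-up names, not anything canonical): one element per thread, the only branch is a bounds check, and the body uses just a handful of registers.

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one 'thread' per element
        if (i < n)                                      // only branch: a bounds check
            y[i] = a * x[i] + y[i];                     // tiny body, few registers
    }

    // Host side: a 1M-element array yields ~1M parallel threads, well past 10k+.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);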


2) Some GPUs do not penalise branching

3) Some GPUs have MMUs (and share their paging with the host CPUs)


2) You mean Xeon Phi / Knights Corner? Yes, though I wouldn't call these GPUs; they have to be regarded a bit differently. They have quite a few problems in other areas, btw; so far the results from these systems are not very promising.

3) By MMUs do you mean unified memory? Well, yes, but for now this is so slow that you don't really want to use it. This might change on POWER systems with NVLink and for the Knights Landing generation of Intel accelerators, but that's still in the future / not publicly available.


I was rather talking about some of the mobile GPUs. It would have been highly inefficient to have non-uniform memory in mobile devices (although uniform memory does not always imply an MMU on the GPU side; that's a totally different story).


If anyone is even marginally interested in GPU internals, you'd do yourself a favor by checking out John Owens' UC Davis class on the topic[1]. I once watched the first lecture just to fill an hour and ended up going through the entire course within the span of a week, following up with my own research later on. Superbly interesting.

[1] https://itunes.apple.com/us/itunes-u/graphics-architecture-w...

or on youtube: https://www.youtube.com/playlist?list=PL4A8BA1C3B38CFCA0



Anyone interested in how GPUs work should read the series of blog posts by Fabian Giesen "A trip through the graphics pipeline": https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...


I'm sorry, I will try to dig up the source myself, but I've read basically the opposite argument in a few technical papers: that the GPU is NOT as fast as claimed for many classic test algorithms (the actual speedup is more like a factor of 2 than 10) and that the performance gap between traditional CPUs and GPUs is actually narrowing.

I'm going to read this article anyway to hear their take & for the learning experience, but does anyone remember any of the counter-arg articles?


disclaimer: I work in this space and have done so for a while, including previously on CUDA and on Titan.

GPUs for general purpose computation were never 100x faster than CPUs like people claimed in 2008 or so. They're just not. That was basically NV marketing mixed with a lot of people publishing some pretty bad early work on GPUs.

Lots of early papers that fanned GPU hype followed the same basic form: "We have this standard algorithm, we tested it on a single CPU core with minimal optimizations and no SIMD (or maybe some terrible MATLAB code with zero optimization), we tested a heavily optimized GPU version, and look the GPU version is faster! By the way, we didn't port any of those optimizations back to the CPU version or measure PCIe transfer time to/from the GPU." It was utterly trivial to get any paper into a conference by porting anything to the GPU and reporting a speedup. Most of the GPU related papers from this time were awful. I remember one in particular that claimed a 1000x speedup by timing just the amount of time it took for the kernel launch to the GPU instead of the actual kernel runtime, and somehow nobody (either the authors or the reviewers) realized that this was utterly impossible.

GPUs have more FLOPs and more memory bandwidth in exchange for requiring PCIe and lots of parallel work. If your algorithm needs those more than anything else (like cache), can minimize PCIe transfer time, and handles the whole massive-parallelism thing well, then GPUs are a pretty good bet. If it can't, then they're not going to work particularly well.
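
To illustrate the two classic mistakes (timing only the asynchronous kernel launch, and ignoring the PCIe copies), here is a rough CUDA sketch with a made-up kernel and sizes; nothing here is a real benchmark, just the shape of an honest measurement.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void scale(float *x, int n) {            // stand-in for the real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        int n = 1 << 24;                                 // 16M floats, ~64 MB
        size_t bytes = n * sizeof(float);
        float *h = (float *)calloc(n, sizeof(float)), *d;
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    // PCIe transfer in
        scale<<<(n + 255) / 256, 256>>>(d, n);              // launch returns immediately
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // PCIe transfer out
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                         // wait for all of it to finish

        float ms; cudaEventElapsedTime(&ms, start, stop);
        printf("end-to-end: %.2f ms\n", ms);  // stop a CPU timer right after the <<<>>>
                                              // call instead and you measure ~nothing
        cudaFree(d); free(h);
        return 0;
    }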

(now, if you need to do 2D interpolation and can use the texture fetch hardware on the GPU to do it instead of a bunch of arbitrary math... yeah, that's a _huge_ performance increase because you get that interpolation for free from special-purpose hardware. but that's incredibly rare in practice)


ah, yes. :) very nice detailed summary of some of the issues in this sect of "academia" (I put that in quotes only because all the research seems to be co-written by corps).

I am into audio DSP & am planning to port a couple of audio algorithms (lots of FFT & linear algebra) to run on the GPU, but haven't even gotten to it because I've considered it a premature optimization up to this point. I'm sure it would improve performance, but nowhere near what GPU advocates would claim.

My biggest reason? "PCIe transfer time to/from GPU", plus it would be unoptimized GPU code. Once you read a few of these papers it becomes painfully obvious that a lot of tuning goes into the GPU algorithms that offer anything more than a low single-digit factor of speedup. That's still very significant (cutting a 3-hour algorithm down to 1 would be huge), but if you're in an early stage of research it may be a toss-up whether it's better to just tune the algorithm itself / run computations overnight rather than go through the trouble of writing a GPU-based POC. Maybe if you have 1 or 2 under your belt it's not such a big deal, but for most of the researchers I know GPU algorithm rewrites would not be trivial. (I've been doing enterprise Java coding for about 2 years now, so the idea isn't so intimidating anymore, but in a past life of mucking around with Matlab scripts I'm sure it would have been daunting.)
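
FWIW, the GPU-side core of such a port can be fairly small if you lean on cuFFT; the sketch below (made-up function name and sizes, in-place single-precision transforms) batches many audio frames per call so the PCIe round trip mentioned above is amortized over a lot of work. Whether that actually beats a tuned CPU FFT still depends on the transfer/compute balance.

    #include <cuda_runtime.h>
    #include <cufft.h>

    // Forward-FFT 'batch' frames of 'frame_len' complex samples in one call.
    void forward_ffts(cufftComplex *h_frames, int frame_len, int batch) {
        size_t bytes = (size_t)frame_len * batch * sizeof(cufftComplex);
        cufftComplex *d_frames;
        cudaMalloc((void **)&d_frames, bytes);
        cudaMemcpy(d_frames, h_frames, bytes, cudaMemcpyHostToDevice);    // PCIe in

        cufftHandle plan;
        cufftPlan1d(&plan, frame_len, CUFFT_C2C, batch);        // one plan, many FFTs
        cufftExecC2C(plan, d_frames, d_frames, CUFFT_FORWARD);  // in-place, on-device

        cudaMemcpy(h_frames, d_frames, bytes, cudaMemcpyDeviceToHost);    // PCIe out
        cufftDestroy(plan);
        cudaFree(d_frames);
    }

    // e.g. forward_ffts(frames, 1024, 4096) transforms 4096 frames of 1024
    // samples in a single round trip over the bus.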


tmurray is basically right: most of the really big reported gains are artifacts of unoptimized CPU code. Except for hardware special functions on GPUs, you shouldn't be able to exceed the theoretical perf ratios between GPU and CPU, which are roughly ~30x in FLOPs and ~10x in bandwidth. Depending on whether arithmetic or memory is the algorithmic bottleneck, you'll hit one of those limits.
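
Put as a back-of-the-envelope bound (my phrasing of the same point, with F and B standing for peak FLOPs and memory bandwidth):

    \[
      \text{speedup} \;\lesssim\;
      \begin{cases}
        F_\text{GPU}/F_\text{CPU} \approx 30 & \text{(compute-bound)}\\
        B_\text{GPU}/B_\text{CPU} \approx 10 & \text{(bandwidth-bound)}
      \end{cases}
    \]

So a streaming, bandwidth-bound kernel like SAXPY tops out around 10x no matter how well the GPU code is tuned.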

I wrote a paper [1] on this in one particular domain (computational chemistry), more or less as a rebuttal to a paper that claimed enormous GPU speedups; the claimed speedup was a consequence of slow CPU code, not especially fast GPU code.

[1] http://cs.stanford.edu/people/ihaque/papers/2dtanimoto.pdf


One should not forget another important thing: 1 GFLOP on a CPU is more expensive in terms of power than 1 GFLOP on a GPU. So it's not only about chasing GFLOPs; in the mobile and embedded world it's also all about power.


But normalizing for power reduces the GPU advantage even more! Haswell, for instance, achieves about 5.3-5.8 GFLOP/W, compared to 24-28 GFLOP/W for Maxwell. That's less than a 5x theoretical computational gain.


Take a look at, say, the Raspberry Pi: 24 GFLOPs for half a watt. You won't get that from any number of mobile CPU cores.


Mobile SoCs' claimed numbers are hard to take at face value. For one, I'm 98% sure that's FP16 flops. For another, basically all SoCs in shipping devices throttle under load, so efficiency is hard to determine from unrelated peak-performance and max-power-draw numbers.

Anyway, the Cortex-A15 is capable of 8 flops per cycle per core, which puts it in pretty good shape for theoretical efficiency given its likely power draw at current clocks.


No, these are honest 32-bit GFLOPs. And no, the VC4 does not throttle; the power figures are given for real maximum load.

And I never managed to get close to 8 instructions per cycle on the A15, but, for example, an FFT implementation on the VC4 gets pretty close to its theoretical performance limit. And a fully loaded 4-core A15 will draw far more than 500mW anyway.


It's one instruction per cycle that gets you 8 flops. And what are you even arguing? Assuming it's unthrottled FP32, a quad-core A15 at 2GHz (8 flops x 2GHz x 4 cores = 64 GFLOPs) at around 7 watts works out to be over 5x less efficient.


I'm arguing that if all you have is 1W, you've got no other option but GPU.


but that's GLES 2.0, which is significantly less flexible than the kinds of GPUs we're discussing here and is not even in the same ballpark as a CPU (and almost certainly significantly less strict in terms of floating point precision than a GLES 3 device).


https://github.com/raspberrypi/userland/blob/master/host_app... is part of the Raspberry Pi GPU FFT example code. That is not GLES 2.0 or even GL of any kind. That's VideoCore QPU assembly language to compile with qasm. I haven't tried writing anything for it, but it certainly looks like it's "the kinds of GPUs we're discussing here" and "in the same ballpark as a CPU".


Yet, it's pretty sufficient for things like FFT.


Hah, I don't understand why this was (even temporarily) downvoted. You do realize that the CPU & GPU industries are at war with each other, right?

I've read the article now; it's a cool technical overview, but basically all of these processor-architecture articles have a slant in the paragraphs where they wax poetic (abstract, analysis/conclusion). I think it would be helpful for people to be aware of this...

AFAIK NVIDIA (their company name is at the top of this paper, btw, in case you weren't paying attention) are trying to generalize their chips to the point where they can enter the CPU market, and Intel chips can render 3D graphics well enough to handle most games that are ~5 years old (since Haswell, or maybe one or two generations before).

So this isn't a particularly slanted article, but there is a fair amount of propaganda / contrived performance studies in this market... NVIDIA & Intel are vying for each other's core customer bases. Anyone interested in the field should dig up the articles that try to debunk performance myths as well as studying architecture overviews.

(Some of the sentences in the last few paragraphs, for example, made me sorta queasy & would get shredded on Wikipedia.)


One thing that's always put me off from studying GPUs in detail is the proprietariness of everything; with few exceptions (Intel being one of them recently, and surprisingly enough Broadcom for the RPi), there's no detailed datasheet or low-level programming information publicly available for modern GPUs, and what is available is still not all that complete. Contrast this with CPUs where a lot of them have full, highly-detailed information on everything from pinouts to how to get them to boot. People have made their own simple computer systems by wiring up a CPU on a circuit board with some support chips, but I don't think I've seen anything like this done for any reasonably recent or even ancient GPU.

(I know there are VGA reimplementations available, and the VGA is quite well-documented, but that's more of a timing controller/dumb frame-buffer than a real GPU.)


There was this

https://www.flickr.com/photos/73923873@N05/sets/721576287942...

http://www.edaboard.com/thread236934.html

http://hackaday.com/2012/10/08/stm32-driving-a-pcie-video-ca...

Standalone code running on a Radeon HD2400 that isn't plugged into anything.

Author's blog: http://www.pixel.io/blog/ - he never released any source; actually, he had something posted to GitHub, but made the repo private.



A significant part of the functionality of a modern GPU is in software that abstracts away the differences between models and generations; from this point of view it does not make much sense to document the actual interface between software and hardware. The other thing is that the complexity of this software abstraction layer is comparable to the GPU itself, and manufacturers do not expect that anybody would want to implement all of that from scratch (this is similar to, e.g., FPGAs, where even when you know the bitstream format, you still have to write something non-trivial that generates the bitstream).


I would have been delighted to know the FPGA bitstream formats; place & route is certainly hard, but not at all impossible.

And some of the GPU vendors are publishing their datasheets specifically in the hope that an alternative open-source driver stack will appear.


You could make the same arguments against documenting the machine code of a CPU.


For a CPU, there is no other processor that can run all the abstraction software, so it has to be done in hardware, or in software in a way that is transparent to the user (microcode, Transmeta-style JIT...).


That's an implementation detail: the manufacturer supplies the system software, and by this argument you're not supposed to care where it runs.


Maybe off topic, but I'm actually really surprised that monitors and GPUs are still different pieces of hardware.

I'll admit I only know the basics of GPU architecture, so please forgive/correct me if I'm wrong about something. However, I am just too curious not to share.

I'll try to explain. A frame buffer is nothing but a bunch of 1s and 0s in memory, and a monitor is just a bunch of 1s and 0s in pixels. We currently have the GPU write to memory in parallel, and we currently write pixels to a monitor serially (and therefore interlacing). However, given the similarity between memory and pixels, why can't we optimize a GPU to write to pixels (in parallel) instead of memory? To the extreme, you could optimize your GPU to have one shader per pixel, and since the shaders all run on the same clock cycle, the whole monitor would update simultaneously. I think that would be really cool and, more importantly, efficient. In more practical terms you would probably have one GPU shader be responsible for some group of pixels (so you only need one shader per 4x3 or 16x9 block of pixels).

So, before you say it, I get you might disagree with me when it comes to desktop GPUs, since 1. the GPU memory needs to be close to RAM (you don't want to have the GPU memory be on the other side of a "long" cable) 2. You would like to update the hardware for a GPU separately from your monitor. However, in something like Mobile/Oculus, the form factor is so small/tightly coupled already, I'm surprised optimizations like this aren't being looked into.

Am I just not up to date? Is there something fundamentally wrong in my logic? Does getting rid of the frame buffer/interlacing not provide enough of a boost to make this worthwhile?


A number of problems. A huge one is wiring. Parallel is really complicated electrically: noise drowns out your signal and things run slowly. This is why most of our buses have switched over to serial (e.g. USB, PCIe, etc.). Sometimes we run those serial buses in parallel, but that still works out to be easier.

Timing is another huge one. Imagine running 2 million wires (for a 1080p display) that have to all be the exact same length to within some tolerance.

The longer those wires get, the harder this gets. This is another huge reason why the move to serial buses has happened. You can run 4 wires with really tight timings and the bits will fly, but if you try to run 16 wires all together, speed ends up dropping dramatically. The reality is that circuit boards don't have room for a large number of parallel traces that are all exactly the same length!

RAM is a huge exception to this, but extreme measures have been taken to make it happen: a good chunk of your motherboard is taken up getting the RAM connected, and memory controllers moved onto the CPU in part to get the RAM closer to the CPU and simplify the traces.

Note this is all the perspective of a software guy who has to listen to the hardware team grumble for most of the day. :)


Myer and Sutherland wrote a classic 1968 paper on what they came to call the Wheel of Reincarnation: simple displays accrue progressively more complexity until someone comes along and throws the whole thing away with a new, clean design.

Then someone finds that they can add a bit of processing to that display to make it go just a bit faster...

http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland...


In most modern architectures, GPUs are even decoupled from the video adapters, especially in the mobile world.


Shaders already run heavily in parallel. However, you can't write directly to monitor pixels in parallel like that, because you'd then need millions of individual connections to individual pixels, rather than scanning logic.

Along similar lines, consider that CPUs have billions of transistors but only a little over a thousand pins.


I'm currently taking an introductory course in computer graphics and we've been taught most of the things covered in this article, including the theory of the Phong lighting model and the graphics pipeline with different types of shaders. This is more or less a 25,000-foot overview of how computer graphics works and how the images on your screen came to be. It's highly interesting stuff, and knowing a small amount of the math behind how it works really gives me an appreciation for the things I see in video games and 3D animations. I wish this article had gone further to explain how the GPU actually produces results in the highly parallel way that it seems to skim over.


I wish there was a good book on GPU architecture and even micro-architecture. I just like reading about this stuff and how they work.


There is an excellent slide deck by Kayvon Fatahalian [0] that I consider to be the best high-level introduction to the topic (especially if you have a deeper understanding of how a CPU works). But I agree, more detailed insights would be great.

[0] http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf


Kayvon teaches at Carnegie Mellon now and his class slides are definitely worth reading:

http://graphics.cs.cmu.edu/courses/15869/fall2014/ http://15418.courses.cs.cmu.edu/spring2014/


Here's a 100-page booklet on the hardware/software interface that covers a lot of that kind of stuff:

http://sbel.wisc.edu/Courses/ME964/Literature/primerHW-SWint...

The last 20 pages are on GPUs.


Disclaimer: I work in the GPU industry.

If you're interested in the architecture of a GPU this Berkeley ParLab presentation by Andy Glew from 2009 covers the basics of how the compute cores in modern GPUs handle threading. It's a subtle, but powerful, difference from SIMD or vector machines.

http://parlab.eecs.berkeley.edu/sites/all/parlab/files/20090...

If you want to get into the details of how a GPU interfaces with the system and OS software, which is almost an entirely other animal, you may want to look at the Nouveau project to get oriented.

http://nouveau.freedesktop.org/wiki/


The latest (5th) edition of "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson covers GPUs (among many other things). One chapter on data parallelism covers SIMD, vector processors, and GPUs. Very good overall IMHO, but better check the contents to see if it suits your expectations.


Can't access the link, server overloaded?

"Description: Could not connect to the requested server host. "

Is there any other link for the paper?



The title should be changed to reflect that this is from 2007. The graphics that the article praises now look dated.


The 8800GTX was the first GPU I ever bought, back in 2007 (obviously I'm not very old). Now, 7 years later, it's funny how dated the render in "Figure 2" is.


When I started university I had a laptop with an 80 MHz, 8MB-memory discrete GPU ;) I believe it was an ATI Rage LT Pro.



