Hacker News | neilmovva's comments

It matters when defining parallel work distribution. Unless memory bandwidth is homogeneous across the whole board (i.e. each TPU on a board gets 600 GB/s to its peers), we can't do model parallelism across ASICs efficiently, and must fall back to data parallelism. That's fine until you run into limits on maximum batch size (e.g. up to 8192, as FAIR was able to manage [1] with some tweaks to SGD).

[1] https://arxiv.org/abs/1706.02677
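
For reference, the main "tweak" in [1] is a linear learning-rate scaling rule plus a gradual warmup; roughly:

    \eta = k \cdot \eta_{\text{base}}, \qquad \text{when the minibatch size is scaled from } n \text{ to } kn

i.e. the learning rate grows in proportion to the batch size, and is ramped up over the first few epochs so that early training stays stable.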


Intel's "10nm" has roughly twice the transistor density vs. Samsung/TSMC "10nm", so I wouldn't compare based on the advertised process names.
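
For rough scale, using commonly cited logic densities (Intel's ~100 MTr/mm^2 claim for its 10nm vs. ~52 MTr/mm^2 estimates for Samsung/TSMC 10nm; exact figures vary by cell library):

    \frac{\rho_{\text{Intel 10nm}}}{\rho_{\text{Samsung/TSMC 10nm}}} \approx \frac{100\ \text{MTr/mm}^2}{52\ \text{MTr/mm}^2} \approx 1.9\times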


That's true, but only if Intel's _proposed_ 10nm specs match what they ship. They haven't shipped anything; that's the problem.


What about Ryzen? I forget its process size and which foundry they use.


The fact that "2TB is addressable" is irrelevant. Putting NAND on the board doesn't improve latency/bandwidth nearly enough to function like VRAM. Nvidia has also supported unified virtual memory since Pascal, meaning you can address your CPU's RAM in GPU code. The "SSG" card still has the 2TB of flash on a PCIe interface, so there's not much difference from existing systems beyond marketing. I'd expect very few real-world perf wins.
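
To illustrate the unified-memory point, here is a minimal CUDA sketch (plain managed memory as exposed since Pascal, nothing SSG-specific; oversubscribing the card's DRAM assumes a Pascal-or-newer GPU on Linux). The allocation can exceed on-board VRAM, and pages simply migrate over PCIe on demand, which is exactly why flash or host RAM behind a PCIe link still doesn't behave like VRAM:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        // A managed allocation larger than most cards' DRAM (32 GiB of floats).
        size_t n = 8ull << 30;
        float* x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));    // single unified address space
        for (size_t i = 0; i < n; ++i) x[i] = 1.0f;  // pages first touched on the host
        unsigned blocks = (unsigned)((n + 255) / 256);
        scale<<<blocks, 256>>>(x, n);                // pages migrate to the GPU on demand
        cudaDeviceSynchronize();
        printf("%f\n", x[0]);                        // and migrate back on host access
        cudaFree(x);
        return 0;
    }

Whether the backing store is host DRAM or the SSG's flash, the data still crosses a PCIe-class link, so the working set has to fit in HBM/GDDR to see DRAM-class bandwidth.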


Vega is already drawing near 300W, and is so high up on the voltage/frequency curve that even a measly 5-10% gain in core clock can easily cost over 100W more.

Here, AMD is once again (see last year's Polaris) a victim of the inferior GlobalFoundries 14nm LPP process. TSMC 16nm would have been much better in perf/W, but sadly a very restrictive wafer supply agreement locks AMD to GloFo for the time being.


I would have reached the same conclusion if Ryzen were not also built on GloFo's 14nm LPP.

It is just sad that AMD could not build momentum in both graphics and CPUs at the same time.


"graphics memory" is probably not the right term here. While the Pro SSG does have 2TB of storage on board, that comes in the form of a NAND flash, PCIe SSD (from Samsung IIRC). While fast, this SSD provides single digit GB/s bandwidth at most, compared to the hundreds of GB/s of bandwidth that the DRAM-based GDDR5 or HBM provide. When we say "graphics memory," we're almost always referring to the latter, and never see more than maybe 32 GB of DRAM based memory on board.

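To put illustrative round numbers on that gap (not measured figures):

    \text{stream a 16 GB working set:}\quad \frac{16\ \text{GB}}{\sim 5\ \text{GB/s (NAND over PCIe)}} \approx 3\ \text{s} \qquad \text{vs.} \qquad \frac{16\ \text{GB}}{\sim 500\ \text{GB/s (HBM/GDDR5)}} \approx 30\ \text{ms}

Roughly two orders of magnitude, which is why the on-board SSD is better thought of as fast local storage than as graphics memory.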

Yeah, that seems like false advertising. I believe that much true HBM/GDDR5 would have to cost tens of thousands of dollars. So I was skeptical. That said: if I could get a card with that much real GDDR5 on it, I'd consider paying whatever it took.


I'm more curious about the latency and how that impacts performance; even with DMA, going over the PCIe bus to an NVMe SSD is a longer trek.


GPUs are generally not latency optimized anyway - if an app is sensitive to tens of microseconds then it belongs on the CPU, broadly speaking.

AMD's advantage here is getting drop-in throughput by attaching the SSD directly to the GPU. The onboard SSD thus doesn't waste host PCIe lanes (4 lanes, typically), which are getting hard to come by in regular desktop computing systems.


Isn't the "2TB" NVMe SSD (in this case) directly interfaced to the card, so not going over the system PCIe bus?


It is, which is why I'm curious how this improves performance compared to the usual scenario of hitting NVMe storage over the host bus.


Taking a more skeptical look at the issue, we might have Intel to blame for the stagnant system interconnect.

As it gets harder each year to improve CPU performance, the HPC community has shifted to more workload-specific accelerators, most notably GPUs for high-throughput parallel data processing. These accelerators still rely on the host CPU to dispatch commands, but in recent years we've seen workloads increasingly target the accelerator device itself (e.g. neural networks on the GPU).

If one wants to build a multi-GPU cluster, then the CPU quite plainly "gets in the way" - performance can very quickly be bottlenecked by weak inter-device bandwidth (PCIe 3.0 x16 = 16GB/s, vs. the GTX 1080 Ti onboard DRAM's ~500 GB/s). Not to mention the fact that the PCIe controller is on the CPU die, meaning inter-node bandwidth and latency also strictly favor the CPU.

For large-scale systems, the value proposition of multi-GPU is greatly neutered by reliance on the PCIe bus, and so Intel stays relevant for many applications. And for the last 5 years, Intel's utter dominance of the HPC/server market meant that they could limit PCIe lanes without much pressure from their customers. With Ryzen/EPYC (128 PCIe 3.0 lanes!!) that old order looks set to change.
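
Some back-of-the-envelope numbers for both points (illustrative, ignoring protocol overhead and lanes reserved for other I/O):

    \frac{1\ \text{GB of gradients}}{\sim 16\ \text{GB/s (PCIe 3.0 x16)}} \approx 60\ \text{ms} \qquad \text{vs.} \qquad \frac{1\ \text{GB}}{\sim 500\ \text{GB/s (on-board DRAM)}} \approx 2\ \text{ms}

    128\ \text{lanes} \;/\; 16\ \text{lanes per GPU} \;=\; 8\ \text{GPUs at full x16 on a single socket}

So even modest inter-GPU traffic per step can dominate a training iteration, and lane count decides how many accelerators can talk at full speed.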


A friend who works on NVLink would have a very similar response if prompted. And they were (probably still are) trying some more interesting ways of working around PCIe that were hamstrung by Intel et al, given the control they have over the controller market as well. It's a shame but we'll see if they pay for it after all!

NVIDIA is also in a tough spot in that they can't cooperate too closely with either Intel or AMD now that the race for accelerators is heating up. Understandably they're pretty psyched about Tegra but I'm not sure I'd want to rely on that and I bet they don't either.


I don't think that's the whole story, though. With some advances in storage, PCIe is not far from being a bandwidth bottleneck even in consumer PCs (thinking of NVMe as an example).


> NVMe

Even there the CPU can be the bottleneck. I have a Samsung 960 Pro that's theoretically capable of 3 GB/s reads, but with disk encryption, even with AES-NI, the processor can only do ~2 GB/s.


What about with SIMD?


That would probably help. Not sure what Linux LUKS supports specifically, but it's testable via "cryptsetup benchmark".


What are the current maximum bandwidth speeds of PCIe solid-state memory? I'm not aware of any that go over 4 GB/s, let alone the 16 GB/s of PCIe 3.0 x16.


fp32 is almost never "gimped," since that's the basis of performance for most video games and 3D applications. Perhaps you meant fp64? That sees more use in industrial apps (e.g. oil and gas exploration), so it's often slower on consumer cards, and historically it was artificially limited. For recent GPUs, it's not quite fair to call fp64 gimped, since the fp64-equipped cards are completely different designs - the lack of fp64 on this card is a design choice, not an artificial restriction for product segmentation.
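
For a sense of scale, using approximate publicly listed peak rates: the compute-oriented GP100 runs fp64 at half its fp32 rate, while the consumer GP102 die carries only a token 1/32-rate fp64 path:

    \text{GP100 (Tesla P100):}\quad \sim 9.3\ \text{TFLOPS fp32} \times \tfrac{1}{2} \approx 4.7\ \text{TFLOPS fp64}

    \text{GP102 (Titan X class):}\quad \sim 11\ \text{TFLOPS fp32} \times \tfrac{1}{32} \approx 0.34\ \text{TFLOPS fp64}

The gap comes from execution units that simply aren't present on the consumer die, not from a firmware switch.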


FP16 is also becoming a feature of interest, especially for machine learning.
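
A minimal CUDA sketch of why it is attractive (assuming a GPU with fast native fp16 arithmetic, compiled for sm_60 or newer; the __half2 type packs two fp16 values so one instruction covers two operations):

    #include <cuda_fp16.h>

    // y[i] = a * x[i], element-wise, on packed fp16 pairs.
    // Each __half2 holds two fp16 values, so __hmul2 performs two multiplies at once.
    __global__ void scale_h2(const __half2* x, __half2* y, __half2 a, int n2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            y[i] = __hmul2(a, x[i]);
        }
    }

Half precision also halves the memory footprint and bandwidth per value, which often matters as much as the extra arithmetic rate for ML workloads.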


The card reportedly has a 4x speed INT8 mode.
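
If that refers to the dp4a path on Pascal-class NVIDIA parts (sm_61 and newer), it is exposed in CUDA as the __dp4a intrinsic: a 4-way dot product of packed 8-bit values with a 32-bit accumulate, in one instruction. A minimal sketch:

    // out[i] += dot(a[i], b[i]), where each int packs four signed 8-bit values.
    // On sm_61+ this compiles to a single DP4A instruction, hence the ~4x INT8 rate.
    __global__ void dot_i8(const int* a, const int* b, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = __dp4a(a[i], b[i], out[i]);
        }
    }

This is mostly useful for quantized inference, where weights and activations are stored as 8-bit integers.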


Is it worth buying this GPU instead of the Pascal X (price aside), in terms of FP performance, given that the 1080 is for gamers and the Pascal X is meant for ML/DL?


I'm not sure what the original post is referring to but many early GPUs were not fully IEEE FP32 compliant, where many operations (particularly trig functions if I recall) would have 24 or fewer bits, either not supporting FP32 or requiring many cycles to compute. This article seems to describe some of the Radeon cards: https://en.m.wikipedia.org/wiki/Minifloat


The Radeon cards mentioned in that article are really old ones that predate modern unified shader architectures, DirectX 10, and compute support. As far as I know all non-mobile-phone GPUs released in the last decade or so support FP32 precision, though not necessarily with full IEEE compliance.


I think they may have meant fp16, which is of interest to machine learning programmers.


Yeah I was referring to the arbitrary gimping of the cards to drive Tesla sales


Minor correction - the 15" Retina MacBook Pro is 2880x1800. The 13" Retina MacBook Pro is 2560x1600.


> "only worth 28B"

I feel like this statement is telling in and of itself - since when is a $28B startup with Uber's (lack of) revenue and lasting IP considered normal?

Perhaps these numbers have yet a ways to fall.


Uber must be doing billions in revenue, since they handle the payments on all rides ordered through their system. They're not giving free rides after all.


That might be one way to look at it, but I doubt that is relevant. The total money that uber handles would be equivalent to ebay's "gross merchandise volume". The transaction takes place between the rider and the driver and uber is only facilitating it. You wouldn't say that a credit card company has the entire economy as its revenue, so you wouldn't say that all ride charges are Uber's revenue.


The rider and driver can't negotiate prices independently or pay through another service.

Uber determines both the display price offered to the customer and the revenue share paid to the driver, so IMO the total sum of transactions really is all Uber's revenue.


I'd hazard they are making billions in sales. Elsewhere in this thread they are quoted as claiming to make 20 cents per ride, $62 million total, in revenue.


Revenue is the same thing as sales. The $62 million number would be gross profit.


The hurdle is heat - we can do something like this in mobile SoCs, which commonly place DRAM on top of the CPU (package-on-package). But for a TDP > 10 watts, the memory layer effectively insulates the main die from whatever thermal management is used, making it unworkable. Unfortunately, this problem stands to get worse as transistors shrink, since power density will keep going up.
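
Rough illustrative numbers (die areas and TDPs approximate):

    \text{mobile SoC:}\quad \frac{\sim 5\ \text{W}}{\sim 100\ \text{mm}^2} \approx 0.05\ \text{W/mm}^2 \qquad \text{vs.} \qquad \text{high-end GPU:}\quad \frac{\sim 250\ \text{W}}{\sim 500\ \text{mm}^2} \approx 0.5\ \text{W/mm}^2

So roughly an order of magnitude more heat per unit area would have to pass through, or around, any DRAM stacked on top of the die.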

