
Well, it's a first announcement on a blog. They say it accelerates TensorFlow by 10x. They say it fits in an HDD slot. And the whole announcement must stay within a page or two.

It's a "more details to follow" type of thing. Pretty standard actually.




The cool thing here is that it's already widely deployed in production, so we know it actually works and that it's not just vaporware. The "more details to follow" is thus a lot more convincing than for a lot of similar-seeming announcements of things that don't actually exist yet (and often don't ever).


> They say it accelerates TensorFlow by 10x

They say 10x performance / watt, nothing about performance per unit time.


You can make some assumptions, though. If the power consumption were equal, the performance would be 10x.

The speed at which an ASIC will run is constrained by temperature (power dissipation) and logic timing, which itself has a dependency on temperature.

So we could call that vertical scaling, up to some power ceiling which may not take us all the way to 10x, but it's not impossible.

Then there is horizontal scaling, which I assume is applicable to these problems: running more units in parallel.

In both cases, I think it's safe to assume they are getting a performance increase in the instantaneous sense.
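To put some rough numbers on that (these are my own assumptions, not anything from the announcement): 10x performance per watt only turns into 10x throughput if the chip can spend roughly the same power budget as the GPU it replaces. A quick sketch, with a made-up 250 W baseline:

    # Back-of-the-envelope sketch; the GPU power figure and TPU power
    # ceilings below are assumptions for illustration, not published specs.
    gpu_throughput = 1.0    # normalized ops/s for the baseline GPU
    gpu_power = 250.0       # watts (assumed)

    perf_per_watt_gain = 10.0   # the claimed figure
    tpu_perf_per_watt = perf_per_watt_gain * (gpu_throughput / gpu_power)

    for tpu_power in (75.0, 150.0, 250.0):   # hypothetical power ceilings
        speedup = tpu_perf_per_watt * tpu_power / gpu_throughput
        print("at %3.0f W the TPU is %.1fx the GPU" % (tpu_power, speedup))
    # at  75 W -> 3.0x, at 150 W -> 6.0x, at 250 W -> 10.0x

So even taking the efficiency claim at face value, the instantaneous speedup depends on how much power the part is actually allowed to dissipate.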


> You can make some assumptions though. If the power consumption was equal, the performance is 10x.

While I agree some performance-per-unit increase is likely, how does a direct 10x increase based on power savings follow? Less power usage does not mean that the chip can run through more flops in the same amount of time, right?


It does if power was the limiting factor in clock speed.


The relationship between clock speed and power consumption is nonlinear.

http://electronics.stackexchange.com/questions/122050/what-l...

(see graph in the first answer)

Also, it's not known whether the TPU has a way to increase the clock speed arbitrarily, nor is it known whether its architecture can ensure correctness at arbitrary clock frequencies. Some architectures make assumptions like "The time for this gate to reach saturation is very small compared to the clock period, so we'll pretend that it's instantaneous."
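For anyone who doesn't want to click through: the standard CMOS approximation is that dynamic power scales with C·V²·f, and since voltage typically has to rise roughly with frequency, power ends up growing roughly with the cube of the clock. A toy sketch of that relationship (my own numbers, nothing TPU-specific):

    # Dynamic power ~ C * V^2 * f; assume V scales ~linearly with f,
    # so normalized power grows roughly as f^3. Purely illustrative.
    def dynamic_power(freq, base_freq=1.0):
        voltage_scale = freq / base_freq
        return voltage_scale ** 2 * (freq / base_freq)

    for f in (1.0, 1.5, 2.0):
        print("%.1f GHz -> %.1fx power" % (f, dynamic_power(f)))
    # 1.0 GHz -> 1.0x, 1.5 GHz -> 3.4x, 2.0 GHz -> 8.0x

Which is why "10x perf/watt" doesn't simply translate into "clock it 10x higher."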


> They say 10x performance / watt

Well, that's the kind of metric you'd expect from a cloud provider. That's what's important to them.

If you're a tinkerer dabbling in TPU acceleration on your gaming/coding PC alongside GPU acceleration, then the metric that would be interesting for you is speed increase per unit.


In March at the GCP NEXT keynote [1], Jeff Dean demoed Cloud ML on GCP. He casually mentioned passing in the argument "replicas=20" to get "20 way parallelism in optimizing this particular model". GCE does not currently offer GPU instances, and I've never heard the term "replicas" in GPU ML discourse, so these devices may enable a type of parallelism that we have not seen before. Furthermore, his experiments apparently use the Criteo dataset, which is about 10 GB. Now, I haven't looked into the complexity of the model or to what extent they train it, but right now that sounds really impressive to me.

1: https://youtu.be/HgWHeT_OwHc?t=2h13m6s
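For what it's worth, "replicas" in TensorFlow's distributed-training vocabulary usually means data-parallel copies of the same model, each working on its own slice of the batch, with gradients averaged across them. A toy NumPy sketch of what 20-way replication amounts to (my own illustration of the idea, not the Cloud ML API):

    import numpy as np

    # Synchronous data parallelism across N "replicas": each replica
    # computes gradients on its own shard, then the gradients are
    # averaged and applied once. Model and data here are made up.
    def grad(w, x, y):
        # gradient of mean squared error for a linear model
        return 2 * x.T @ (x @ w - y) / len(y)

    rng = np.random.default_rng(0)
    w = np.zeros(5)
    x, y = rng.normal(size=(2000, 5)), rng.normal(size=2000)

    n_replicas = 20
    x_shards = np.array_split(x, n_replicas)
    y_shards = np.array_split(y, n_replicas)

    for step in range(100):
        # on real hardware each shard's gradient runs on its own device
        grads = [grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
        w -= 0.1 * np.mean(grads, axis=0)

Whether the TPUs make that kind of scaling cheaper than a rack of GPUs is exactly the detail the blog post doesn't give.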




