I'm happy to hear that this is finally public so I can actually talk about the work I did when I was at Google :-).
I'm a bit surprised they announced this, though. When I was there, there was this pervasive attitude that if "we" had some kind of advantage over the outside world, we shouldn't talk about it lest other people get the same idea. To be clear, I think that's pretty bad for the world and I really wished that they'd change, but it was the prevailing attitude. Currently, if you look at what's being hyped up at a couple of large companies that could conceivably build a competing chip, it's all FPGAs all the time, so announcing that we built an ASIC could change what other companies do, which is exactly what Google was trying to avoid back when I was there.
If this signals that Google is going to be less secretive about infrastructure, that's great news.
When I joined Microsoft, I tried to gently bring up the possibility of doing either GPUs or ASICs and was told, very confidently, by multiple people that it's impossible to deploy GPUs at scale, let alone ASICs. Since I couldn't point to actual work I'd done elsewhere, it seemed impossible to convince folks, and since my job was in another area, I gave up on it, but I imagine someone is having that discussion again right now.
Just as an aside, I'm being fast and loose with language when I use the word impossible. It's more that my feeling was that you have a limited number of influence points, and I was spending mine on things like convincing my team to use version control instead of mailing zip files around.
Google has always been strategic about announcing what it was doing, ever since the early days (I used to work there too). Think about the impact the first MapReduce and GFS/BigTable papers had.
My guess as to why they're announcing the TPU is that they are feeling the pressure from Facebook and other AI labs, and want to reinforce their reputation as being the best place to do AI research. By revealing that AlphaGo was based on this hardware, they indicate to researchers around the world that if you want to build the most advanced ML models you need to be at Google. Same reason they talked about MapReduce/GFS back in the day.
It really boosts Google Cloud Platform's image and prestige; they're coming out with what is pretty much TensorFlow as a service with their machine learning product.
Coming out saying they can give you a service no one else can right down to a custom chip may sway a few buyers in the market.
Well, this also shows their commitment to TensorFlow if they're willing to pay to fab custom chips for it. Then again, in the Google scheme of things that's probably not a huge cost.
They've probably carefully modeled the energy savings over GPU/FPGA and found it to be substantial enough, even taking into account costs of design changes.
For me this is true. GPUs on Amazon are not cheap, and if Google passes on the savings from better performance and power consumption, I'd certainly offload some work to Google Cloud.
This is true - I've heard rumors that GCE has been trying hard to poach cloud consumers from other providers, this could be another stroke in that effort.
Note that the version of AlphaGo that beat Fan Hui and was presented in Nature is significantly different from the version that played Lee Sedol. Unless you believe that they didn't work on it for half a year.
This explains a lot. Going into the match, Sedol thought he could beat DeepMind but thought it might only be a couple of years until the technology outpaced him. We knew Fan Hui and other Go professionals were helping the team, but a massive speedup is always nice too.
It's a bit underhanded, however. IMHO, the player should be able to study recent games before the match. But this is pretty typical, there were similar late-stage improvements with Chinook (checkers) and Deep Blue (chess).
Sure... I think you are missing the point. In the November Nature paper, which describes the AlphaGo algorithm, the hardware detailed was exclusively CPUs+GPUs.
Between that time and the Lee Sedol match, the hardware running AlphaGo was switched to these TPUs.
From the paper:
"The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs."
Or they consider the TPU to be something similar to a math co-processor, it just offloads work from the CPU. They also didn't discuss the custom built power modules, or the custom built network switches used for interconnects.
I'm not even a hardware or AI person, and even I could have told you ASICs would make way more sense than GPUs or FPGAs for machine learning. It's all about data locality. Fetching memory is the most costly thing a GPU does, and for ML (DNNs) there's no big need for global memory access 99% of the time.
Anyone who casually follows AI knows that people have been talking about making DNN ASICs for some time. It was all a matter of time and $$$$$$.
There is no doubt FB is working on them too. Which is why Google is finally publicly saying that "we did it first ;)"
That's kind of what Movidius says, too, about its Myriad 2 VPU, which is kind of a GPU (SIMD-VLIW) with larger amounts of local memory combined with hardware accelerators.
> Google is finally publicly saying that "we did it first ;)"
Makes sense. In that respect, yeah, they probably wanted to keep it under wraps to keep Facebook and others from getting a timeline estimate out of it and jumping ahead.
It makes a lot more sense now why they open sourced TensorFlow -- they have an ASIC design that supports it, and if they either want to sell the chips, or sell access to them via Google Cloud Platform, they'll need customers. And it's not like HP is about to roll out a server line sporting these any time soon, so Google will continue to have their advantage over the outside world.
Well, it's a first announcement on a blog. They say it accelerates TensorFlow by 10x. They say it fits in an HDD slot. And the whole announcement must stay within a page or two.
It's a "more details to follow" type of thing. Pretty standard actually.
The cool thing here is that it's already widely deployed in production, so we know it actually works and that it's not just vaporware. The "more details to follow" is thus a lot more convincing than for a lot of similar-seeming announcements of things that don't actually exist yet (and often don't ever).
You can make some assumptions though. If the power consumption was equal, the performance is 10x.
The speed at which an ASIC will run is constrained by temperature (power dissipation) and logic timing, which itself has a dependency on temperature.
So we could call that vertical scaling, to some power ceiling which may not take us all the way to 10x, but it's not impossible.
Then there is horizontal, which I assume is applicable to these problems... running more in parallel.
In both cases, I think it's safe to assume they are getting a performance increase in the instantaneous sense.
> You can make some assumptions though. If the power consumption was equal, the performance is 10x.
While I agree some performance-per-unit increase is likely, how does a direct 10x increase based on power savings follow? Less power usage does not mean that the chip can run through more flops in the same amount of time, right?
Also, it's not known whether the TPUs have a way to increase the clock speed arbitrarily, nor whether their architecture is capable of ensuring correctness at arbitrary clock frequencies. Some architectures make assumptions like "the time for this gate to reach saturation is very small compared to the clock period, so we'll pretend it's instantaneous."
Well, that's the kind of metric you'd expect from a cloud provider. That's what's important to them.
If you're a tinkerer dabbling in TPU acceleration on your gaming/coding PC alongside GPU acceleration, then the metric that would be interesting for you is the speed increase per unit.
In March at the GCP NEXT keynote [1], Jeff Dean demos Cloud ML on the GCP. He casually mentions passing in the argument "replicas=20" to get "20 way parallelism in optimizing this particular model". GCE does not currently offer GPU instances. I've never heard the term replicas in the GPU ML discourse. These devices may enable a type of parallelism that we have not seen before. Furthermore, his experiments are apparently using the Criteo dataset, which is a 10GB dataset. Now, I haven't looked into the complexity of the model or to what extent they train it, but right now that sounds really impressive to me.
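As a sketch of what N-way "replica" data parallelism usually means (plain NumPy, my own illustration, nothing to do with the actual Cloud ML API): each replica computes gradients on its own shard of a batch and the results are averaged before the update.

    import numpy as np

    rng = np.random.default_rng(0)
    REPLICAS = 20
    w = np.zeros(10)                                # model: simple linear regression
    X = rng.normal(size=(2000, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=2000)

    def gradient(w, Xs, ys):
        # gradient of mean squared error on one shard of the data
        return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

    for step in range(100):
        shards = zip(np.array_split(X, REPLICAS), np.array_split(y, REPLICAS))
        grads = [gradient(w, Xs, ys) for Xs, ys in shards]   # one gradient per replica
        w -= 0.05 * np.mean(grads, axis=0)                   # averaged update

    print("final loss:", np.mean((X @ w - y) ** 2))

Whether the TPUs do something cleverer than this (model parallelism, pipelining across chips) is exactly the kind of detail the post doesn't say.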
Since nobody has pointed this out, although he may be able to tell people "I was working on that thing", he likely can't say much more. Many large companies don't allow employees to disclose details that haven't been released publicly.
The article doesn't actually imply (though people might wish so) that the described TPU is intended to ever be a product available to others.
It's a post saying "that's how we do it" that serves a bunch of political and PR goals, but it's quite likely that this will stay an internal technology; maybe available indirectly as a cloud computing offering - in which case they'll give the specs and price of the whole solution, not of a particular model of TPU chip.
Not just lower precision but probably less error correction in the lower bits. -Or- ... they convert the floating point values to analog values and do all the math in the analog domain and convert it back, just like old analog computers. But by going fully analog, I think the gains would be over 10x better, so probably not.
They likely use half precision (i.e. 16 bit) floats in some of the computing stages. They've also discussed using even fewer bits of precision in some of their research output. When training and following an error gradient you kind of need lots of precision, but when running a learned model you often don't need much at all, 8 or even 4 bit numbers are sufficient in some calcs where rounding error propagation isn't an issue.
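As a minimal sketch of that low-precision-inference point, here's a made-up 8-bit weight quantization scheme (illustrative only, not Google's actual format) compared against float32:

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)   # "learned" weights
    x = rng.normal(size=256).astype(np.float32)                     # one input vector

    scale = np.abs(w).max() / 127.0
    w_q = np.round(w / scale).astype(np.int8)    # 8-bit weights plus one float scale

    y_fp32 = w @ x
    y_int8 = (w_q.astype(np.int32) @ x) * scale  # dequantize after the matmul

    rel_err = np.abs(y_fp32 - y_int8).max() / np.abs(y_fp32).max()
    print(f"max relative error with 8-bit weights: {rel_err:.4f}")

For inference the resulting error is typically down in the noise, which is why fewer bits (and therefore fewer transistors and less memory bandwidth) can be such a big win.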
ASIC stands for "application specific integrated circuit" and the name conveys a special-purpose design of an integrated circuit, like for example bitcoin mining.
But that's just terminology/convention. One could argue a GPU is an ASIC, and a CPU is an ASIC. The only thing to argue is how specific does an application have to be to call it an ASIC instead of some other made up name.
The prime differentiators are the process, which is the term used for the many steps of fabrication of an IC (processes are often referred to as nodes, distinguished by the smallest feature size of a transistor they create), and the ability of the design engineers to create an optimal design, in terms of boolean logic and semiconductor physics.
Given the right team of engineers, a top-notch foundry, and a great deal of experience in the problem domain (machine learning in this case), a custom IC could very likely trounce a GPU.
GPUs were designed for the domain of graphics processing, which happens to have some commonality with the processing in machine learning. But, at least until recently, GPUs weren't focused on machine learning. Just graphics.
Now the GPU vendors are trying to leverage their knowledge of graphics processing and building of graphics processors to create machine learning processors, but the thing they are leveraging could also be what handcuffs them. Which gives opportunity for a company like Google to do a fresh take on the problem domain without the baggage of the knowledge of graphics processing.
Chip designs are usually simultaneously tested in software simulation, FPGA, and hardware emulators (basically special-purpose supercomputers that run Verilog or VHDL; not quite FPGA speeds, but far better debug capabilities, approaching that of software RTL simulation; the downside is they cost a few million apiece).
You'll also break a design down into units or blocks. These can be tested separately. E.g. you might see an L1 cache as a single unit and have a testbench (a piece of HDL to stimulate and check a design) to test only that. This is useful as it's far quicker than a full-design test; however, extensive full-design testing is still required, as many bugs you see are the consequence of interactions between units.
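As a rough software analogy of that stimulate-and-check idea (Python standing in for HDL, with a hypothetical 8-bit multiply-accumulate block as the unit under test):

    import random

    ACC_BITS = 24

    def mac_unit(acc, a, b):
        # model of the unit under test: acc += a * b, wrapping at ACC_BITS
        return (acc + a * b) & ((1 << ACC_BITS) - 1)

    def golden_model(acc, a, b):
        # independent reference the testbench checks against
        return (acc + a * b) % (1 << ACC_BITS)

    random.seed(0)
    acc_dut, acc_ref = 0, 0
    for _ in range(10_000):                          # random stimulus
        a, b = random.randrange(256), random.randrange(256)
        acc_dut = mac_unit(acc_dut, a, b)
        acc_ref = golden_model(acc_ref, a, b)
        assert acc_dut == acc_ref, (a, b, acc_dut, acc_ref)
    print("unit-level checks passed")

In real flows the checker is usually a reference model like this plus coverage collection, and the same testbench ideas scale up to full-chip tests.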
Once you've done a lot of testing you'll tape out a chip. This may well have some issues so you'll do a new 'spin'.
I'm suspecting that the design contains a lot of repetitive units, each of which can easily be tested in software. So perhaps testing on an FPGA isn't even needed.
I think zymhan meant prototyping on an FPGA to see if you could actually see an improvement by mapping parts of your algorithm to custom hardware. This would be a sensible first step before committing to a full ASIC run. I'm sure Google did do this initially.
As for testing for correctness, ASIC designs are almost always tested in software simulation instead of on FPGAs. Mainly because RTL written for FPGAs can look quite different from RTL written for ASIC synthesis. There's also a lot of incidental engineering work needed to get something working on an FPGA.
> ASIC designs are almost always tested in software simulation instead of on FPGAs. Mainly because RTL written for FPGAs can look quite different from RTL written for ASIC synthesis. There's also a lot of incidental engineering work needed to get something working on an FPGA.
It's actually very common to test ASIC RTL on FPGA. You are right that if you are targeting an FPGA you may write things differently, but this doesn't mean RTL written for an ASIC won't work; it's just that it might not clock as fast as it could and may not make the most efficient use of resources.
There is a lot of engineering work to take ASIC-targeted IP and run it in an FPGA, but thorough verification of an ASIC design is extremely important, so it's usually worth the time to bring up an FPGA that can run it (sometimes, when the design is too large, you have to split it across multiple FPGAs).
I've only dabbled in this area, so it's entirely possible my question doesn't make sense. Machine learning algos in general seem to have a bunch of iterative components, but within each iteration there is massive parallelism. Did you consider using a whole bunch of DSPs? They can also do parallel processing but might be cheaper in bulk than GPUs.
My vague impression is that there had started to be more and more research on use of ASICs for ML (though I can't say I follow it that closely). Enough of a changing tide that the competitive advantage was going?
What was your design flow? Did you use the standard Cadence/Synopsys toolset to build this? How did you get the basic design and the JIT (or whatever you call the machine-code part of the Python VM) working together?
So now the open sourcing of the "crown jewels" AI software makes sense.
Competitive advantage is protected by custom hardware (and huge proprietary datasets).
Everything else can be shared. In fact it is now advantageous to share as much as you can; the bottleneck is the number of people who know how to use the new tech.
Indeed. I kind of realized this more recently when I saw some new chips promote that Tensorflow works on them. This is really what it was all about - getting chips everywhere to support Google's version of AI, which means that in the long term they can get those chips themselves in volume - chips that are optimized for their AI out of the box.
Joel Spolsky: "commoditize your complements." Steve Ballmer: "developers! developers! developers!"
Developers are a complement to hardware, software, and data based businesses. Google probably spends more on employees than on hardware or data. They really want to commoditize developers that can work on their stuff.
Google's business is based on having more and better data than their competitors. In many cases they have been happy to commoditize hardware, making open their datacenter designs for instance. That helps lower their costs for building datacenters.
In other cases they open source software, like TensorFlow and the Go language. These choices are made to commoditize developers. Google wants there to be a big pool of people who know how to use the technologies that Google uses. More developers means less cost for Google to hire and train employees. Which is their biggest expense: win!
With the TPU, as long as no one else is doing that kind of hardware for machine learning, it is a proprietary advantage to keep it secret. But at some point the logic flips: when others start to do similar things Google would rather commoditize their version of the tech. Because the lesson of the last 40 years is any widely used hardware WILL become commoditized. The inertia is with software codebases and developer knowledge.
Cool how he foreshadows the end of Sun (takeover by Oracle in 2010) in that article from 2002:
"Sun's two strategies are (a) make software a commodity by promoting and developing free software (Star Office, Linux, Apache, Gnome, etc), and (b) make hardware a commodity by promoting Java, with its bytecode architecture and WORA. OK, Sun, pop quiz: when the music stops, where are you going to sit down? Without proprietary advantages in hardware or software, you're going to have to take the commodity price, which barely covers the cost of cheap factories in Guadalajara, not your cushy offices in Silicon Valley."
What? He also predicted how and why it would fail. Sun was a big enough player then that it could have survived plenty of other ways. Apple of today looks nothing like the company in 2000, but Sun got caught out more or less exactly as described and never adapted.
Agree. To attract devs to using your spec, then eventually to your platform. But since the specs are open, AMZN/MS can make their own too. So it is still a good thing.
I think the difference you observe relates directly to the difference between what Google does outside of Cloud Platform and what Amazon does outside of AWS.
If I recall correctly, it took Google a while to actually offer the boring stuff. For a while, you could get Google App Engine but you couldn't just get a dang VM, because Google knows better than you and you should do things their way. They've fixed it now, but lost a lot of potential market share for that conceit.
If you're evaluating something today, how does it change your decision that we were late to market with Compute Engine (and in this specific case "bring-your-own-kernel")?
If it's about future boring stuff, I think the list of boring stuff isn't too long ;).
All that given, the fact that Google itself doesn't extensively use Google Cloud is kind of a red flag (I know quite a few Googlers from search infrastructure and none of them said their teams used GCE internally).
A solid guarantee with AWS is that if AWS goes down, then a multitude of Amazon's services will also go down (ex-Amazonian myself), so it gives me a belief that AWS's uptime is more important to Amazon itself than it is for external customers.
Search Infra (and Ads for that matter) is an extreme case. Google Search might be one of the world's most highly tuned infrastructure projects: a marriage of code and hardware design to maximize performance, scoring, relevance and ultimately ROI.
Before we had custom machine types (November 2015 GA), we wouldn't have been remotely close to what they needed. I'm not even sure we've had anyone evaluate the amount of overhead KVM adds in either latency or throughput.
tl;dr: Don't let Search be your "not until they do it". We've got folks in Chrome, Android, VR, and more building on top of Cloud (as well as much of our internal tooling being on App Engine specifically).
"So" Google lost potential business for a while from people who wanted to spin up VMs rather than wanting to ship code to a proprietary execution framework.
I think you're agreeing: We certainly missed a huge segment of the market at the time, but now that we've got GCE new business can certainly come our way.
If I recall correctly myself (getting old!), Google has had the ability to build your own custom image since GCE went GA 2.5 years ago. Now, admittedly, it took a while to get IAM and VPC going, but we done did it now!
I'd love to hear what other boring stuff has been a showstopper for you, in case we missed something dumb :)
Not implying that AWS hasn't had them. It's just that adopting GCE this early makes you a bit of a guinea pig because GCE isn't used internally at Google.
No - tensorflow is open source and you can run it on many platforms. TPUs are about efficiency. You might not be able to do image recognition as efficiently without one, but you can still perform exactly the same tasks.
I would be shocked if TensorFlow optimizations were useful 1:1 for stock Intel chips or GPUs. So there is still plenty of lock-in even if your process runs. GPU vendors love to play this game by helping optimize games.
All of the comparable tools are practically locked into Nvidia GPUs/CUDA. TensorFlow is rapidly reaching performance parity on that hardware [1] and can now be run on this, so it's actually sort of the least locked-in framework.
It's possible, but I think that the majority of ML optimization as seen by a programmer using tensorflow is more about optimizing the balance of accuracy, training & inference speed, and memory use, and a lot of the solutions in this space are pretty hardware independent. There's an entire other type of optimization about, e.g., making conv2d insanely fast, but that's not something that a typical data scientist-type user deals with.
(To elaborate -- it's questions like "how deep should I make this convolution? Should I use tf.relu or tf.sigmoid? How many fully-connected layers should I put here, and how big should I make them?". These are really knotty deep learning design questions, but they're often h/w independent. Not always - we certainly have some ops on TF that we only support in CPUs and not on GPUs, for example - but often.)
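For what it's worth, here's a hedged sketch of what those hardware-independent knobs look like, written with today's tf.keras API (which postdates this thread); the depth, widths, and activations are arbitrary choices, and nothing here cares what chip eventually runs it:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),   # how deep? how wide?
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),      # relu vs. sigmoid here
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()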
1. The best price/performance is TensorFlow right now, so the best software choice is platform X.
2. Then in 2 years... well, we are using platform X, so TensorFlow is clearly the best option.
In other words once you pick conv2d, you tend to also stick with whatever conv2d is optimized for. Which also means HW vendors love to help optimize popular platforms.
No, it's just that Google has so many shiny things that come and go. If you build a business on shiny new things over and over again, just because they're shiny and new, you will probably fail.
Interesting. Plenty of work has been done with FPGAs, and a few have developed ASICs like DaDianNao in China [1]. Google though actually has the resources to deploy them in their datacenters.
Microsoft explored something similar to accelerate search with FPGAs [2]. The results show that the Arria 10 (the 20nm latest from Altera) had about 1/4th the processing ability at about 10% of the power usage of the Nvidia Tesla K40 (25W vs 235W), i.e. roughly 2-2.5x the performance per watt. Nvidia Pascal has something like 2-3x the performance with a similar power profile, which really bridges that gap in performance/watt. All of that also doesn't take into account the ease of working with CUDA versus the complicated development, toolchains, and cost of FPGAs.
However, the ~50x+ efficiency increase of an ASIC could be worthwhile in the long run. The only problem I see is that there might be limitations on model size because of the limited embedded memory of the ASIC.
Does anyone have more information or a whitepaper? I wonder if they are using eASIC.
You can build an ASIC with fast external memory, it adds to the cost but then you can handle larger models similar to a GPU. Software support is an issue but for deep learning applications there's no reason in principle you couldn't add support to TensorFlow etc for new hardware to make it simple for application developers to adopt. Movidius has announced that they're doing this and it's likely that other ML chip vendors will do the same.
External memory kills your power budget. Efficient processors, whether they be GPUs, CPUs, or ASICs, all gain efficiency by keeping data local to the computation. A chip focused on off-chip data is difficult to optimize power for if you don't have compute-bound problems.
If you read the academic literature, they talk about 100x-3000x energy savings with an ASIC vs a GPU. So in that light Google's 10x improvement sounds low, and could certainly fit an eASIC story.
Furthermore, eASIC has fixed wires, but logic is defined via SRAM configuration (AFAIK) and not fixed, so it could offer a level of programmability.
This is huge. If they really do offer such a perf/watt advantage, they're serious trouble for NVIDIA. Google is one of only a handful of companies with the upfront cash to make a move like this.
I hope we can at least see some white papers soon about the architecture--I wonder how programmable it is.
There's no way Google lets this leave their datacenters. Chip fabrication is a race to the bottom at this point. [1]
Google is doubling down on hosting as a source of future revenue, and they're doing that by building an ecosystem around Tensorflow.
What I think is interesting is how weak Apple looks. Amazon has the talent and money to be able to compete with Google on this playing field. Microsoft is late, but they can, too.
Where's Apple? In the corner dreaming about mythical self-driving luxury cars?
Apple's strength is in consumer (and to a lesser extent, developer) ecosystems; the cozy comfortable bubble you get when you're surrounded by everything Apple. Getting access to your stuff across multiple devices is virtually effortless and continually seamless, with almost no configuration required.
Whether that's good or not may be arguable, but it's certainly a selling point for many and I don't see Google or any other company's offerings approaching the same experience, and I suspect that's by design; they have to be more open and support all devices but that kinda dilutes everything. Apple will only get stronger in that aspect IMO.
I'm not sure I agree with you, long term (about the chips). I think that the value here is in the ecosystem. If Google can compete with CUDA, they'll be doing really well.
> There's no way Google lets this leave their datacenters. Chip fabrication is a race to the bottom at this point. [1]
I’d hope someone somewhere steals the blueprints and posts all of them publicly online.
The whole point of patents was that companies would publish everything, but get 20 years of protection.
But by now, especially companies like Google don’t do so anymore – and everyone loses out.
EDIT: I’ll add the standard disclaimer: If you downvote, please comment why – so an actual discussion can appear, which usually is a lot more useful to everyone.
There's little need for anyone to steal the blueprints. It's unlikely there's anything particularly "special" there other than identifying operations in Tensorflow that take long enough and are carried out often enough and are simple enough to be worth turning into an ASIC. If there's a market for it, there will be other people designing chips for it too.
Same misuse happens with copyright. Both were invented to foster publishing, not to create life-long monopolies. The lifetimes of copyright and patents need to be way shorter as well. Everyone builds on something that came before. It's impossible to build a better bike if you have to test drive on a street with patent mines.
Re EDIT: Downvotes should require a comment, or not be allowed at all.
Sure, but that's the deal. I'll buy the latest nVidia 1080 card as soon as I can but renting these custom chips per minute would be a way better option for me.
GPUs also have this nice side effect of being great at playing games on. Purely as a guess I'd think that the gaming market is bigger than the AI researcher market.
In a future where AI is everywhere, Nvidia hopes it can sell GPUs by the hundreds and thousands to large data centers. You can make a lot more money a lot faster selling your hardware this way, and Nvidia is very interested in it judging from how much they talked about it at their recent conference.
I would be surprised if they weren't working on their own specialised chips then, though Google have the advantage of already having the software specs to build for.
> Purely as a guess I'd think that the gaming market is bigger than the AI researcher market.
Machine learning isn't just targeting the AI researcher market though -- it's widely used by a huge number of companies, and of course, by many of Google's most important products. I would argue that those markets combined are larger than gaming.
Yup, I assume they're gonna keep them in house as a competitive advantage for a time. I doubt they'll do it forever; the most valuable part of NVIDIA's CUDA is the ecosystem, and I think Google knows that.
Quantum computers, OpenPower, RISC-V, and now this - I'm really liking Google's recent focus on designing new types of chips and bringing some real competition into the chip market.
It's an ASIC tuned for specific calculations; I'm sure it has better power consumption than general-purpose GPUs, the same way crypto-mining ASICs crush GPUs in terms of power efficiency.
There isn't much data yet but I'm also guessing they probably have access to much more RAM than NVidia cards and can process much bigger data sets
"General purpose" isn't that general, if you look at the actual operations they support and their threading model. It's already fairly optimized for these sorts of operations, and this amount of claimed headroom makes me suspicious.
Google has a lot of potential options that NVidia doesn't have. They can size their cache hierarchy to the task at hand. They can partition their memory space. They can drop scatter/gather. They can gang ALUs into dataflows that they know are the majority of machine learning workloads. They can partition their register file at the ISA level or maybe even drop it entirely. They can drop the parts of the IEEE 754 floating point spec they don't need and they can size their numbers to the precision they need.
The fact that I can compile arbitrary programs for the GPGPU means it is general purpose. NVIDIA isn't writing softmax or backprop into silicon as a CPU instruction.
Look at how much faster ASICs for bitcoin mining are than the GPU... orders of magnitude.
"Backprop" isn't even close to something that would be a "CPU instruction", it's an entire class of algorithm. It's like saying "calculus" should be a CPU instruction. Matrix multiplication & other operations, on the other hand, do neatly decompose into such instructions, which have been implemented by NVidia et al., since that's the core set of functionality they've been pushing for like a decade now.
Additional die space on additional functionality might hurt the power envelope (which is where the focus on performance / watt rather than performance kicks in) but it doesn't make your chips slower per se.
That was my impression too. ML under the hood was a lot of linear algebra, not very different than most shaders. But maybe Google decided to hardcode a few important ML primitives because the ROI was that good in terms of grabbing customers. Also they might have very large scale applications not found elsewhere that motivates this.
OK, I was obviously oversimplifying things, but my point is that, even though we can only speculate, it's clear that when you know the specific algorithms/math operations/memory layouts/applications you want to optimize for, you can create dedicated chips that do exactly that, quickly. That bitcoin miners are all dedicated chips and run circles around GPUs demonstrates exactly this fact.
Furthermore the fact that ML can be error tolerant means you also get to optimize certain floating point operations for speed or energy efficiency at the cost of accuracy. NVIDIA doesn't get to do this in their linear algebra support.
Bitcoin mining is an extremely well-defined task compared to machine learning. It remains to be seen how general these TPUs are in practice - whether they will support the neural network architectures common two years from now.
If they balance compute to memory better than GPUs, you could definitely see a 10x. GPUs have large off-chip memory and small caches (like 256KB). The cost of going to off-chip memory can be 1-2 orders of magnitude more than on-chip memory. You can certainly fit 4+MB on modern processors, but they likely bought designs from a company like Samsung because designing high-performance, low-power memory cells is tricky. I'm surprised they were able to keep things a secret.
I remember going to see the SGI Origin 2000 installation at University of Alaska, Fairbanks. Pretty neat, I seem to remember they had a Cray T3E 900 and some J vector systems there too.
Three generations ahead of Moore's law??? I really wonder how they are accomplishing this beyond implementing the kernels in hardware. I suspect they are using specialized memory and an extremely wide architecture.
Sounds like they also used this for AlphaGo. I wonder how far off we were on AlphaGo's power estimates. It seems everyone assumed they were using GPUs; sounds like they were not, at least not entirely. I would really LOVE for them to market these for general use.
It seems entirely reasonable to me, and there's good historical precedent for it. Let's look at SHA256 hashing as an example. The maximum number of hashes that the best GPU around can do is around 1 GHash/s. However, for the same cost, of around $600, specialized hardware can do around 5 THash/s. That's about five thousand times the performance/price. There's no reason that hardware that is super-specific to neural network computation can't similarly have large gains.
These are ASICs, Application Specific Integrated Circuits, emphasis on the SPECIFIC. It's a chip built specifically for TensorFlow. Anytime you build a chip to handle a specific application you are going to see a significant performance improvement. You can move into a new apartment using a Honda Civic, but you are going to see considerable performance improvement using a vehicle designed specifically for moving.
The rule of thumb I was taught was that going from a DSP/GPU to a custom ASIC would give you a 10X advantage in performance/power which is pretty close to this. And look at how much bitcoin mining ASICs out compete GPUs.
Generally speaking GPUs are already very good at running with float32s, usually much better than they are at using float64s in fact. The big advantages of using an ASIC are mostly on the storage side but they also allow you to get away with non IEEE floating point numbers that don't necessarily implement subnormals, NaN, etc.
I doubt FP hardware size is the limiting factor in their implementation (it's not in GPUs and definitely not in high perf CPUs). More likely they came up with higher level architectural tricks that let them specialize for machine learning (i.e. taking better advantage of locality in the application, etc).
From the article: "TPU is tailored to machine learning applications, allowing the chip to be more tolerant of reduced computational precision, which means it requires fewer transistors per operation."
In addition, there's a lot of literature on optimizing hardware implementations of fundamental arithmetic operations like addition and multiplication. I recall seeing a paper a while ago which talked about reducing the number of gates by allowing some bounded imprecision in the results - unfortunately, I don't remember the title right now, but it sounds like that's what they may be doing.
The article confirms that "AlphaGo was powered by TPUs in the matches against Go world champion, Lee Sedol, enabling it to "think" much faster and look farther ahead between moves."
Now this is really interesting. I've been asking myself why this hadn't happened before. It's been all software, software, software for the last decade or so. But now I get it. We are at a point in time where it makes sense to adjust the hardware to the software. Funny how things work. It used to be the other way around.
This is known as the Wheel of Reincarnation. Functionality moves to special-purpose hardware then back to software, and the cycle repeats. (The computer term is from 1968 so this has been happening for a long time.)
I've been thinking for a while that with the end of silicon shrinkages, we may start seeing the final cycles of the wheel, with a final stop at mostly specialized hardware for greater power efficiency.
Not only power efficiency but software efficiency. Custom chips combined with DSLs are a powerful combination. At the expense of segmentation, of course.
At some point in the past, Rob Pike mentioned that when he was working on Voyager (the spacecraft that, almost 40 years after launch, has left the solar system and continues to send back valuable science data), he had a relatively good understanding of the system from the quantum level (transistors are based on quantum theory) to the solar system. He wasn't kidding, either.
Very interesting fact. But the average programmer is not Rob Pike. How do you see this panning out for the average programmer? Will people need to learn a bit about chips to build more efficient CRUD apps?
I'm an average programmer, but reading "High Performance Computing" http://shop.oreilly.com/product/9781565923126.do made a huge difference for me even when writing more efficient CRUD apps. A big issue is understanding the cost of the operations in the stack you are using and the overheads caused by your stack.
I think there will always be an API. In the case of Deep Learning you already see Caffe, TensorFlow, etc. readily available for developers.
I don't think the average developer will need to understand chip design but I do think _many_ developers will need to know how to use deep learning frameworks.
It has definitely started as an API-side detail. But it might become an industry thing. It allows very easy vendor lock-in for all the *aaS products (PaaS, SaaS, etc.). You could be required to have an add-on for your server or machine to be able to use their product.
I dunno... I'm not Rob Pike, but I could basically describe computing from the quantum level to the solar system dynamics and electromagnetic influences of a robotic interplanetary craft.
Then again, I've been coding for ~25 years, am an avid amateur astronomer, and have a degree in physics, so maybe the moral here is that some things just take more time to master, beyond those first 6 weeks in a coding camp learning how to put together your first jQuery.
Interesting, I hadn't heard of this term before but it sounds about right!
Maybe the next iteration will be assembly instructions specifically for Neural Nets built into CPUs...actually something like a convolution assembly instruction wouldn't even surprise me at this point.
A podcast I listen to posted an interview with an expert last week saying that he perceived that much of the interest in custom hardware for machine learning tasks died when people realized how effective GPUs were at the (still-evolving-set-of) tasks.
I wonder how general the gains from these ASICs are and whether the performance/power efficiency wins will keep up with the pace of software/algorithm-du-jour advancements.
I listen to the Talking Machines as well. Great podcast. Another question would be are the gains worth the cost of an ML-specific ASIC. GPUs have the entire, massive gaming industry driving the cost down. I suppose that as adoption of gradient-descent-based neural networks increases, it may be worth the cost in a similar way that GPUs are worth the cost. Then again, I have never implemented SGD on a GPU so I'm not aware if there are any bottlenecks that could be solved with an ML-specific ASIC. Can anyone else shed some light on this?
Per-unit manufacturing cost scales logarithmically. Even a single batch of custom silicon on yesterday's technology is only $30K. This is one of the reasons there is so much interest in RISC-V; hardware costs are not the barrier-to-entry that they used to be.
So yeah, the gaming market pushes the per-unit price of GPUs down, but even an additional 2x reduction in rackspace and power will pay for itself at the right scale.
Somewhat off topic, but if you look at the lower-left hand corner of the heatsink in the first image, there's two red lines and some sort of image artifact.
For the curious, that's a plaque on the side of the rack showing the Go board at the end of AlphaGo vs Lee Sedol Game 3, at the moment Lee Sedol resigned and AlphaGo won the tournament (of five games).
That's what I'm thinking. I was anticipating the release of GPU instances, but now I'm thinking that they will simply leapfrog over GPU instances straight to this.
I'm guessing that the performance / watt claims are heavily predicated on relatively low throughput, kind of similar to ARM vs Intel CPUs - particularly because they're only powering it & supplying bandwidth via what looks like a 1X PCIE slot.
IOW, taking their claims at face value, a Nvidia card or Xeon Phi would be expected to smoke one of these, although you might be able to run N of these in the same power envelope.
But those bandwidth & throughput / card limitations would make certain classes of algorithms not really worthwhile to run on these.
> IOW, taking their claims at face value, a Nvidia card or Xeon Phi would be expected to smoke one of these, although you might be able to run N of these in the same power envelope.
That seems unlikely, right? GPGPU software beating an ASIC? I guess it depends on just how abstract and adaptable the TPU IC is. Sounds like they use a custom number representation -- if they can squeeze it into less than half precision (IEEE FP16) then it would be super hard for a Phi or GPU to beat it.
But ultimately it all comes down to specific applications' bottlenecks. If they don't have to go off-card ever and their workset fits into the on-die memory, then they'd have no real advantage by using a PCIe GPU with GDDRx on it.
> I'm guessing that the performance / watt claims are heavily predicated on relatively low throughput, kind of similar to ARM vs Intel CPUs - particularly because they're only powering it & supplying bandwidth via what looks like a 1X PCIE slot.
Agreed. Also tells you that they don't need to communicate with the CPU much, given that it only has a PCIE. Reminds me of Knights Ferry, in this respect.
> a Nvidia card or Xeon Phi would be expected to smoke one of these
Will be very interesting to see some head-to-head benchmarks between these guys (on tensorflow and other libraries) in the next few months. Especially as Knights Landing starts to appear, and the new Nvidia card.
Given the insane mask costs for lower geometries, the ASIC is most likely a Xilinx EasyPath or Altera HardCopy. Otherwise the amortization of the mask and dev costs -- even for a structured-cell ASIC -- over 1K units wouldn't make much sense versus the extra cooling/power costs for a GPU.
Don't forget shuttle runs. Adapteva used those and otherwise good engineering practice to develop two products, the latest in 65nm, with no more than $2 million. This one might be simpler and cheaper given its requirements.
I'd imagine, since they'd want to squeeze every bit of performance per watt out of the chip, they'd want to go for the smallest node possible. Virtex-7 EasyPath is 16nm! It is pin-to-pin compatible with the FPGA version -- because they just change the mask layer and you get it in about 6 weeks. Hard to beat that.
Yes. It is very popular for high-end FPGA use cases. Essentially instead of relying on the built-in routing matrix which makes FPGAs what they are, they modify the metal layer and connect the chip per your design as an ASIC. You get much lower power consumption and much faster clock rates. You also get it in about 6-8 weeks and is guaranteed to match your original design in functionality. It is 100% pin compatible because it is the same base silicon and packaging.
Cool stuff. Previously, I was looking at eASIC or Triad if I needed this cuz I thought FPGA people cancelled S-ASIC's. Good to know there's a high-end one from Xilinx. Here's the others in case you didn't know about them:
eASIC has a maskless capability where they straight-up print your silicon for prototyping/testing. Triad brought S-ASIC's to analog/mixed-signal. They're top players. eASIC's basic prototyping was $50k for 50 chips on older ones. Idk now. Triad I heard is $400k flat. Need a price quote to be sure. ;)
Yeah, Triad is still in startup mode and picky. eASIC has been around quite a while. They also have ezCopy or something to produce ASIC's from their S-ASIC's. A side benefit is there's lots of pre-tested IP, including Gaisler OSS CPU.
So, worth considering. I need to get numbers on Xilinx, though, in terms of pricing and royalties. Esp if they have something for 28nm, 45nm, or 65nm that will be significantly cheaper than other one.
Unless they're doing 100K+ units over the 2-3 year life of a chip, it makes sense to stay with EasyPath or Altera HardCopy. It is much easier and more flexible for making revisions.
Consider that they will most likely make another version, with new features in 2 years time. I'm sure the users of the chip will want that, just like any other system.
I wonder if we will be seeing more of this in the (near) future. I expect so, and from more people than just Google. Why? Look at the problems the fabs have had with the latest generation of chips, and as transistors grow smaller the problems will probably increase. We are already close to the physical limit of transistor size. So it is fair to assume that Moore's law will (hopefully) not outlive me.
So what then? I certainly hope the tech sector will not just leave it at that. If you want to continue to improve performance (per watt), there is only one way you can go: improve the design at the ASIC level. ASIC design will probably stay relatively hard, although there will probably be some technological solutions to make it easier with time, but if fabrication stalls at a certain nm level, production costs will probably start to drop with time as well.
I've been thinking about this quite a bit recently because I hope to start my PhD in ~1 year, and I'm torn between HPC or Computer Architecture. This seems to be quite a pro for Comp. Arch ;).
I don't know much about this sort of thing, but I wonder if the ultimate performance would come from co-locating specialized compute with memory, so that the spatial layout of the computation on silicon ends up mirroring the abstract dataflow DAG, with fairly low-bandwidth, energy-efficient links between static register arrays that represent individual weight and grad tensors. Minimize the need for caches and power-hungry high-bandwidth lanes; ideally the only data moving around is your minibatch data going one way and your grads going the other way.
I wonder if they're doing that, and to what degree.
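Here's a toy sketch of that "layout mirrors the dataflow" idea: a weight-stationary chain of multiply-accumulate cells, each one only ever touching its locally stored weight and the partial sum handed over by its neighbour. Purely conceptual, not a claim about Google's actual design:

    import numpy as np

    weights = np.array([0.5, -1.0, 2.0, 0.25])   # one weight parked in each cell

    def cell(local_weight, x_in, partial_in):
        # multiply the passing input by the resident weight, add the
        # neighbour's partial sum; no global memory traffic involved
        return partial_in + local_weight * x_in

    def chain(x):
        partial = 0.0
        for w, xi in zip(weights, x):            # partial sums flow down the chain
            partial = cell(w, xi, partial)
        return partial

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(chain(x), "==", float(weights @ x))    # same dot product, computed in place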
2. It works really well (or else they wouldn't be using it so broadly)
3. Considering how long this was said to be in development, it also likely means they are working on the next big improvement before these guys have even gotten the current one working.
IBM's TrueNorth chip is taking a much more neuromorphic design approach by trying to approximate networks of biological neurons. They are investigating a new form of computer architecture away from the classic Von Neumann model.
TPUs are custom ASICs that speed up math on tensors i.e. high-dimensional matrices. Tensors feature prominently in artificial neural networks, especially the deep learning architectures. While GPUs help accelerate these operations, they are optimized first and foremost for video rendering/gaming applications -- compute-specific features are mostly tacked on. TPUs are optimized solely for doing ML-related computations.
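For concreteness, the bulk of that tensor math is dense linear algebra; a fully connected layer, for example, is just a matrix multiply plus a bias and a nonlinearity (plain NumPy, for exposition only):

    import numpy as np

    rng = np.random.default_rng(0)
    batch = rng.normal(size=(64, 784))        # 64 input examples
    W1 = 0.01 * rng.normal(size=(784, 256))   # layer weights
    b1 = np.zeros(256)

    hidden = np.maximum(0, batch @ W1 + b1)   # matmul + bias + ReLU
    print(hidden.shape)                       # (64, 256)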
They are already using it in production and have been for over a year? That is a huge difference compared to what people are working on and building. They beat Lee Sedol using this hardware. Over 100 teams at Google are using machine learning. This hardware already accelerates their projects.
What are the capabilities that a piece of hardware like this needs to have to be suitable for machine learning in general (and not just one specific machine learning problem)?
It is interesting how malleable neural networks are:
- you can drop half the connections during training and it still works; in fact it works even better
- you can represent the weights with as little as one bit, but still use real numbers for computing activations (see the sketch below)
- you can insert layers and extend the network
- you can "distill" a network into a smaller, almost as efficient network or an ensemble of heavy networks into a single one with higher accuracy
- you can add a fixed weights random layer and sometimes it works even better
- you can enforce sparsity of activations and then precompute a hash function to only activate those neurons that will respond to the input signal, thus making the network much faster
It seems the neural network is a malleable entity with great potential for being made faster on the algorithmic side. They got a 10x speedup mainly by exploiting a few of these ideas, instead of making the hardware 10x faster. Otherwise, they wouldn't have made it the size of an HDD - they would need much more ventilation in order to dissipate the heat. It's just specialized hardware taking advantage of the latest algorithmic optimizations.
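As a toy sketch of the one-bit-weights item above (in the spirit of BinaryConnect-style methods; the numbers are made up and no training is shown):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.5, size=(128, 64))   # real-valued learned weights
    x = rng.normal(size=128)                    # real-valued activations

    alpha = np.abs(W).mean()                    # per-layer scaling factor
    W_bin = np.sign(W)                          # each weight stored as +/-1 (one bit)

    y_full = x @ W
    y_bin = alpha * (x @ W_bin)                 # activations stay real-valued

    corr = np.corrcoef(y_full, y_bin)[0, 1]
    print(f"correlation between full-precision and 1-bit outputs: {corr:.2f}")

With retraining against the binarized weights the gap closes much further, which is what makes such aggressive quantization usable in practice.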
Sacrifice generality, accuracy, and the ability to randomly access a lot of memory so that you can implement fast, power-efficient matrix operations with a single low-accuracy datatype, thus requiring less memory, less bandwidth, and fewer transistors.
I'm thinking that this has the potential to change the context of many debates about the "technological singularity", or AI taking over the world. Because it all seems to be based on FUD.
While reading this article, one of my first reactions was "holy shit, Google might actually build a general AI with these, and they've probably already been working on it for years".
But really, nothing about these chips is unknown or scary. They use algorithms that are carefully engineered and understood. They can be scaled up horizontally to crunch numbers, and they have a very specific purpose. They improve search results and maps.
What I'm trying to say is that general artificial intelligence is such a lofty goal that we're going to have to understand every single piece of the puzzle before we get anywhere close. Including building custom ASICs, and writing all of the software by hand. We're not going to accidentally leave any loopholes open where AI secretly becomes conscious and decides to take over the world.
Just because you understand a machine doesn't mean it can't be dangerous. I could completely understand every aspect of a nuclear bomb, and I could still make a mistake and cause quite a bit of damage with a real one. Complex systems are notorious for having all sorts of unexpected problems, and mistakes happen all the time. How much complex software is entirely bug free?
The danger of AI is more than just a random bug though. It's that an intelligent AI is not inherently good. If you give it a goal, like making as many paperclips as possible, it will do everything in its power to convert the world to paperclips. If you give it the goal of self-preservation, it will try to destroy anything that has a 0.0001% chance of hurting it, and make as many redundant copies of itself as possible. Etc.
Very, very few goals actually result in an AI that wants to do exactly what you want it to do. And if the AI is incredibly powerful, that will be a very bad outcome for humanity.
You can condition the AI on the well-being and freedom of the human population. It's hard to define precisely what that means, but it can be approximated with indirect measures. This is just what Asimov thought of in his novels.
Another way to protect against catastrophe would be to launch multiple AI agents that optimize for the goal of nurturing humanity. They can keep each other in check.
Also, humans will evolve as well. Genetics is advancing very fast. We will be able to design bigger/better brains for ourselves, perhaps also with the help of AI. Human learning could be assisted by AIs to achieve much higher levels than today.
We will also be able to link directly to computers and become part of their ecosystem, thus, creating an incentive for it to keep us around. Taking this path would enable uploading and immortality for humans as well.
My guess is that we will all become united with the AI. We are already united by the internet and we spend a lot of time querying the search engine (AI), learning its quirks and, by feedback, helping improve it. This trend will continue up to the point where humans and AI become one thing. Killing humans would be, for the AI, like cutting out a part of your own brain. Maybe it will want a biological brain of its own and come over to the other side, of biological intelligence.
We don't know how to "condition" an AI to respect the well being and freedom of humanity. It's an extremely complicated problem with no simple solutions. Making an AI that wants to destroy humanity, however, is quite straightforward. Guess which one will most likely be built first?
Building multiple AIs doesn't solve anything. They can just as easily cooperate to destroy humanity as to help it.
Uploading humans won't be possible until we can already simulate intelligence in computers. We can't have uploads before AI.
Now these heatsinks can be deceiving for boards that are meant to be in a server rack unit with massive fans throwing a hurricane over them, but even then that is not very much power we're looking at there.
The Cloud Machine Learning service is one that I'm highly anticipating. Setting up arbitrary cloud machines for training models is a mess right now. I think if Google sets it up correctly, it could be a game changer for ML research for the rest of us. Especially if they can undercut AWS's GPU instances on cost per unit of performance through specialized hardware. I don't think the coinciding releases/announcements of TensorFlow, Cloud ML, and now this are an accident. There is something brewing and I think it's going to be big.
It doesn't just need to be a place to experiment cheaply. Many companies building software around ML techniques still rent time on EC2. Unless you are training models 24/7 and have your machines located in a very cost-efficient location in terms of power/cooling, it's probably better for your training to be done in the cloud. I think very few use cases fall into that 24/7 category.
Maybe they play one move every time someone gets to go there to fix something? or could it be just a way of numbering the racks or something eccentric like that?
It is interesting that they would make this into an ASIC, given how notoriously high the development costs for ASICs are. Are those costs coming down? If so, life will get very hard for the FPGA makers of the world soon.
It would be interesting to see what the economics of this project are, i.e., what the development costs and cost per chip are. Of course it is very doubtful I will ever get to see the economics of this project, but it would be interesting.
Making a custom ASIC is expensive for a startup, but not that expensive. Many companies are making their own high-end chips for things like signal processing.
It's mainly a question of volume. ASICs have a big economy of scale, so the cost-per-chip goes down considerably once you go over a certain number of chips. Plus, there are all the NRE (non-recurring engineering) costs of an ASIC design over an FPGA design. Google probably figured they could use enough chips to make the cost of manufacturing ASICs worthwhile.
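A back-of-the-envelope sketch of that volume argument, with entirely made-up numbers:

    asic_nre = 2_000_000      # hypothetical one-time design/mask/verification cost ($)
    asic_unit = 50            # hypothetical per-chip cost at volume ($)
    fpga_unit = 2_000         # hypothetical cost of a comparable high-end FPGA ($)

    break_even = asic_nre / (fpga_unit - asic_unit)
    print(f"ASIC pays off above ~{break_even:,.0f} units")   # ~1,026 units here

Google deploying these across its fleet is presumably well past whatever the real break-even point is.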
I don't think FPGAs are going to be beat out by ASICs for low volume applications anytime soon.
But even a custom run of silicon (on yesterday's technology) will only set you back $30K. That's one of the reasons there is so much interest in RISC-V.
I think the initial investment was cheaper than the projected GPU energy cost and if it is actually faster, then there are advantages for faster iteration.
I'd be interested to know more technical details. I wonder if they're using 8-bit multipliers, how many MACs running in parallel, power consumption, etc.
I wouldn't be surprised if Google is looking to build (or has done so already) a highly dense and parallel analog computer with limited-precision ADCs/DACs. I mean that's simplifying things quite a bit, but it would probably map pretty well to the TensorFlow application.
Computing with opamp primitives has somewhat fallen out of favor since you've been away..
The primary cost factor in computing anything at scale is power. You can always buy more of the machines, but power is the ongoing cost, and every watt spent on computation costs you extra watts in cooling.
Analog circuits run at full-speed (no clock), use little power, and take up little space. An analog computer directly implements the mathematical function it represents as circuits instead of emulating it on a von Neumann-like machine. Long story short, they have issues that made people go digital. Yet, if you can use analog, you can get significant advantages. Example for math acceleration:
The brain is a bunch of components spread out in 3D that operate like a mathematical function at slow speed. That mostly sounds analog, and the result is us. So a huge spread of analog components could get somewhere by directly simulating something like that. Here's one of my favorites: a wafer-scale analog computer for neural networks.
If you have an application that can tolerate error (like classification), then analog computing can give enormous gains in terms of speed _and_ power efficiency. Essentially, the savings come from using physics to perform the math (see Kirchhoff's current law) instead of stepping through discrete time steps or fully unrolling the logic. Google may not be using analog processing for this version, but I read the page of an analog neural-network researcher who said he moved to Google last year. (Sorry, I can't find the page again, but I think he was from the UK.)
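As a toy illustration of "physics doing the math": at a summing node held at virtual ground, Kirchhoff's current law adds the branch currents for free, so a resistor network computes a weighted sum (a dot product) directly. A minimal numerical sketch of that idea, with made-up component values:

```python
import numpy as np

# Inputs encoded as voltages, weights as conductances (values are arbitrary examples).
voltages = np.array([0.3, 0.8, 0.1])           # input "activations" (V)
conductances = np.array([2e-3, 5e-4, 1e-3])    # "weights" (siemens)

# Per branch, I = G * V; Kirchhoff's current law sums the currents at the node
# (assuming an op-amp holds the node at virtual ground), giving a dot product.
node_current = np.dot(voltages, conductances)
print(f"Summed current: {node_current * 1e3:.3f} mA")
```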
True. ML applications are well-suited to analog computation not only because they can tolerate errors; they can also adapt to errors, provided the training algorithms are run in hardware.
For the curious, Optalysys has built a general purpose optics-based correlation/pattern matching machine. From some of their predecessor-company marketing material: The correlator performs pattern matching on large data sets such as high-resolution images, providing a measure of similarity and relative position between objects within the input scene. This allows large images [and general data converted to images] to be analysed far faster than electronic equivalents.
Going back to the topic of NN-based computing, I found this talk to be intriguing: https://www.youtube.com/watch?v=dkIuIIp6bl0. The main argument is that because Moore's law may no longer be in effect, it will become increasingly important to explore alternate computing solutions. (Google's TPU could be supporting evidence for this argument.) The speaker also co-authored a paper which I liked "General-Purpose Code Acceleration with Limited-Precision Analog Computation".
This video was very cool. Are there any ICs on the market now that can perform analog computing for neural networks? I'm picturing something like an FPGA but with a bunch of op-amps that you can connect into summers or amplifiers.
If not, how would one practically implement an analog computer for neural network programming (without several tables full of op-amps?)
You can implement an analog neural network yourself using a Field Programmable Analog Array. (I've never done it, but you'll see academics online writing papers about it.)
Another thing that is sort of related is Lyric Semiconductor; they built these cool application-specific probabilistic processors; they were purchased by Analog Devices a while back.
I'm curious to know; is this announcement something that an expert in these sorts of areas could have (or did?) predict months or years ago, given Google's recent jumps forwards in Machine learning products? Can someone with more knowledge about this comment?
Yes. Back in 1989, Intel built one of the first custom chips for ML (ETANN). Since then, there have been hundreds of such designs, and dozens of working chips. So, people have been building them long before "deep learning" arrived in all its glory. Now that it did, it was expected that efforts in custom hardware should intensify.
However, such initiatives still face the same problems as 30 years ago: custom hardware is expensive, inflexible, hard to program, and quickly becomes obsolete. It's still worth it if there's no other way to speed things up, but Moore's law is still alive and kicking, as evidenced by the 15B-transistor GP100 chip from Nvidia, so we can still just wait a little bit for the next-gen GPUs.
Google is certainly in a good position, having developed a very popular ML framework, and having enough resources to develop good hardware (the blog post was written by Norm Jouppi - one of the best computer architects in history). It remains to be seen, however, how well these TPUs are supported in TensorFlow. What kind of models will get the advertised speed up?
Sure. I don't have anything to link on the spot, but this was/is/has been foreseeable for some time. Although it's all very cool and shiny, most business applications of machine learning remain squarely in the territory of classic algos like GLMs and forests (random forests, boosted trees, etc.). As a fun note, advances like these highlight that data scientists will not be beaten by more complex automated methods, but simply by speed. Much like the filing system that 'runs' whatever you're using to see these words (https://www.youtube.com/watch?v=EKWGGDXe5MA).
Edit: to elaborate... single model training runs can be done quite fast now, but knowing how to tune hyperparameters remains the 'voodoo' of the field. But the best hyperparameters can also be discovered through brute force: try every combination you can! Today you can use various heuristics to improve the process, but either way, being able to train whatever X times faster just means we can search the hyperparameter space that much faster. The robots are coming :)
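A concrete (and entirely hypothetical) sketch of that brute-force search; the grid values and the scoring function are made-up stand-ins, not any particular library's API:

```python
import itertools

# Hypothetical hyperparameter grid -- names and values are for illustration only.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [128, 256, 512],
    "dropout": [0.0, 0.5],
}

def train_and_score(params):
    # Stand-in for a real training run; returns a fake validation score.
    return -abs(params["learning_rate"] - 1e-3) - 0.01 * params["dropout"]

best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = train_and_score(params)   # 10x faster training => 10x more of these runs
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```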
They could run 10x more experiments for the same cost and experiment with many more configurations, but soon enough there will probably be an algorithm that can do the same on a single computer. I am waiting for the moment neural networks become as good as people at designing neural networks.
On the energy savings and space savings front, this type of implementation coupled with the space-saving, energy-saving claims of going to unums vs. float should get it to the next order of magnitude. Come on, Google, make unums happen!
Is there any way to detect what hardware is being used by the cloud service when you're using it? (Yes, I realize this question is a bit of a paradox, but I figured I'd ask.)
Another point is that they will be able to provide much higher computing capabilities at a much lower price point than any competitor. I really like the direction the company is taking.
I wonder if opening this up as a cloud offering is a way to get a whole bunch of excess capacity (if it needs it for something big?) but have it paid for.
That they're used a lot in machine learning? If you're processing video for instance, you might have a 5-dimensional tensor: x, y, color channel, time index and batch index.
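For instance, a video mini-batch in NumPy might carry exactly those five axes (the sizes here are arbitrary):

```python
import numpy as np

# A toy 5-D tensor for a video mini-batch; the sizes are made up for illustration.
batch, time, height, width, channels = 8, 16, 64, 64, 3
video = np.zeros((batch, time, height, width, channels), dtype=np.float32)
print(video.ndim, video.shape)   # 5 (8, 16, 64, 64, 3)
```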
I think the confluence of new technologies, and the re-emergence / rediscovery of older technologies is going to be the best combination. Whether it goes that way is not certain, since the best technology doesn't always win out. Here, though, the money should, since all would greatly reduce time and energy in mining and validating:
* Vector processing computers - not von Neumann machines [1].
* Array languages new, or like J, K, or Q in the APL family [2,3]
* The replacement of floating point units with unum processors [4]
Neural networks are inherently arrays or matrices, and would do better on a purpose-designed vector/array machine, not a re-purposed GPU, or even the TPU in the article sitting inside a standard von Neumann machine.
Maybe a non-von Neumann architecture like the old Lisp Machines, but for arrays, not lists (and no, this is not a modern GPU: the data has to stay on the processor, not be offloaded to external memory).
I started with neural networks in the late '80s and early '90s, mainly programming in C: matrices and for loops. Unfortunately, I only found J, the array language, many years later.
Businesses have been making enough money off the advantages of the array-processing language A+, and then K, that the per-seat cost of KDB+/Q (database/language) is easily justifiable. Other software like RiakTS is looking to get in the game using Spark/Shark and other pieces of kit, but a K4 query is 230 times faster than Spark/Shark and uses 0.2GB of memory vs. 50GB. The similar technologies just don't fit the problem space as well as a vector language.
I am partial to J being a more mathematically pure array language in that it is based on arrays. K4 (soon to be K5/K6) is list-based at the lower level, and is honed for tick-data or time series data. J is a bit more general purpose or academic in my opinion.
Unums are theoretically more energy efficient and compact than floating point, and take away the error-guessing game. They are being tested with several different language implementations to validate their creator's claims, and practicality. The Mathematica notebook that John Gustafson modeled his work on is available free to download from the book publisher's site.
People have already done some exploratory investigations in Python, Julia, and even J. I believe the J one is a 4-bit implementation based on unums 1.0. John Gustafson just presented unums 2.0 in February 2016.
At this point in time, I think that the Python/Numpy stack offers the best performance, productivity, and expressiveness trade-off. With the [Numba](http://numba.pydata.org) just-in-time compiler, you can now easily bounce between numeric SIMD codes that leverage tuned BLAS/MKL, then go into more explicit loop-oriented constructs that perform equivalently to hand-coded C, while still being Python. If I were starting anew, it would be hard to justify investing in a big J/K/Q code base or team, despite the potential performance benefits.
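A minimal sketch of that bouncing between vectorized NumPy and an explicit Numba-compiled loop (assuming NumPy and Numba are installed; the function is deliberately trivial):

```python
import numpy as np
from numba import njit

@njit
def dot_loop(a, b):
    # Explicit loop; Numba JIT-compiles it to machine code roughly on par with C.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(dot_loop(a, b))   # loop-oriented, JIT-compiled path
print(a @ b)            # vectorized path, dispatched to BLAS/MKL
```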
I agree with your overall point that we're seeing a confluence of factors. The advances in compiler technology, combined with the vectorial nature of the problems that are interesting to solve in an era of big data, mean that we can achieve a great deal of productivity by using high-level vector-capable languages.
You may be right. I can't argue with Python's ubiquity; I have even steered my son in that direction, but with a hitch: I still had him learn some J.
The creator of Pandas, Wes McKinney, had a link up a few years back mentioning he was looking for people who were familiar with APL, J or K. It seems he was working on a new project/startup I think (could this have been the shuttered DataPad?). The links are dead now, but I will double check.
If the creator of Pandas is/was eyeing the older APL, and its newer brethren, I'd say it's a safe bet to keep J or K or Q on your radar because they fit. They're vector/array based; they are fast and iterative with a REPL; there is a lot of mathematical formalism in their origins and usage throughout the years, yet they are more beginner-friendly than say Haskell IMHO. I like Haskell too!
There's a big difference between just cranking down the GHz and going for ASIC specialization. GPUs, for example, already represent a point on this spectrum -- it's true that they run at reduced GHz compared to high-end CPUs, but they're arithmetic monsters. The blog post notes, in fact, that the use of TPUs in AlphaGo let them do more searching. So why would you assume automatically that they're slow?
Because there's an absolute sippy straw of bandwidth to the thing if that's an x1 PCIe connection.
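Rough numbers to make that concrete (assuming it really is a single lane; the model size below is an arbitrary example): a PCIe 3.0 x1 link moves roughly 1 GB/s per direction, and PCIe 2.0 only about half that.

```python
# Back-of-the-envelope only: the link speed is the nominal per-lane figure for
# PCIe 3.0, and the model size is an arbitrary example.
pcie3_x1_bytes_per_s = 0.985e9   # ~985 MB/s per direction for a single PCIe 3.0 lane
model_bytes = 100e6              # e.g. a 100M-parameter model at 1 byte per 8-bit weight

seconds = model_bytes / pcie3_x1_bytes_per_s
print(f"~{seconds * 1e3:.0f} ms just to stream the weights across the link once")
```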
Because if it were delivering performance on par with a $1000 Maxwell-class GPU, why wouldn't you guys crow about it? That would be a really big deal, wouldn't it? A Titan X for 20W? That'd be awesome.
And having suffered through multiple pitches for us to buy various FPGA and boutique processors, I have yet to see someone who produced perf per watt numbers first, subsequently produce an impressive performance number. In fact, it took nearly yelling at one vendor for them to finally admit perf was abysmal.
Finally, training does not equal inference. Training requires strong scaling, but inference only needs to scale weakly. So I suspect that Urs had to bite his tongue and buy a bunch of GPUs for training networks.
That wasn't really my point - I'm simply noting that Urs is one of the last people I'd expect to hop on the wimpy-core crazy train. His published articles suggest that he has a very good grasp of the tradeoffs involved in "real" TCO -- i.e., taking a fairly global view of the human, capital, and operating expenses involved in a technology decision such as using wimpy cores (no) or fabricating a custom ASIC for machine learning (yes).
That doesn't mean a TPU is faster or slower than anything in particular; it just means it's quite likely good for some machine learning tasks that Google cares enough about to spend whatever dollars it cost to make the thing.
Apples and oranges. And as it turned out, it's an 8-bit fixed point processor. Useless for training without heroic handholding, limited use for inference. Google's PR machine does it again.
Why use an ANKY in the title? Using an ANKY (acronym no one knows yet) is bad writing; it makes readers feel dumb, etc. Google JUST NOW invented that acronym, and sticking it in the title like just another word we should understand is absolutely ridiculous.
In what sense is this great news? Yes, it's progress, so what? After all, you - programmers - earn money from your jobs, and pretty soon you might not have one. Because of these kinds of great news -- "Whoa, this is really interesting, AI, machine learning!"
"I'll get fired, won't have money for living and AI will take my place, but the world will be better! Yes! Progress!"
Who will benefit from this? Surely not you. Why are you so ecstatic then?
I thought long and hard about the future when 95% of the jobs as they exist today will be taken by robots. Initially, it would seem that people would be reduced to beggars, as they depend on BHI or other forms of welfare to subsist. But then it struck me:
You don't need jobs as long as you have land, renewable energy sources and robots (and 3d printers). You can live in a community that is self sufficient. You will be employed by your land, as it always was up until 100 years ago. We will also have robots, maybe not the latest generation, but we don't need to go back to the 19th century agriculture.
It is you who will benefit in the end, if you can use AI to improve your life. As long as AI doesn't remain locked in the hands of one entity and we all share into the benefits, it will work out ok. In the short run we need some sort of social welfare though, and to invest in renewables and self-sufficiency technologies.
How self-sufficient could a country, city, village, or small farm be? There is a lot of potential to migrate back to a small-community agrarian economy with robotics, 3D printing, and solar panels.
<speculation>People could trade using a different currency than that used for robotic produced goods. This currency will have to enforce differentiation of economic agents (diversity) and integration (low barriers of entry). A currency that will automatically disable the accumulation of power in a few hands and work for humans. We have to build an economy that functions more like the brain. In the brain there is no master neuron. They all share in the activity. So should be an enlightened human society.</speculation>
Yes, it could be. But it won't necessarily be, and there's no guarantee that you and other people in the middle or lower classes will be given the luxury of having a robot and not having to work.
Maybe the low and middle classes will have to serve the people in the upper class, who will have the robots.
The jobs that can be lost will be lost, and should be lost. Prolonging the process doesn't help anyone either. The transition can be hard and painful, though, so speeding it up is all the better.