Great news all-around for deep learning practitioners.
Nvidia says memory bandwidth is "480GB/s," which will probably have the most impact on deep learning applications (lots of giant matrices must be fetched from memory, repeatedly, for extended periods of time). For comparison, the GTX 1080's memory bandwidth is quoted as "320GB/s." The new Titan X also has 3,584 CUDA cores (1.4x the GTX 1080's count) and 12GB of RAM (1.5x the GTX 1080's).
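Back of the envelope (a quick sketch in Python; the GTX 1080 reference figures of 2560 cores and 8GB are its published specs, not quoted above):

    # Rough ratios implied by the numbers above
    titan_x  = {"cuda_cores": 3584, "memory_gb": 12, "bandwidth_gbs": 480}
    gtx_1080 = {"cuda_cores": 2560, "memory_gb": 8,  "bandwidth_gbs": 320}

    for spec in titan_x:
        print(f"{spec}: {titan_x[spec] / gtx_1080[spec]:.2f}x the GTX 1080")
    # cuda_cores: 1.40x, memory_gb: 1.50x, bandwidth_gbs: 1.50x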
We'll have to wait for benchmarks, but based on specs, this new Titan X looks like the best single GPU card you can buy for deep learning today. For certain deep learning applications, if properly configured, two GTX 1080's might outperform the Titan X and cost about the same, but that's not an apples-to-apples comparison.
A beefy desktop computer with four of these Titan X's will have 44 Teraflops of raw computing power, about "one fifth" of the raw computing power of the world's current 500th most powerful supercomputer.[1] While those 44 Teraflops are usable only for certain kinds of applications (involving 32-bit floating point linear algebra operations), the figure is still kind of incredible.
People have been declaring Moore's Law dead for at least 10 years, and yet desktops keep catching up to supercomputing.
I know this increase is not in single-threaded general-purpose computing power in the fashion of the old gigahertz race... But on the other hand, the scope of what's considered "general purpose" keeps expanding too. Machine learning may be part of mainstream consumer applications in 5 years.
Moore's law is about scaling transistor density at minimal cost. Its colloquial understanding is actually a particular case of the Experience Curve Effect with an added bonus. What made transistors special and unique for a while was the huge benefit of Dennard Scaling. However, the days when we could expect the next version to be built from much smaller components, run much faster at the same power, and cost the same or less are gone.
The experience curve and returns to scale are still plodding along, but Dennard Scaling is gone and with it its magic, vanished by physics.
Moore's law specifies a constant exponential rate. Catching up to supercomputers could just mean that progress on all fronts is slowing. Even if the performance of supercomputers increases by 1.01x while personal computers increase by 1.02x, that would still fit the bill of "catching up to supercomputers." Neither would be the doubling required to fit the definition of Moore's law.
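A toy illustration of that point (made-up growth rates, just to show that "catching up" and "doubling" are independent claims):

    # PCs "catch up" to supercomputers even though neither is near Moore's-law doubling.
    super_perf, pc_perf = 1000.0, 1.0   # arbitrary starting units
    for year in range(20):
        super_perf *= 1.01              # supercomputers: +1% per year
        pc_perf    *= 1.02              # personal computers: +2% per year
    print(f"gap after 20 years: {super_perf / pc_perf:.0f}x (started at 1000x)")
    # ~821x -- the gap shrinks, yet neither curve doubles every ~2 years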
Moore's law is more about what happens when you get a lot of people working on a clearly defined problem, plus network effects where each new breakthrough helps everyone else get nearer to the next one. This is why you find Moore's-law-like curves in so many places. It's really a study of the rate of change in technological progress.
In a particular area (e.g., single-core CPU performance) it eventually follows a sigmoid curve (where all hockey sticks must go) as lower-hanging fruit exposes itself elsewhere (e.g., higher parallelism and GPU computing).
Kurzweil (say what you will) has written a lot on the topic.
With four or six cards, as many systems built around this component are likely to be spec'd, you'd be able to build something that would have placed in the top 5 at some point in the last ten years.
>Machine learning may be part of mainstream consumer applications in 5 years.
It doesn't seem likely, since the Titan X pulls 250 watts. Maybe if we get ASICs for deep learning that consume less power, and a lot of research is done on pruning ("deforesting") trained networks so users can compute classifications more cost-effectively, then sure.
Moore's Law: "Moore's law (/mɔərz.ˈlɔː/) is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years."
That's the definition I've always known. It has nothing to do with money or speed or performance, but usually these things are correlated.
Yes. My point was that the scope of applications seems to be expanding (even on the consumer level) such that single-threaded performance on an x86-style instruction set and memory model isn't the end-all definition of computing power anymore.
Well... this means you can train models that are very slightly larger in the same time. It won't scale linearly. And those slightly larger models will typically be a very, very, very tiny bit better - quality scales far below linearly.
Going by the specs, I would expect that training current models on the new Titan X will be 1.3x to 2.0x faster than with the older Titan X. In practice, this means deep neural nets that now take four weeks to train will take two or three weeks instead. That's great news.
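Just applying that guessed range to a four-week baseline:

    # Guessed 1.3x-2.0x speedup applied to a 4-week training run
    baseline_weeks = 4.0
    for speedup in (1.3, 2.0):
        print(f"{speedup:.1f}x faster -> {baseline_weeks / speedup:.1f} weeks")
    # 1.3x faster -> 3.1 weeks
    # 2.0x faster -> 2.0 weeks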
Bigger models don't (usually) mean dealing with higher quality images or something. They mean larger NN layers - and more of them - are able to fit in memory.
This is important for image stuff, true, but it becomes really important for things like Memory Networks for NLP tasks and Neural Turing Machine-like architectures. Bigger networks mean they can "remember" further "backwards" in their dependencies, and so do things they couldn't do before.
That's GREAT news.
Also, the speedup is pretty significant. People train for weeks at the moment - a 10% speedup often means saving days. This might not seem like much, but it lets you iterate much, much quicker.
From my own experience and from what I've seen others do, that's not what happens. People train for as long as they can afford to. If that is weeks, it's still going to be weeks. Instead, they're going to make the models slightly more sophisticated.
If people really cared about 20% speedups that much, they could train for 20% fewer iterations at a slightly higher learning rate. Or they could use very slightly less deep networks. Or slightly less wide networks. Or slightly less connected networks. Or, most realistically, some combination of many very small changes.
Of course it's nice to have faster machines, but this is hardly a big enough change to make a very noticeable difference, let alone a dramatic difference.
There's nothing wrong with iterative improvement, but let's not pretend it's a revolution.
With 4 of those in a desktop, you'd also have a serious cooling problem. Most likely, you'd burn them out. Note, I run many servers with > 4 of these types of GPUs in them for $day_job machine learning type of problems.
It's nothing to write home about when it comes to long-running applications. I have 2x 980tis with liquid cooling, and it runs pretty damn hot when going for a week at 90%+ capacity. I can't use it in the summer months, even leaving my AC on while I'm at work.
As great as this product may be, I went to the Stanford Deep Learning Meetup to learn more about how Baidu Research/Andrew Ng are solving large scale deep learning problems. I am disappointed by how much (unannounced) time was dedicated to the keynote/sales pitch.
If his job as Chief Scientist at Baidu is similar to that of the Director of Research at Google (Peter Norvig), he is too busy to be part of individual research projects.
He is probably one of the people who decide the direction of company research and who supervise those projects.
A moment to champion the importance of a great evangelist: research is important, but without propagating the results and helping with real-world deployment, it might be effort wasted :)
Since this question has come up so many times in the thread, my take is that FP64 and FP16 won't be as good as on the GP100. If the TITAN X is based on the consumer parts, it misses out on the GP100's FP improvements.
From Anandtech's GTX 1080 review, page 2: "As a result while GP100 has some notable feature/design elements for HPC – things such faster FP64 & FP16 performance, ECC, and significantly greater amounts of shared memory and register file capacity per CUDA core – these elements aren’t present in GP104 (and presumably, future Pascal consumer-focused GPUs)."
This requires confirmation, though; it depends on whether it uses the consumer chip or the HPC chip.
Well the Titan X is a new chip: GP102. So they could have picked and chosen features from either GP100 (professional Tesla) or GP104 (consumer GeForce).
We know almost certainly it has lower FP64 perf than GP100, because it has 22% fewer transistors and the FP64 units take up a lot of the transistor count. However, it is less clear what the FP16 performance is. Nvidia could have decided to match GP100 in that regard.
Doubt it; why have three versions of a chip (one with the fp16 improvement, one with the fp16/fp64 improvements, and one without them)? Especially when the Titan X has the same number of CUDA cores as the GP100.
When they quote "CUDA cores", they've been counting float32 fma functional units; e.g., Tesla K40 has 192 float32 fma units per SM x 15 SMs => 2880 "CUDA cores".
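So the headline number is just FP32 FMA units per SM times the SM count; for example (the GP102 breakdown of 28 SMs x 128 units is my assumption based on GP104's layout, not a confirmed spec):

    # "CUDA cores" = FP32 FMA units per SM x number of SMs
    def cuda_cores(fp32_units_per_sm, num_sms):
        return fp32_units_per_sm * num_sms

    print(cuda_cores(192, 15))  # Tesla K40 (Kepler): 2880
    print(cuda_cores(128, 28))  # new Titan X, assuming GP102 keeps GP104's 128/SM: 3584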
fp16 and fp64 are likely different functional units with different issue rates, as is the case with old hardware; unless they've managed to share the same hardware (since for P100 the quoted fp64 rate is exactly half the fp32 rate, and the fp16 is exactly double the fp32 rate).
There's no real analogue to a CPU core (or thread) on a GPU, there are warp schedulers (Nvidia) and functional units with varying throughput rates. The closest analogue to a CPU thread is the GPU warp (Nvidia), which shares a program counter and executes as a 32-wide vector. AMD wavefronts (64-wide) are a little bit different, but not by much. The CUDA "thread" is really an illusion that makes a SIMD lane easier to program to (sometimes...)
I agree, I think an SMX is the closest thing to a CPU core - it contains dispatch, cache, schedulers, etc. CUDA threads of course have important differences, since all threads in a warp move in lockstep. IMO, all of these contrived definitions are just to ease programmers into the CUDA/OpenCL model or for bragging rights (like how a Jetson has 192 cores!).
The 22% lower transistor count could also be due to the lack of HBM; the memory bus and controllers for that take a big chunk of die space. But I'm also thinking it will only do half the FP64 rate that the Tesla does.
I'm disappointed with this Titan, tbh. $1200 and no HBM2 memory? If AMD doesn't royally screw up Vega, it might be time for a shift, unless you explicitly need CUDA.
> it is less clear about what the FP16 performance is
Very curious about this as well. I got two GeForce 1080s as soon as they came out, and was very sad to discover that the stated advantage of Pascal architecture (speedup on FP16) is completely lacking on those.
I'm looking forward to seeing this in stores, as I've wanted to build a new machine learning rig for some time. The GTX 1080 just didn't seem like it would do the trick, with ostensibly limited software support and all.
I'm specifically wondering about FP16 handling, though. Half-precision FLOPS are never mentioned in the blog, nor on the NVidia page. It would be a shame if the FP16 units on this card were gimped in the same way as the GTX 1080's...
I don't think it'll be good. It's probably based on the same chips used in the GTX 1080 which doesn't have the FP16 improvements that GP100 has. I can't be a hundred percent on this since NVIDIA hasn't confirmed.
From Anandtech's 1080 review (page 2): "As a result while GP100 has some notable feature/design elements for HPC – things such faster FP64 & FP16 performance, ECC, and significantly greater amounts of shared memory and register file capacity per CUDA core – these elements aren’t present in GP104 (and presumably, future Pascal consumer-focused GPUs)."
NVIDIA always publishes just the numbers that it wants to publish. There are still improvements, but they are slowing down. I'm sure the engineers did their best.
I've had notifications set up on Newegg for every variant and I have been notified for only two of the lesser variants, and by the time I check the site it's already out of stock.
Interesting that this does not have HBM2 memory. Apparently HBM2 will only be on the Tesla Pascal GPUs, unless they put it on the 1080 Ti, which does not seem likely when the Titan does not have it.
I don't see HBM2 showing up in consumer parts soon. It seems Nvidia can get by with GDDR5(X) outside of the HPC space for this generation. That's good for cost, and it also reduces the risk, in terms of reliability, of throwing lots of new technology into a single product.
HPC obviously has different requirements, but Nvidia can work with integrators and customers in that segment with less backlash when fixing issues.
NVidia actually cares about research, researchers, and the scientific computing market.
Next time someone complains about the lack of OpenCL support in yet another framework, remember how much work NVidia puts into supporting people who use their cards for scientific computing, and how they listen to them.
Microsoft also carefully listened to developers while building DirectX, and by version 8, and especially 9, it really showed. But only Windows benefited from this. Having control of important GPU tech so strongly centered on a single company is never a good idea; it sets up a conflict of interest.
Something like OpenCL doesn't face the conflict of interest nVidia would face in porting core APIs across a wide set of competing technologies. With CUDA, nVidia prioritizes itself above AMD, Intel, FPGAs, and whatever parallel compute technology the future holds.
But the truth is that without the hardware vendors putting significant resources into OpenCL it just isn't competitive and won't be until that happens.
The truth is that most of the work in Deep Learning is developing new NN architectures and other algorithmic optimisations. If you are working in the field there is no reason to put up with second class support from non NVidia vendors - just build in TensorFlow, Torch or a couple of other frameworks and wait for the day (one day, we are promised!) when OpenCL is competitive. Then the framework backends get ported, your code keeps running the same, and it can run on all those other architectures.
Everyone has been waiting for that day since One Weird Trick[1]. There isn't really anything to indicate it is getting closer, and AMD's dismissal of the NVidia "doing something in the car industry"[2] doesn't give me a lot of confidence.
Anyway, I hope I'm wrong. Maybe Intel will step up.
While ATI was always about gaming, gaming, gaming, NVidia always worried about the pro market (Quadro, Linux support, even if with proprietary modules, etc.), and now it's doing the same with deep learning.
And now they can sell their deep learning processor for embedded applications (equals $$$)
There are some numbers for 2015 at [1]. In the Enterprise, HPC and Auto markets they had a bit more than $1B in revenue. I believe the Auto market includes some Tegra numbers, but that is "only" $180M.
Gaming is a little more than $2B, and "OEM & IP" is $1B.
Intel already surpassed the old Titan X with their Xeon Phi Knights Landing CPUs, which have a peak performance of up to 7 TFLOPS. It was about time they released a new Titan X.
I'll believe it when we actually get some real-world benchmarks with these. Knights Corner was also overhyped quite a bit by Intel's peak performance claims - mainly because the tools just weren't nearly on the same level as Nvidia's.
To really make use of this kind of hardware, the compiler is just the start. You need debuggers, profilers, MPI integration, and, last but not least, a sensible programming model in the first place. Nvidia has been pushing that stack for close to 10 years now.
I think the idea is that the programming model is not much different from what we've all been doing on multicore for a while now. Sure, there'll be evolution, but it's not nearly as different as GPU compute. Also, I personally am hoping that concurrent algos will be much better suited to this than matrix-based GPUs, which are basically good for massively parallel floating point only. For example, I'd love to see how a big fat BEAM VM works on this, not to mention golang. I.e., I'm actually thinking that Knights Landing will open up an oblique space where new things are possible, while not losing too much ground on the basic idea of the GPU, which is masses of dumb floating point performance.
Even in Python. Imagine firing up a 288-hardware-thread multiprocessing pool using an AVX-512-enabled NumPy... without changing your code. Personally that's numerical porn for me, and I want Intel to shut up and take my money, now.
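Something like this, presumably (a minimal sketch; nothing here is Knights Landing-specific, the hope is just that the same code picks up more cores and a wider-SIMD BLAS underneath):

    # Minimal sketch: plain multiprocessing pool doing NumPy work.
    import numpy as np
    from multiprocessing import Pool

    def chunk_work(seed):
        rng = np.random.RandomState(seed)
        a, b = rng.rand(512, 512), rng.rand(512, 512)
        return float(np.linalg.norm(a @ b))   # matmul goes to whatever BLAS NumPy links

    if __name__ == "__main__":
        with Pool() as pool:                  # defaults to one worker per core
            results = pool.map(chunk_work, range(64))
        print(sum(results))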
Not having to change your (OpenMP) code was exactly how Knights Corner was marketed, and it was also the reason why it tanked. IMO we're far from taking reasonable advantage of massive parallelism without changing code. The main reason is blocking - on multiple levels: grid, multicore, vector. Compilers just aren't smart enough to do that for you - and at least CUDA has merged two of those together, so you only need to care about two levels of blocking.
Talking about Numpy, isn't that what Numba Pro is already doing with GPU?
As this thread is filled with people that know way, way more about CUDA and OpenCL than I do, I hope you will indulge me a serious question: I get that graphics cards are great for floating point operations, and that bitwise binary operations are supported by these libraries, but are they similarly efficient at them?
Some background: I occasionally find myself doing FPGA design for my doctoral work and am realizing that the job market for when I get done may be better for me if I was fluent in GPGPU programming as it is easier to build, manage, and deploy a cluster of such machines than the same for FPGAs.
My current problem has huge numbers of XOR operations on large vectors and if OpenCL or CUDA could be learned and spun up quickly (I have a CS background) I may be inclined to jump aboard this train vs buying another FPGA for my problem.
Throughput of integer operations ranges between 25% and 100% of floating point FMA performance. 32-bit bitwise AND, OR, XOR throughput is equal to 32-bit FMA throughput.
It depends upon the op / byte loaded intensity. Nvidia packs their GPUs with a lot of float32 (or float64) units because some problems (e.g., convolution, or more typical HPC problems like PDEs, which will probably be done in float64) have a high flop / byte ratio.
A problem just calculating, say, hamming distance or 1-2 integer bit ops per integer word loaded will probably be memory bandwidth bound rather than integer op throughput limited. More complicated operations (e.g., cryptographic hashing) that have a higher iop / byte loaded will be limited by the reduced throughput of the integer op functional units rather than memory bandwidth.
For "deep learning", convolution is one of the few operations that tends to be compute rather than memory b/w bound. It's my understanding that Sgemm (float32 matrix multiplication) has been memory b/w limited for a while on Nvidia GPUs. Though, if you muck around with the architecture (as with Pascal), the ratio of compute to memory b/w to compute resources (smem, register file memory) may change the ratios up.
AMD GPUs have a reputation for speedy integer operations, which are essentially bit-wise operations, so they are often chosen for bitcoin mining. So you might want to consider learning OpenCL, since CUDA runs only on NVidia cards.
I've spent a lot of time using both OpenCL and CUDA, and I would recommend CUDA not because I like NVidia as a company, but because your productivity will be so much higher.
NVidia has really invested into their developer resources. Of course, if your time to write code and debug driver issues isn't that important, then an AMD card using OpenCL might be the right choice.
(I'll try to be honest about my bias against NVidia, so you can more accurately interpret my suggestions. I think along the lines of Linus Torvalds with regard to NVidia... http://www.wired.com/2012/06/nvidia-linus-torvald/ )
I think both of these can be learned reasonably quickly if you know a bit about C programming. I'd also tend to agree that GPGPU is probably a better bet than FPGAs these days.
For comparison, the first supercomputer that was in the Teraflops range and that was available outside of the nuclear labs was the Pittsburgh Terascale machine.
It cost $45M, and peaked at 6 Teraflops. (I think that was on 32 bit floats, but I can't find the specs. It might have been 64 bit floats)
"Total TCS floor space is roughly that of a basketball court. It uses 14 miles of high-bandwidth interconnect cable to maintain communication among its 3,000 processors. Another seven miles of serial, copper cable and a mile of fiber-optic cable provide for data handling.
The TCS requires 664 kilowatts of power, enough to power 500 homes. It produces heat equivalent to burning 169 pounds of coal an hour, much of which is used in heating the Westinghouse Energy Center. To cool the computer room, more than 600 feet of eight-inch cooling pipe, weighing 12 tons, circulate up to 900 gallons of water per minute, and twelve 30-ton air-handling units provide cooling capacity equivalent to 375 room air conditioners."
Isn't this simple economics? The R&D cost is amortized over a larger market. You can always try to build specialized chips but the market might not be nearly as big so there's a lot less money. The general purpose market still moves faster.
NVIDIA says: 44 TOPS INT8 (new deep learning inferencing instruction)
I think it's related to storing float arrays as arrays of 8-bit integers in memory and converting them into floats just before use. It's 2x more space-efficient than fp16.
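A minimal sketch of that idea (generic symmetric scaling, not necessarily NVIDIA's actual scheme):

    # Store float weights as int8 plus one scale factor; dequantize before use.
    import numpy as np

    weights = np.random.randn(1024).astype(np.float32)

    scale = np.abs(weights).max() / 127.0                        # one scale for the array
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    restored = q.astype(np.float32) * scale                      # convert back just before use
    print(q.nbytes, "bytes stored vs", weights.nbytes, "as fp32")
    print("max abs error:", float(np.abs(weights - restored).max()))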
Does anyone know the performance of the half-precision units (16-bit floating point)? It's probably 1/64 the FP32 rate, but Nvidia may have been generous and uncapped it at 2x FP32 like GP100, which would be a big difference (a 128x factor!)
If you have a 40 PCIE lane CPU (and not NVME drives) was the price the thing that actually held you back?
The scaling past 2 cards is also pretty horrible. Depending on how much of a performance increase you need, it would most likely still be cheaper to get 2 new Titan X's than to expand your Maxwell setup, if you can offload the old cards at $400-500.
It would probably be pretty useless for gaming over 2x or 3x SLI but depending on whether any CUDA work you need to do is compute or memory (not bandwidth) bound, it can still make a difference worth the cost.
Well, SLI has nothing to do with CUDA. You can have as many CUDA-capable cards as you want and use all of them; they don't even have to be from the same model or generation.
Since he mentioned SLI, I assumed it was for gaming, since that's the only thing that actually limits you.
But that said, the number of PCIe lanes is still a problem. This is why people who do compute work opt (well, are forced) to use Xeon parts with QPI to get enough PCIe lanes; the number of lanes available on standard desktop parts (including the PCH) is pretty pathetic, even if you go with the full 40-lane E-series CPUs.
It can be even worse if you want to use SATA Express or NVMe drives, since they also use your PCIe lanes, as do a few other things like M.2 wireless network cards (pretty common these days on upper-midrange and high-end motherboards) and Thunderbolt.
To repeat what I just said with slightly different wording: yes, my point was specifically about the difference between users who want the Titan X for supercomputing versus those who want it for gaming. You can have as many CUDA cards as you want, but you face the same problem of limited PCIe bandwidth, just as gamers who want SLI do. If your use case is CUDA and not gaming, then the usefulness of four Titans depends entirely on whether your algorithms are limited by memory, memory bandwidth, or computing power.
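To put rough numbers on the PCIe point (theoretical PCIe 3.0 rates, ignoring protocol overhead; real throughput is lower, and the 2GB buffer is just a hypothetical):

    # Rough host<->GPU transfer-time estimate at different PCIe 3.0 link widths.
    GB_PER_LANE = 0.985     # theoretical PCIe 3.0 rate per lane
    buffer_gb = 2.0         # hypothetical batch/buffer size

    for lanes in (16, 8, 4):
        bw = lanes * GB_PER_LANE
        print(f"x{lanes}: ~{bw:.1f} GB/s -> {buffer_gb / bw * 1000:.0f} ms per {buffer_gb:.0f} GB transfer")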
What was the last dual-gpu card which had the name ending with X2? GeForce 9800 GX2? That was released in....2008. 8 years ago. I think by now it's safe to use X2 for other things, no?
Well, since the 1080 Founder's Edition can only run Crysis 3 at ~35fps, I doubt it, considering the Titan X is supposed to be roughly ~25% faster for gaming applications:
Old account that only posts NVIDIA links, whose only comment is this one. No one was "actually speechless" about your sales pitch for a GPU slightly better than the last one.
I'm guessing that GP102 is actually a 3840 core part (50% bigger than GP104), but this initial release is being cut down to improve yields on the immature 16nm process.
They did something similar with the original GK110 Titans - the first Titan had 2688 cores enabled, and a fully unlocked 2880 core Titan Black came later.
Both my (Maxwell) Titan X's boost to 1480mhz in SLI (the one with the slightly better ASIC 94% "asic quality" boosts to 1515 if i manually overclock it in unlinked mode).
If the new Titan X is anything like the 1080 and can boost to around 2GHz, it would be a monster of a card. If it's really held back to only ~1500MHz, because of either the power limit or the GP102 ASIC simply not clocking as high as the GP104 and GP106 parts, it would be a pretty small improvement over the existing Titan X (probably around 15%).
No confirmed technical details yet AFAIK, but it's rumored to be 450-475 mm^2 versus the old 600mm^2. It's also very probably a die-harvest (cores disabled).
The chip is 50% bigger (12bn transistors vs. 8bn), I assume (there's no details on the card yet) the additional budget went into things like caches to give the cores an additional performance boost.
Yes, gen to gen was a 15-20% overall improvement (which is also questionable, because when a new gen arrives the previous gen all of a sudden starts losing performance in games; it was really bad between Kepler and Maxwell, where the 780/780ti suddenly lost 5-15% performance in certain games with the initial "Maxwell drivers").
Depends on which figures you look at. http://www.techspot.com/article/928-five-generations-nvidia-... has a reasonable comparison. Varies between 20-50% over the range of that sample. The increase in performance between the 980/1080 or the titans does not seem out of the "ordinary".
780ti > 980 = ~15% increase
780 > 780ti = ~12% increase
680 (arguably the 690 should've been the reference, even though it was a dual-GPU card) > 780 = ~27% increase.
For the most part, in generations where there wasn't a near ~2-year gap, and where there wasn't a huge change in GPU memory type or a major architecture change (like dropping Fermi's hardware scheduler for Kepler in order to save silicon space for things that actually mattered for games at the time), there isn't a 50% increase gen to gen.
A 15-20% increase gen to gen between cards at comparable price points (Nvidia has been making this harder lately by charging $150-200 more per card, effectively bumping each price point) is what you should be expecting.
I have not said the engineering work is automatic. It is not too far from the area that I work in, and I am up to date in the subject. The performance gains for this generation of cards are in line with the existing trend, they are not particularly exciting in that they rise above that trend. This is unrelated to the issue of whether or not the architectural work is hard, and it is somewhat sad that the difference between these two points has to be explicitly stated.
The in-line results of this generation do diverge from Nvidia's over-the-top marketing in the past year about the 10x increase in performance that would be delivered, largely due to the upgrade to HBM. It is certainly probable that had they delivered on HBM we would be seeing a rare jump in performance above the trend, but clearly this has not happened. It seems unlikely to arrive on the 1080 Ti after not making an appearance on the Titan, and so we will need to wait another generation to see what difference it actually makes.
If it's not too far from the area you work in you should be more understanding of the engineering challenges here.
We can't clock these things faster like we used to in the days of 6Mhz CPUs. We can't shrink feature size since 10nm is proving to be a difficult node to crack. We can't jam more features onto the die since we're already producing some of the largest possible dies.
The easy gains are gone. Now we're stuck dealing with the hard stuff. Gains will be slower.
As I said above, there are two separate issues: whether or not it is hard to deliver the same (relative) improvement each generation, and whether or not Nvidia has delivered an improvement in line with their normal trend or exceeded it.
These are not the same thing. As I said, I am well aware of the engineering challenges at this scale. You are trying to argue that it is hard: so what? Should Nvidia get a badge for "trying hard"? It still doesn't change the fact that what they have achieved this generation is the same (relative) improvement as previous generations. This is not what they were selling in the run-up to the 1080 launch, when they were still claiming 10x improvements due to the new memory subsystem.
They did not do what they said they would. That is quite a simple fact, so why argue that what they have done is hard? The difficulty is irrelevant. Do you think an evaluation of the performance they achieved should include how hard they worked on it?
They fit 12 billion transistors on their latest die; that's even more than a top-end Xeon. You can bitch all day from your comfortable armchair, but they're leading the industry in transistor counts, and that alone can't be easy. They're not silk-screening t-shirts here, they're pushing their process to the limits.
If they were falling behind in transistor counts you might have a case, but they're not.
So again you are arguing that they are trying really hard. What does that have to do with whether or not this generation is above the trend-line or not?
I'm arguing they're at the top of their game and they're up against a wall that is not easy to move.
It doesn't really matter if it's above or below the trend line. Maybe the trend line is bullshit now because all the things driving it were the easy gains we've exhausted.
[1] https://www.top500.org/list/2015/06/?page=5