Great news all-around for deep learning practitioners.
Nvidia says memory bandwidth is "480GB/s," which will probably have the most impact on deep learning applications (lots of giant matrices must be fetched from memory, repeatedly, for extended periods of time). For comparison, the GTX 1080's memory bandwidth is quoted as "320GB/s." The new Titan X also has 3,584 CUDA cores (1.4x the GTX 1080's count) and 12GB of RAM (1.5x the GTX 1080's).
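Back of the envelope (a quick sketch in Python; the GTX 1080 reference figures of 2560 cores and 8GB are its published specs, not quoted above):

    # Rough ratios implied by the numbers above
    titan_x  = {"cuda_cores": 3584, "memory_gb": 12, "bandwidth_gbs": 480}
    gtx_1080 = {"cuda_cores": 2560, "memory_gb": 8,  "bandwidth_gbs": 320}

    for spec in titan_x:
        print(f"{spec}: {titan_x[spec] / gtx_1080[spec]:.2f}x the GTX 1080")
    # cuda_cores: 1.40x, memory_gb: 1.50x, bandwidth_gbs: 1.50x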
We'll have to wait for benchmarks, but based on specs, this new Titan X looks like the best single GPU card you can buy for deep learning today. For certain deep learning applications, if properly configured, two GTX 1080's might outperform the Titan X and cost about the same, but that's not an apples-to-apples comparison.
A beefy desktop computer with four of these Titan X's will have 44 Teraflops of raw computing power, about "one fifth" of the raw computing power of the world's current 500th most powerful supercomputer.[1] While those 44 Teraflops are usable only for certain kinds of applications (involving 32-bit floating point linear algebra operations), the figure is still kind of incredible.
People have been declaring Moore's Law dead for at least 10 years, and yet desktops keep catching up to supercomputing.
I know this increase is not in single-threaded general-purpose computing power in the fashion of the old gigahertz race... But on the other hand, the scope of what's considered "general purpose" keeps expanding too. Machine learning may be part of mainstream consumer applications in 5 years.
Moore's law is about scaling transistor density at minimal cost. Its colloquial understanding is actually a particular case of the Experience Curve Effect with an added bonus. What made transistors special and unique for a while was the huge benefit of Dennard Scaling. However, the days when we could expect the next version to be built from much smaller components, run much faster at the same power, and cost the same or less are gone.
The experience curve and returns to scale are still plodding along, but Dennard Scaling is gone and with it its magic, vanished by physics.
Moore's law specifies a constant exponential rate. Catching up to supercomputers could just mean that progress on all fronts is slowing. Even if the performance of supercomputers increases by 1.01x while personal computers increase by 1.02x, that would still fit the bill of "catching up to supercomputers." Neither would be the doubling required to fit the definition of Moore's law.
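A toy illustration of that point (made-up growth rates, just to show that "catching up" and "doubling" are independent claims):

    # PCs "catch up" to supercomputers even though neither is near Moore's-law doubling.
    super_perf, pc_perf = 1000.0, 1.0   # arbitrary starting units
    for year in range(20):
        super_perf *= 1.01              # supercomputers: +1% per year
        pc_perf    *= 1.02              # personal computers: +2% per year
    print(f"gap after 20 years: {super_perf / pc_perf:.0f}x (started at 1000x)")
    # ~821x -- the gap shrinks, yet neither curve doubles every ~2 years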
Moore's law is more about what happens when you get a lot of people working on a clearly defined problem, plus network effects where each new breakthrough helps everyone else get nearer to the next one. This is why you find Moore's-law-like curves in so many places. It's really a study of the rate of change in technological progress.
In a particular area (e.g., single-core CPU performance) it eventually follows a sigmoid curve (where all hockey sticks must go) as lower-hanging fruit exposes itself elsewhere (e.g., higher parallelism and GPU computing).
Kurzweil (say what you will) has written a lot on the topic.
With four or six cards, as many systems built around this component are likely to be spec'd, you'd be able to build something that would have placed in the top 5 at some point in the last ten years.
>Machine learning may be part of mainstream consumer applications in 5 years.
It doesn't seem likely, since the Titan X pulls 250 watts. Maybe if we get ASICs for deep learning that consume less power, and a lot of research is done on pruning ("deforesting") trained networks so users can compute classifications more cost-effectively, then sure.
Moore's Law: "Moore's law (/mɔərz.ˈlɔː/) is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years."
That's the definition I've always known. It has nothing to do with money or speed or performance, but usually these things are correlated.
Yes. My point was that the scope of applications seems to be expanding (even on the consumer level) such that single-threaded performance on an x86-style instruction set and memory model isn't the end-all definition of computing power anymore.
Well... this means you can train models that are very slightly larger in the same time. It won't scale linearly. And those slightly larger models will typically be a very, very, very tiny bit better - quality scales far below linearly.
Going by the specs, I would expect that training current models on the new Titan X will be 1.3x to 2.0x faster than with the older Titan X. In practice, this means deep neural nets that now take four weeks to train will take two or three weeks instead. That's great news.
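Just applying that guessed range to a four-week baseline:

    # Guessed 1.3x-2.0x speedup applied to a 4-week training run
    baseline_weeks = 4.0
    for speedup in (1.3, 2.0):
        print(f"{speedup:.1f}x faster -> {baseline_weeks / speedup:.1f} weeks")
    # 1.3x faster -> 3.1 weeks
    # 2.0x faster -> 2.0 weeks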
Bigger models don't (usually) mean dealing with higher quality images or something. They mean larger NN layers - and more of them - are able to fit in memory.
This is important for image stuff, true, but it becomes really important for things like Memory Networks for NLP tasks and Neural Turing Machine-like architectures. Bigger networks mean they can "remember" further "backwards" in their dependencies, and so do things they couldn't do before.
That's GREAT news.
Also, the speedup is pretty significant. People train for weeks at the moment - a 10% speedup often means saving days. This might not seem like much, but it lets you iterate much, much quicker.
From my own experience and from what I've seen others do, that's not what happens. People train for as long as they can afford to. If that is weeks, it's still going to be weeks. Instead, they're going to make the models slightly more sophisticated.
If people really cared about 20% speedups that much, they could train for 20% fewer iterations at a slightly higher learning rate. Or they could use very slightly less deep networks. Or slightly less wide networks. Or slightly less connected networks. Or, most realistically, some combination of many very small changes.
Of course it's nice to have faster machines, but this is hardly a big enough change to make a very noticeable difference, let alone a dramatic difference.
There's nothing wrong with iterative improvement, but let's not pretend it's a revolution.
With 4 of those in a desktop, you'd also have a serious cooling problem. Most likely, you'd burn them out. Note, I run many servers with > 4 of these types of GPUs in them for $day_job machine learning type of problems.
It's nothing to write home about when it comes to long-running applications. I have 2x 980tis with liquid cooling, and it runs pretty damn hot when going for a week at 90%+ capacity. I can't use it in the summer months, even leaving my AC on while I'm at work.
As great as this product may be, I went to the Stanford Deep Learning Meetup to learn more about how Baidu Research/Andrew Ng are solving large scale deep learning problems. I am disappointed by how much (unannounced) time was dedicated to the keynote/sales pitch.
If his job as Chief Scientist at Baidu is similar to that of the Director of Research at Google (Peter Norvig), he is too busy to be part of individual research projects.
He is probably one of the people who decide the direction of company research and who supervise those projects.
A moment to champion the importance of a great evangelist: research is important, but without propagating the results and helping with real-world deployment, it might be effort wasted :)
Since this question has come up so many times in the thread, my take is that FP64 and FP16 won't be as good as on the GP100. If the TITAN X is based on the consumer parts, it misses out on the GP100's FP improvements.
From Anandtech's GTX 1080 review, page 2: "As a result while GP100 has some notable feature/design elements for HPC – things such faster FP64 & FP16 performance, ECC, and significantly greater amounts of shared memory and register file capacity per CUDA core – these elements aren’t present in GP104 (and presumably, future Pascal consumer-focused GPUs)."
This requires confirmation, though; it depends on whether it uses the consumer chip or the HPC chip.
Well the Titan X is a new chip: GP102. So they could have picked and chosen features from either GP100 (professional Tesla) or GP104 (consumer GeForce).
We know almost certainly it has lower FP64 perf than GP100, because it has 22% fewer transistors and the FP64 units take up a lot of the transistor count. However, it is less clear what the FP16 performance is. Nvidia could have decided to match GP100 in that regard.
Doubt it; why have three versions of a chip (one with the fp16 improvement, one with the fp16/fp64 improvements, and one without them)? Especially when the Titan X has the same number of CUDA cores as the GP100.
When they quote "CUDA cores", they've been counting float32 fma functional units; e.g., Tesla K40 has 192 float32 fma units per SM x 15 SMs => 2880 "CUDA cores".
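So the headline number is just FP32 FMA units per SM times the SM count; for example (the GP102 breakdown of 28 SMs x 128 units is my assumption based on GP104's layout, not a confirmed spec):

    # "CUDA cores" = FP32 FMA units per SM x number of SMs
    def cuda_cores(fp32_units_per_sm, num_sms):
        return fp32_units_per_sm * num_sms

    print(cuda_cores(192, 15))  # Tesla K40 (Kepler): 2880
    print(cuda_cores(128, 28))  # new Titan X, assuming GP102 keeps GP104's 128/SM: 3584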
fp16 and fp64 are likely different functional units with different issue rates, as is the case with old hardware; unless they've managed to share the same hardware (since for P100 the quoted fp64 rate is exactly half the fp32 rate, and the fp16 is exactly double the fp32 rate).
There's no real analogue to a CPU core (or thread) on a GPU, there are warp schedulers (Nvidia) and functional units with varying throughput rates. The closest analogue to a CPU thread is the GPU warp (Nvidia), which shares a program counter and executes as a 32-wide vector. AMD wavefronts (64-wide) are a little bit different, but not by much. The CUDA "thread" is really an illusion that makes a SIMD lane easier to program to (sometimes...)
I agree, I think an SMX is the closest thing to a CPU core - it contains dispatch, cache, schedulers, etc. CUDA threads of course have important differences, since all threads in a warp move in lockstep. IMO, all of these contrived definitions are just to ease programmers into the CUDA/OpenCL model or for bragging rights (like how a Jetson has 192 cores!).
The 22% lower transistor count could also be due to the lack of HBM; the memory bus and controllers for that take a big chunk of die space. But I'm also thinking it will only do half the FP64 rate that the Tesla does.
I'm disappointed with this Titan, tbh. $1200 and no HBM2 memory? If AMD doesn't royally screw up Vega, it might be time for a shift, unless you explicitly need CUDA.
> it is less clear about what the FP16 performance is
Very curious about this as well. I got two GeForce 1080s as soon as they came out, and was very sad to discover that the stated advantage of Pascal architecture (speedup on FP16) is completely lacking on those.
I'm looking forward to seeing this in stores, as I've wanted to build a new machine learning rig for some time. The GTX 1080 just didn't seem like it would do the trick, with ostensibly limited software support and all.
I'm specifically wondering about FP16 handling, though. Half-precision FLOPS are never mentioned in the blog, nor on the NVidia page. It would be a shame if the FP16 units on this card were gimped in the same way as the GTX 1080's...
I don't think it'll be good. It's probably based on the same chips used in the GTX 1080 which doesn't have the FP16 improvements that GP100 has. I can't be a hundred percent on this since NVIDIA hasn't confirmed.
From Anandtech's 1080 review (page 2): "As a result while GP100 has some notable feature/design elements for HPC – things such faster FP64 & FP16 performance, ECC, and significantly greater amounts of shared memory and register file capacity per CUDA core – these elements aren’t present in GP104 (and presumably, future Pascal consumer-focused GPUs)."
NVIDIA always publishes just the numbers that it wants to publish. There are still improvements, but they are slowing down. I'm sure the engineers did their best.
I've had notifications set up on Newegg for every variant and I have been notified for only two of the lesser variants, and by the time I check the site it's already out of stock.
Interesting that this does not have HBM2 memory. Apparently HBM2 will only be on the Tesla Pascal GPUs, unless they put it on the 1080 Ti, which does not seem likely when the Titan does not have it.
I don't see HBM2 showing up in consumer parts soon. It seems Nvidia can get by with GDDR5(X) outside of the HPC space for this generation. That's good for cost, and it also reduces the risk, in terms of reliability, of throwing lots of new technology into a single product.
HPC obviously has different requirements, but Nvidia can work with integrators and customers in that segment with less backlash when fixing issues.
NVidia actually cares about research, researchers, and the scientific computing market.
Next time someone complains about the lack of OpenCL support in yet another framework, remember how much work NVidia puts into supporting people who use their cards for scientific computing, and how they listen to them.
Microsoft also carefully listened to developers while building DirectX, and by version 8, and especially 9, it really showed. But only Windows benefited from this. Having control of important GPU tech so strongly centered on a single company is never a good idea; it sets up a conflict of interest.
Something like OpenCL doesn't face the conflict of interest nVidia would face in porting core APIs across a wide set of competing technologies. With CUDA, nVidia prioritizes itself above AMD, Intel, FPGAs, and whatever parallel compute technology the future holds.
But the truth is that without the hardware vendors putting significant resources into OpenCL it just isn't competitive and won't be until that happens.
The truth is that most of the work in Deep Learning is developing new NN architectures and other algorithmic optimisations. If you are working in the field there is no reason to put up with second class support from non NVidia vendors - just build in TensorFlow, Torch or a couple of other frameworks and wait for the day (one day, we are promised!) when OpenCL is competitive. Then the framework backends get ported, your code keeps running the same, and it can run on all those other architectures.
Everyone has been waiting for that day since One Weird Trick[1]. There isn't really anything to indicate it is getting closer, and AMD's dismissal of the NVidia "doing something in the car industry"[2] doesn't give me a lot of confidence.
Anyway, I hope I'm wrong. Maybe Intel will step up.
While ATI was always about gaming, gaming, gaming, NVidia always worried about the pro market (Quadro, Linux support, even if with proprietary modules, etc.), and now it's doing the same with deep learning.
And now they can sell their deep learning processor for embedded applications (equals $$$)
There are some numbers for 2015 at [1]. In the Enterprise, HPC and Auto markets they had a bit more than $1B in revenue. I believe the Auto market includes some Tegra numbers, but that is "only" $180M.
Gaming is a little more than $2B, and "OEM & IP" is $1B.
Intel already surpassed the old Titan X with their Xeon Phi Knights Landing CPUs, which have a peak performance of up to 7 TFLOPS. It was about time they released a new Titan X.
I'll believe it when we actually get some real-world benchmarks with these. Knights Corner was also overhyped quite a bit by Intel's peak performance claims - mainly because the tools just weren't nearly on the same level as Nvidia's.
To really make use of this kind of hardware, the compiler is just the start. You need debuggers, profilers, MPI integration, and, last but not least, a sensible programming model in the first place. Nvidia has been pushing that stack for close to 10 years now.
I think the idea is that the programming model is not much different from what we've all been doing on multicore for a while now. Sure, there'll be evolution, but it's not nearly as different as GPU compute. Also, I personally am hoping that concurrent algos will be much better suited to this than matrix-based GPUs, which are basically good for massively parallel floating point only. For example, I'd love to see how a big fat BEAM VM works on this, not to mention golang. I.e., I'm actually thinking that Knights Landing will open up an oblique space where new things are possible, while not losing too much ground on the basic idea of the GPU, which is masses of dumb floating point performance.
Even in Python. Imagine firing up a 288-hardware-thread multiprocessing pool using an AVX-512-enabled NumPy... without changing your code. Personally that's numerical porn for me, and I want Intel to shut up and take my money, now.
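Something like this, presumably (a minimal sketch; nothing here is Knights Landing-specific, the hope is just that the same code picks up more cores and a wider-SIMD BLAS underneath):

    # Minimal sketch: plain multiprocessing pool doing NumPy work.
    import numpy as np
    from multiprocessing import Pool

    def chunk_work(seed):
        rng = np.random.RandomState(seed)
        a, b = rng.rand(512, 512), rng.rand(512, 512)
        return float(np.linalg.norm(a @ b))   # matmul goes to whatever BLAS NumPy links

    if __name__ == "__main__":
        with Pool() as pool:                  # defaults to one worker per core
            results = pool.map(chunk_work, range(64))
        print(sum(results))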
Not having to change your (OpenMP) code was exactly how Knights Corner was marketed, and it was also the reason why it tanked. IMO we're far from taking reasonable advantage of massive parallelism without changing code. The main reason is blocking - on multiple levels: grid, multicore, vector. Compilers just aren't smart enough to do that for you - and at least CUDA has merged two of those together, so you only need to care about two levels of blocking.
Talking about Numpy, isn't that what Numba Pro is already doing with GPU?
As this thread is filled with people that know way, way more about CUDA and OpenCL than I do, I hope you will indulge me a serious question: I get that graphics cards are great for floating point operations, and that bitwise binary operations are supported by these libraries, but are they similarly efficient at them?
Some background: I occasionally find myself doing FPGA design for my doctoral work and am realizing that the job market for when I get done may be better for me if I was fluent in GPGPU programming as it is easier to build, manage, and deploy a cluster of such machines than the same for FPGAs.
My current problem has huge numbers of XOR operations on large vectors and if OpenCL or CUDA could be learned and spun up quickly (I have a CS background) I may be inclined to jump aboard this train vs buying another FPGA for my problem.
Throughput of integer operations ranges between 25% and 100% of floating point FMA performance. 32-bit bitwise AND, OR, XOR throughput is equal to 32-bit FMA throughput.
It depends upon the op / byte loaded intensity. Nvidia packs their GPUs with a lot of float32 (or float64) units because some problems (e.g., convolution, or more typical HPC problems like PDEs, which will probably be done in float64) have a high flop / byte ratio.
A problem just calculating, say, hamming distance or 1-2 integer bit ops per integer word loaded will probably be memory bandwidth bound rather than integer op throughput limited. More complicated operations (e.g., cryptographic hashing) that have a higher iop / byte loaded will be limited by the reduced throughput of the integer op functional units rather than memory bandwidth.
For "deep learning", convolution is one of the few operations that tends to be compute rather than memory b/w bound. It's my understanding that Sgemm (float32 matrix multiplication) has been memory b/w limited for a while on Nvidia GPUs. Though, if you muck around with the architecture (as with Pascal), the ratio of compute to memory b/w to compute resources (smem, register file memory) may change the ratios up.
AMD GPUs have a reputation for speedy integer operations, which are essentially bit-wise operations, so they are often chosen for bitcoin mining. So you might want to consider learning OpenCL, since CUDA runs only on NVidia cards.
I've spent a lot of time using both OpenCL and CUDA, and I would recommend CUDA not because I like NVidia as a company, but because your productivity will be so much higher.
NVidia has really invested into their developer resources. Of course, if your time to write code and debug driver issues isn't that important, then an AMD card using OpenCL might be the right choice.
(I'll try to be honest about my bias against NVidia, so you can more accurately interpret my suggestions. I think along the lines of Linus Torvalds with regard to NVidia... http://www.wired.com/2012/06/nvidia-linus-torvald/ )
I think both of these can be learned reasonably quickly if you know a bit about C programming. I'd also tend to agree that GPGPU is probably a better bet than FPGAs these days.
For comparison, the first supercomputer that was in the Teraflops range and that was available outside of the nuclear labs was the Pittsburgh Terascale machine.
It cost $45M, and peaked at 6 Teraflops. (I think that was on 32 bit floats, but I can't find the specs. It might have been 64 bit floats)
"Total TCS floor space is roughly that of a basketball court. It uses 14 miles of high-bandwidth interconnect cable to maintain communication among its 3,000 processors. Another seven miles of serial, copper cable and a mile of fiber-optic cable provide for data handling.
The TCS requires 664 kilowatts of power, enough to power 500 homes. It produces heat equivalent to burning 169 pounds of coal an hour, much of which is used in heating the Westinghouse Energy Center. To cool the computer room, more than 600 feet of eight-inch cooling pipe, weighing 12 tons, circulate up to 900 gallons of water per minute, and twelve 30-ton air-handling units provide cooling capacity equivalent to 375 room air conditioners."
Isn't this simple economics? The R&D cost is amortized over a larger market. You can always try to build specialized chips but the market might not be nearly as big so there's a lot less money. The general purpose market still moves faster.
NVIDIA says: 44 TOPS INT8 (new deep learning inferencing instruction)
I think it's related to storing float arrays as arrays of 8-bit integers in memory and converting them into floats just before use. It's 2x more space-efficient than fp16.
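A minimal sketch of that idea (generic symmetric scaling, not necessarily NVIDIA's actual scheme):

    # Store float weights as int8 plus one scale factor; dequantize before use.
    import numpy as np

    weights = np.random.randn(1024).astype(np.float32)

    scale = np.abs(weights).max() / 127.0                        # one scale for the array
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    restored = q.astype(np.float32) * scale                      # convert back just before use
    print(q.nbytes, "bytes stored vs", weights.nbytes, "as fp32")
    print("max abs error:", float(np.abs(weights - restored).max()))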
Does anyone know the performance of the half-precision units (16-bit floating point)? It's probably 1/64 the FP32 rate, but Nvidia may have been generous and uncapped it at 2x FP32 like GP100, which would be a big difference (a 128x factor!)
If you have a 40 PCIE lane CPU (and not NVME drives) was the price the thing that actually held you back?
The scaling past 2 cards is also pretty horrible. Depending on how much of a performance increase you need, it would most likely still be cheaper to get 2 new Titan X's than to expand your Maxwell setup, if you can offload the old cards at $400-500.
It would probably be pretty useless for gaming over 2x or 3x SLI but depending on whether any CUDA work you need to do is compute or memory (not bandwidth) bound, it can still make a difference worth the cost.
Well, SLI has nothing to do with CUDA. You can have as many CUDA-capable cards as you want and use all of them; they don't even have to be from the same model or generation.
Since he mentioned SLI, I assumed it was for gaming, since that's the only thing that actually limits you.
But that said, the number of PCIe lanes is still a problem. This is why people who do compute work opt (well, are forced) to use Xeon parts with QPI to get enough PCIe lanes; the number of lanes available on standard desktop parts (including the PCH) is pretty pathetic, even if you go with the full 40-lane E-series CPUs.
It can be even worse if you want to use SATA Express or NVMe drives, since they also use your PCIe lanes, as do a few other things like M.2 wireless network cards (pretty common these days on upper-midrange and high-end motherboards) and Thunderbolt.
To repeat what I just said with slightly different wording: yes, my point was specifically about the difference between users who want the Titan X for supercomputing versus those who want it for gaming. You can have as many CUDA cards as you want, but you face the same problem of limited PCIe bandwidth, just as gamers who want SLI do. If your use case is CUDA and not gaming, then the usefulness of four Titans depends entirely on whether your algorithms are limited by memory, memory bandwidth, or computing power.
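To put rough numbers on the PCIe point (theoretical PCIe 3.0 rates, ignoring protocol overhead; real throughput is lower, and the 2GB buffer is just a hypothetical):

    # Rough host<->GPU transfer-time estimate at different PCIe 3.0 link widths.
    GB_PER_LANE = 0.985     # theoretical PCIe 3.0 rate per lane
    buffer_gb = 2.0         # hypothetical batch/buffer size

    for lanes in (16, 8, 4):
        bw = lanes * GB_PER_LANE
        print(f"x{lanes}: ~{bw:.1f} GB/s -> {buffer_gb / bw * 1000:.0f} ms per {buffer_gb:.0f} GB transfer")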
What was the last dual-gpu card which had the name ending with X2? GeForce 9800 GX2? That was released in....2008. 8 years ago. I think by now it's safe to use X2 for other things, no?
Well, since the 1080 Founder's Edition can only run Crysis 3 at ~35fps, I doubt it, considering the Titan X is supposed to be roughly ~25% faster for gaming applications:
Old account that only posts NVIDIA links, whose only comment is this one. No one was "actually speechless" about your sales pitch for a GPU slightly better than the last one.
I'm guessing that GP102 is actually a 3840 core part (50% bigger than GP104), but this initial release is being cut down to improve yields on the immature 16nm process.
They did something similar with the original GK110 Titans - the first Titan had 2688 cores enabled, and a fully unlocked 2880 core Titan Black came later.
Both my (Maxwell) Titan X's boost to 1480mhz in SLI (the one with the slightly better ASIC 94% "asic quality" boosts to 1515 if i manually overclock it in unlinked mode).
If the new Titan X is anything like the 1080 and can boost to around 2GHz, it would be a monster of a card. If it's really held back to only ~1500MHz, because of either the power limit or the GP102 ASIC simply not clocking as high as the GP104 and GP106 parts, it would be a pretty small improvement over the existing Titan X (probably around 15%).
No confirmed technical details yet AFAIK, but it's rumored to be 450-475 mm^2 versus the old 600mm^2. It's also very probably a die-harvest (cores disabled).
The chip is 50% bigger (12bn transistors vs. 8bn), I assume (there's no details on the card yet) the additional budget went into things like caches to give the cores an additional performance boost.
Yes, gen to gen was a 15-20% overall improvement (which is also questionable, because when a new gen arrives the previous gen all of a sudden starts losing performance in games; it was really bad between Kepler and Maxwell, where the 780/780ti suddenly lost 5-15% performance in certain games with the initial "Maxwell drivers").
Depends on which figures you look at. http://www.techspot.com/article/928-five-generations-nvidia-... has a reasonable comparison. Varies between 20-50% over the range of that sample. The increase in performance between the 980/1080 or the titans does not seem out of the "ordinary".
780ti > 980 = ~15% increase
780 > 780ti = ~12% increase
680 (arguably the 690 should've been the reference, even though it was a dual-GPU card) > 780 = ~27% increase.
For the most part, in generations where there wasn't a near ~2-year gap, and where there wasn't a huge change in GPU memory type or a major architecture change (like dropping Fermi's hardware scheduler for Kepler in order to save silicon space for things that actually mattered for games at the time), there isn't a 50% increase gen to gen.
A 15-20% increase gen to gen between cards at comparable price points (Nvidia has been making this harder lately by charging $150-200 more per card, effectively bumping each price point) is what you should be expecting.
I have not said the engineering work is automatic. It is not too far from the area that I work in, and I am up to date in the subject. The performance gains for this generation of cards are in line with the existing trend, they are not particularly exciting in that they rise above that trend. This is unrelated to the issue of whether or not the architectural work is hard, and it is somewhat sad that the difference between these two points has to be explicitly stated.
The in-line results of this generation do diverge from Nvidia's over-the-top marketing in the past year about the 10x increase in performance that would be delivered, largely due to the upgrade to HBM. It is certainly probable that had they delivered on HBM we would be seeing a rare jump in performance above the trend, but clearly this has not happened. It seems unlikely to arrive on the 1080 Ti after not making an appearance on the Titan, and so we will need to wait another generation to see what difference it actually makes.
If it's not too far from the area you work in you should be more understanding of the engineering challenges here.
We can't clock these things faster like we used to in the days of 6Mhz CPUs. We can't shrink feature size since 10nm is proving to be a difficult node to crack. We can't jam more features onto the die since we're already producing some of the largest possible dies.
The easy gains are gone. Now we're stuck dealing with the hard stuff. Gains will be slower.
As I said above, there are two separate issues: whether or not it is hard to deliver the same (relative) improvement each generation, and whether or not Nvidia has delivered an improvement in line with their normal trend or exceeded it.
These are not the same thing. As I said, I am well aware of the engineering challenges at this scale. You are trying to argue that it is hard: so what? Should Nvidia get a badge for "trying hard"? It still doesn't change the fact that what they have achieved this generation is the same (relative) improvement as previous generations. This is not what they were selling in the run-up to the 1080 launch, when they were still claiming 10x improvements due to the new memory subsystem.
They did not do what they said they would. That is quite a simple fact, so why argue that what they have done is hard? The difficulty is irrelevant. Do you think an evaluation of the performance they achieved should include how hard they worked on it?
They fit 12 billion transistors on their latest die; that's even more than a top-end Xeon. You can bitch all day from your comfortable armchair, but they're leading the industry in transistor counts, and that alone can't be easy. They're not silk-screening t-shirts here, they're pushing their process to the limits.
If they were falling behind in transistor counts you might have a case, but they're not.
So again you are arguing that they are trying really hard. What does that have to do with whether or not this generation is above the trend-line or not?
I'm arguing they're at the top of their game and they're up against a wall that is not easy to move.
It doesn't really matter if it's above or below the trend line. Maybe the trend line is bullshit now because all the things driving it were the easy gains we've exhausted.
[1] https://www.top500.org/list/2015/06/?page=5