This is probably going to be a highly parallel fixed-point / integer engine like TPU gen1. Fast matrix multiplication over really small (low-precision) integer types is something CPUs and GPUs do quite poorly, and that was the initial reasoning behind TPU gen1: improving runtime performance.
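To make the fixed-point angle concrete, here is a minimal NumPy sketch of a TPU-gen1-style integer matmul: 8-bit operands accumulated in 32-bit integers. The shapes and dtypes are illustrative assumptions, not anything Apple has announced.

    import numpy as np

    # Minimal sketch of a TPU-gen1-style quantized matmul:
    # 8-bit integer operands, 32-bit accumulation.
    # Shapes and dtypes are illustrative assumptions only.
    A = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
    B = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)

    # Widen before multiplying so the products and sums don't overflow;
    # a dedicated integer MAC array does this far more efficiently per
    # watt than a general-purpose CPU or a float-oriented GPU.
    C = A.astype(np.int32) @ B.astype(np.int32)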
One question is whether it will architecturally be closer to a GPU or an FPGA. The field moves so fast that it might make sense to "future-proof" a bit with a live-reconfigurable FPGA.
I'd bet on this being an ASIC: doing this on an FPGA with any seriously sized matrix would require a very expensive FPGA, whereas an ASIC would allow more gates in a smaller volume and would consume less power to boot.
From the manufacturer's point of view, phones not being future-proof is a feature, not a bug: that way you'll upgrade to the shiny new item, which keeps the profits rolling in.
> From the manufacturer's point of view, phones not being future-proof is a feature, not a bug: that way you'll upgrade to the shiny new item, which keeps the profits rolling in.
I don't think a better AI chip will be a convincing argument to replace a phone only one year later.
Not if you put it that way. Apple can simply make new AI features exclusive to newer phones with updated versions of the chips. If they open the chip up to developers, this effect can spill over to the App Store. That would then provide the impetus to push consumers to upgrade.
Aren't all the iPhone chips ASICs, with the main one custom-designed by the hardware people they acquired? That seems to be the default expectation for whatever they add next. They sure as hell have the money, too. :)
DSPs are not nearly as good at matrix math as a GPU, and the phone already has one of those. DSPs are typically good at signal processing in a fairly limited domain; they would not be calling it an AI chip if it were just a DSP.
You don't put an FPGA in a device you're going to sell 200M+ of. The cost per unit would be way higher than an ASIC, and you're just going to come out with a better version next year anyway, so why bother?
I foresee it as similar to their M series co-processors - the first one was pretty basic, and more sensors and jobs have been given to the newer ones each year.
I think a lot of people in this thread are making incorrect assumptions about FPGA implementations of neural network applications.
(1) Forward networks are constant multiplications, i.e. fixed shift-and-add. FPGAs are a very nearly optimal architecture for programmable constant shift-and-add (see the sketch after this list).
(2) Individual neurons in a network can be bitwidth-optimized and Huffman-encoded for bit-serial execution (also sketched below). FPGAs are a very nearly optimal architecture for variable bit-width operations in a bit-serial architecture with programmable Huffman decoders. [edited: Huffman encoding, not Hamming]
(3) Running a forward network requires multiple channels of parallel memory with separate but deterministic access patterns. Most FPGA architectures are designed with onboard RAM specifically to be used this way.
(4) FPGA architectures can be designed to be inherently fault- and defect-tolerant, like GPUs disabling cores, but with finer granularity. Especially if the compilation is done in the cloud, the particular defect/yield profile can be stored for placement optimization.
(5) Anything optimized for ASIC design will necessarily be so close to an FPGA that it may as well benefit from the existing programmable-logic ecosystem and be flexibly optimized for a particular trained network. You can't just tape out an ASIC for every trained network, but based on my previous points, you most likely can optimize the logic for a specific forward network to run on an FPGA better than any ASIC designed to run arbitrary networks.
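To make points (1) and (2) concrete, here are two toy Python sketches. The weight value 10 and the 4-bit operand width are arbitrary examples, not anything specific to a real design. First, a frozen, trained weight turns a multiply into pure shift-and-add, which is exactly the kind of structure FPGA fabric (LUTs plus carry chains) handles well:

    def shift_add_terms(weight: int):
        # Decompose a fixed non-negative weight into shift amounts,
        # so that weight * x == sum(x << s for s in terms).
        return [i for i in range(weight.bit_length()) if (weight >> i) & 1]

    def mul_by_const(x: int, weight: int) -> int:
        # A frozen, trained weight becomes pure shift-and-add;
        # no general-purpose multiplier is needed.
        return sum(x << s for s in shift_add_terms(weight))

    assert mul_by_const(7, 10) == 70   # 10 = 0b1010, so x*10 = (x << 3) + (x << 1)

And a bit-serial multiply, where the operand is fed in one bit per cycle so the datapath is just an adder and the operand width can vary per neuron:

    def bit_serial_mul(x: int, weight: int, x_bits: int) -> int:
        # Feed x one bit per "cycle", LSB first; each set bit adds a
        # shifted copy of the weight. x_bits is the (possibly
        # neuron-specific) operand width from point (2).
        acc = 0
        for cycle in range(x_bits):
            if (x >> cycle) & 1:
                acc += weight << cycle
        return acc

    assert bit_serial_mul(7, 10, x_bits=4) == 70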
There is a tiny one in the iPhone 7. But that's for flexibility on current tasks, not future-proofing.
In terms of AI, there is little reason to run it on the phone unless it's heavily used or needs to be low latency. Consider: if they add $100 of computing power to a phone that sits unused 99% of the time, they could instead build a server from those same $100 worth of parts that serves 100 phones, saving $90+ per phone even after upkeep etc. (rough numbers sketched below).
PS: This is the same logic behind why the Amazon Echo is so cheap; it simply does not need much hardware.
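A back-of-the-envelope version of that argument, using only the numbers asserted above (the $100, the 99% idle time, and the savings figure are the comment's assumptions, not real cost data):

    on_device_cost = 100.0               # $ of dedicated AI compute per phone (assumed)
    utilization = 0.01                   # the hardware is busy ~1% of the time (assumed)
    phones_per_server = 1 / utilization  # ~100 phones can time-share one server
    shared_cost_per_phone = on_device_cost / phones_per_server  # ~$1
    savings_per_phone = on_device_cost - shared_cost_per_phone  # ~$99, before upkeep
    print(phones_per_server, shared_cost_per_phone, savings_per_phone)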
It could be both. Perhaps Apple concluded that 1) they're subpar with cloud services and will have difficulty competing, 2) there's a growing need/demand for more privacy and less 'cloud', and 3) Apple's products are already, on the whole, recommended when it comes to privacy.
And based on that they figured privacy was a good thing to aim for. Play to their strengths and differentiate based on that.
Facebook has Caffe2Go. Apple is working on this (and already has bindings optimized to use the ARM vector unit for DNN evaluation).
Running on device, if it can be done with reasonable power, is a win for everyone. Better privacy, better latency, and more robust operation in the face of intermittent connectivity.
It'll be an ASIC, so more GPU than FPGA. The real reason to upgrade the chip would be to add more transistors rather than any real instruction-set upgrade, so an FPGA doesn't really get Apple anything other than cost and wasted space.
FPGAs are pretty bad space-wise and power-wise compared to straight up ASICs. Apple could make some blocks highly configurable, but even an FPGA designer wouldn't use FPGA fabric to do multiplication if they cared about performance. FPGAs are a mix of general purpose logic blocks (the fabric) and dedicated blocks like multipliers, dividers, PLLs, memory, serializers and deserializers, etc.
That's what I'm thinking: some sort of configurable, FPGA-like fabric around a bulk of TPUv1-style cores, maybe for routing outputs around so you can do some nice pipelining like you might want with CV on video.
I don't think space is an issue, but an ASIC designed exactly for a workload will always beat an FPGA on power. However, if you don't know the workload exactly, or don't have the money to fab an ASIC, an FPGA will be superior whenever the workload is a bad fit for CPUs or GPUs. So saving 2-10x power on some unknown future ML workload might be preferable to saving 10-20x on a fixed workload with a fixed-point ASIC.
E.g. Bitcoin mining went GPU -> FPGA -> ASIC, each step requiring more design investment but delivering higher overall performance in hashes/W. But that workload is known exactly.
I doubt they'll do an FPGA. Devices are too concerned with battery life to be running that, plus their margins would suffer or it'd be even more expensive.