
Memory != Speed

You could throw a TB of memory into something and it wouldn't get any faster or be of any use for 99.99% of use cases.

Large ML architectures don't need more memory; they need distributed processing. Ignoring memory requirements entirely, GPT-3 would take hundreds of years to train on a single high-end GPU (say a desktop 3090, which is >10x faster than the M1), which is why these models aren't trained that way (and why Nvidia has its offerings set up the way it does).
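
For a sense of scale, here's a rough back-of-the-envelope sketch in Python. The ~3.14e23-FLOP figure for GPT-3's total training compute and the ~35 TFLOPS of sustained throughput for a 3090 are my own assumptions, not numbers from this thread:

    # Back-of-the-envelope: single-GPU GPT-3 training time.
    # Assumed: ~3.14e23 total training FLOPs (commonly cited for GPT-3)
    # and ~35 TFLOPS sustained mixed-precision throughput on a 3090.
    total_train_flops = 3.14e23
    sustained_flops_per_sec = 35e12  # effective throughput, not peak

    seconds = total_train_flops / sustained_flops_per_sec
    years = seconds / (3600 * 24 * 365)
    print(f"{years:.0f} years")  # roughly 285 years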

>That's getting close to "doing the training for GPT-3 — a 350GB model before optimization — on a single computer" territory.

Not even close... not by a mile. That isn't how it works. The unified memory is cool, but its utility is massively bottlenecked by the single CPU/GPU it is attached to.



I don't disagree that there are many use cases for which more memory has diminishing returns. But I would disagree that those encompass 99.99% of use cases. Not all problems are embarrassingly-parallel. In fact, most problems aren't embarrassingly parallel.

It's just that we mostly use GPUs for embarrassingly parallel problems, because that's mostly what they're good at, and humans aren't clever enough by half to come up with every possible way to map MIMD problems (e.g. graph search) onto their SIMD equivalents (e.g. matrix multiplication, à la PageRank's eigenvector calculation).
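
As a concrete (if toy) illustration of that mapping, here's PageRank's graph walk recast as repeated matrix-vector products (power iteration) in Python; the 4-node adjacency matrix and damping factor are made up for the example:

    import numpy as np

    # A MIMD-looking graph problem turned into SIMD-friendly linear algebra:
    # power iteration toward PageRank's dominant eigenvector.
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    M = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    d, n = 0.85, A.shape[0]
    rank = np.full(n, 1.0 / n)

    for _ in range(100):  # each step is just a matrix-vector product
        rank = (1 - d) / n + d * M.T @ rank

    print(rank / rank.sum())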

The M1 Max isn't the absolute best GPU for doing the things GPUs already do well. But its GPU is a much better "connection machine" than e.g. the Xeon Phi ever was. It's a (weak) TPU in a laptop. (And likely the Mac Pro variant will be a true TPU.)

Having a cheap, fast-ish GPU with that much memory opens up use-cases for which current GPUs aren't suited. In those use-cases, this chip will "run circles around" current GPUs. (Mostly because current GPUs wouldn't be able to run those workloads at any speed.)

Just one fun example of a use-case that has been obvious for years, yet has been mostly moot until now: there are database engines that run on GPUs. For parallelizable table-scan queries, they're ~100x faster than even in-memory databases like memSQL. But guess where all the data needs to be loaded for those GPU DB engines to do their work?
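
As a minimal sketch of what such a GPU table scan looks like (using CuPy; the column names, row count, and query are hypothetical, and this assumes a CUDA GPU with enough memory to hold the columns):

    import cupy as cp

    # Toy columnar table resident entirely in GPU memory.
    n_rows = 100_000_000
    price = cp.random.uniform(0, 500, n_rows, dtype=cp.float32)
    quantity = cp.random.randint(1, 10, n_rows)

    # SELECT SUM(price * quantity) WHERE price > 100
    mask = price > 100.0
    revenue = float(cp.sum(price[mask] * quantity[mask]))
    print(revenue)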

You'd never waste $150k on an A100 just to host an 80GB database. For that price, you could rent 100 regular servers and set them up as memSQL shards. But if you could get a GPU-parallel-scannable 64GB DB [without a memory-bandwidth bottleneck] for $4000? Now we're talking. For the cost of one A100, you get a cluster of ~37 64GB M1 Max MBPs — that's 2.3TB of addressable VRAM. That's enough to start doing real-time OLAP aggregations on some Big-Ish Data. (And that's with the ridiculous price overhead of paying for a whole laptop just to use its SoC. If integrators could buy these chips standalone, that'd probably knock the pricing down by another order of magnitude.)
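
Spelling out that arithmetic (the $150k and $4k price points are the assumptions from the paragraph above, not quotes):

    # Rough cost/capacity comparison from the comment above.
    a100_system_cost = 150_000
    mbp_cost, mbp_unified_gb = 4_000, 64

    n_laptops = a100_system_cost // mbp_cost  # 37
    total_gb = n_laptops * mbp_unified_gb     # 2368 GB, i.e. ~2.3 TB
    print(n_laptops, total_gb)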


Again, there is a huge memory-bandwidth bottleneck. It's DDR versus GDDR and HBM. It's not even close. The M1 will be slower.


Cerebras is a thing.


And at least an order of magnitude more expensive than an A100, if not two orders


I think you put an extra zero in the A100 price.


> I don't disagree that there are many use cases for which more memory has diminishing returns. But I would disagree that those encompass 99.99% of use cases. Not all problems are embarrassingly-parallel. In fact, most problems aren't embarrassingly parallel.

Mindlessly throwing more memory at a problem does hit diminishing returns in 99.99% of use cases, because the extra memory will inflict a very large number of TLB misses during page-fault processing or context switching, which slows memory access down substantially, unless:

1) the TLB size in each of the L1/L2/… caches is increased; AND

2) the page size is increased, or the page size can be configured in the CPU.

Earlier versions of MIPS CPUs had a software-managed, very small TLB and were notorious for slow memory access. Starting with the A14, Apple has increased an already massive TLB, on top of having increased the page size from 4kB to 16kB:

«The L1 TLB has been doubled from 128 pages to 256 pages, and the L2 TLB goes up from 2048 pages to 3072 pages. On today’s iPhones this is an absolutely overkill change as the page size is 16KB, which means that the L2 TLB covers 48MB which is well beyond the cache capacity of even the A14» [0].

It would be interesting to find out whether the TLB size is even larger in the M1 Pro/Max CPUs.
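
Working through the quoted figures (a quick sketch; the numbers come from the AnandTech quote above, with 4kB shown for contrast as the conventional page size):

    # TLB reach = number of TLB entries x page size.
    l2_tlb_entries = 3072
    page_16k = 16 * 1024
    page_4k = 4 * 1024

    print(l2_tlb_entries * page_16k / 2**20)  # 48.0 MB of coverage, as quoted
    print(l2_tlb_entries * page_4k / 2**20)   # only 12.0 MB with 4kB pages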

[0] https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


I think we're losing perspective here: Apple is not in the business of selling chips, but rather in the business of selling laptops to professionals who would never even need what you describe.



