
It's my strong opinion that the Von Neumann architecture is great for general-purpose problem solving. However, for computing LLMs and similar fully known execution plans, it would be far better to decompose them into a fully parallel and pipelined graph to be executed on a reconfigurable computing mesh.

My particular hobby horse is the BitGrid, a systolic array of 4x4-bit look-up tables clocked in two phases to eliminate race conditions.

My current estimate is that it could save 95% of the energy for a given computation.

Getting rid of RAM and only moving data between adjacent cells lets you really jack up the clock rate, because you're not driving signals across the die.
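
For anyone who wants to poke at the idea, here's a minimal host-side sketch of how such a grid could be simulated. The cell geometry (4 one-bit inputs, 4 one-bit outputs, one 16-entry LUT per output, checkerboard phases) is my reading of the scheme, not a spec:

    // Minimal host-side sketch of a BitGrid-style cell array. My reading of the
    // idea, not a spec: each cell has 4 one-bit inputs (one per neighbour) and
    // 4 one-bit outputs, each output driven by its own 16-entry LUT over the
    // inputs. Cells alternate phases like a checkerboard, so a cell only ever
    // reads neighbours that are holding their outputs steady.
    #include <array>
    #include <cstdint>
    #include <vector>

    struct Cell {
        std::array<uint16_t, 4> lut{};  // one 16-bit truth table per output bit
        uint8_t out = 0;                // 4 output bits packed into the low nibble
    };

    struct BitGrid {
        int w, h;
        std::vector<Cell> cells;
        BitGrid(int w, int h) : w(w), h(h), cells(w * h) {}
        Cell& at(int x, int y) { return cells[y * w + x]; }

        // One half-clock: only cells whose (x+y) parity matches `phase` latch new
        // outputs; their neighbours are all in the other phase, so the inputs they
        // read are stable and there are no race conditions by construction.
        void half_clock(int phase) {
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x) {
                    if (((x + y) & 1) != phase) continue;
                    uint8_t in = 0;  // gather the 4 neighbour bits (up, right, down, left)
                    if (y > 0)     in |= ((at(x, y - 1).out >> 2) & 1u) << 0;
                    if (x < w - 1) in |= ((at(x + 1, y).out >> 3) & 1u) << 1;
                    if (y < h - 1) in |= ((at(x, y + 1).out >> 0) & 1u) << 2;
                    if (x > 0)     in |= ((at(x - 1, y).out >> 1) & 1u) << 3;
                    Cell& c = at(x, y);
                    uint8_t o = 0;
                    for (int b = 0; b < 4; ++b)  // each output bit looks up its own table
                        o |= ((c.lut[b] >> in) & 1u) << b;
                    c.out = o;
                }
        }
    };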



For sure. The issue is that many AI workloads require terabytes per second of memory bandwidth and are on the cutting edge of memory technologies. As long as you can get away with little memory usage, you can have massive savings; see Bitcoin ASICs.

The great thing about the Von Neumann architecture is that it's flexible enough to have all sorts of operations added to it, including specialized matrix multiplication operations and async memory transfers. So I think it's here to stay.
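
As one hedged illustration (my example, not the parent's) of how that kind of added capability shows up to programmers today, here's an async copy on one CUDA stream overlapping a kernel on another:

    // Sketch: a dedicated copy engine moving data asynchronously on one CUDA
    // stream while a kernel computes on another. One example of the
    // "bolt more operations on" flexibility.
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_a = nullptr, *d_a = nullptr, *d_b = nullptr;
        cudaMallocHost(&h_a, bytes);   // pinned host memory so the copy can be truly async
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMemset(d_b, 0, bytes);

        cudaStream_t copy_s, compute_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&compute_s);

        // The next batch streams in on one stream while the current batch computes on another.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, copy_s);
        scale<<<(n + 255) / 256, 256, 0, compute_s>>>(d_b, n);

        cudaStreamSynchronize(copy_s);
        cudaStreamSynchronize(compute_s);
        return 0;
    }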


The reason they need terabytes per second is that they're constantly loading weights and data into, and then back out of, multiply-accumulate chips.

If you program hardware to just do the multiply/accumulate with the weights built in, you can reduce the required bandwidth to just putting data into a "layer" and getting the results out, which is a much, MUCH smaller amount of data in most cases.
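
Back-of-envelope, with sizes that are purely illustrative assumptions on my part (a 4096x4096 fully connected layer, fp16, one token at a time), the gap looks like this:

    // Back-of-envelope, host-side C++. All sizes are illustrative assumptions
    // (a 4096 -> 4096 fully connected layer, fp16 weights and activations,
    // one token at a time), not measurements of any real chip.
    #include <cstdio>

    int main() {
        const double d_in = 4096, d_out = 4096;  // assumed layer dimensions
        const double bytes = 2;                  // fp16

        // Von Neumann style: the weights stream in from DRAM/HBM on every pass.
        const double weight_traffic = d_in * d_out * bytes;        // ~33.6 MB per token
        // Weights baked into the fabric: only layer inputs/outputs cross the boundary.
        const double activation_traffic = (d_in + d_out) * bytes;  // ~16 KB per token

        std::printf("streamed weights : %.1f MB per token\n", weight_traffic / 1e6);
        std::printf("layer I/O only   : %.1f KB per token\n", activation_traffic / 1e3);
        std::printf("ratio            : ~%.0fx less data moved\n",
                    weight_traffic / activation_traffic);
        // Caveat: large batches amortize the streamed weights, which is a big
        // part of why GPUs batch so aggressively.
        return 0;
    }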


Doesn't this presume that you can fit the model's weights into the SRAMs of a single chip, or of multiple chips that you plan to connect together? A big reason HBM is a thing is that it's much, much denser than SRAM.


That sounds like an FPGA, but aren't those notoriously slow for most uses? You can make a kickass signal pipeline or maybe an emulator, but most code gets shunted to an attached ARM core, or even a soft core implemented on top that wastes an order of magnitude of performance.

And no architecture is clock-limited by driving signals across the die. Why do people keep assuming this? CPU designers are very smart, and they break slow paths into multiple pipeline stages.


> That sounds like an FPGA, but aren't those notoriously slow for most uses?

If you're trying to run software on them, yes. If you use them for their stated purpose (as an array of gates), then they can be orders of magnitude faster than the equivalent computer program. It's all about using the right tool for the right job.


occam (MIMD) is an improvement over CUDA (SIMD)

with latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):

  -  8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
  - 40000 CPUs in single reticle (80 TIPS)
  - 4.5M CPUs per 300mm wafer (10 PIPS)
the transputer async link (and C001 switch) allows for decoupled clocking, CPU level redundancy and agricultural interconnect

heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory
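
Unpacking the TIPS figures above (just re-deriving the parent's numbers; the per-core rate is the implied assumption, versus roughly 10 MIPS for the original T800 at 20 MHz):

    // Re-deriving the parent's figures; the ~2 GIPS/core rate is the implied
    // assumption (the original T800 managed roughly 10 MIPS at 20 MHz).
    #include <cstdio>

    int main() {
        const double gips_per_core = 2.0;  // implied: a simple single-issue core at ~2 GHz
        const double cores[] = {8e3, 40e3, 4.5e6};
        const char* where[]  = {"M2-sized die", "single reticle", "300 mm wafer"};
        for (int i = 0; i < 3; ++i)
            std::printf("%-15s %9.0f cores -> %7.0f TIPS\n",
                        where[i], cores[i], cores[i] * gips_per_core / 1e3);
        // Prints 16, 80, and 9000 TIPS; the parent's 10 PIPS figure implies
        // closer to 2.2 GIPS per core.
        return 0;
    }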


Nitpick here, but ...

I think CUDA shouldn't be labeled as SIMD but SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all of the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that if statements actually execute different code for the active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. Your view might also be that the same instructions operate on different data, but the fork-and-join nature behaves very differently.
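
Concretely, this is the kind of branch that just works under SIMT (a hypothetical kernel, purely to illustrate divergence); the hardware masks and serializes the two paths, where classic SIMD would have you compute both sides and blend with a mask by hand:

    // Illustrative CUDA kernel (hypothetical), leaky-ReLU style: threads in the
    // same warp take different sides of the branch. The hardware masks and
    // serializes the two paths; on a classic SIMD machine you would compute
    // both sides yourself and blend them with a mask.
    __global__ void relu_or_scale(const float* in, float* out, float alpha, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float x = in[tid];
        if (x > 0.0f)
            out[tid] = x;          // "active" lanes execute this stream of instructions
        else
            out[tid] = alpha * x;  // the other lanes execute this one instead
    }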

I enjoyed your other point about the comparisons of machines, though.


Really curious why you think programming a vector machine is so painful? In terms of what? And what exactly do you mean by a "true Vector machine"?

My experience with RVV (I am aware of vector architecture history, just using it as an example) so far indicates that while it is not the greatest thing, it is not that bad either. You play with what you have!!

Yes, compared to regular SIMD, it is a step up in complexity, but nothing a competent SIMD programmer cannot reorient to. Designing a performant hardware CPU is another matter though; there are a lot of (micro)architectural choices and tradeoffs that can impact performance significantly.


No, it's in terms of flexibility of programming: SIMD is less flexible if you need any decision making. In SIMD you are also typically programming in intrinsics, not at a higher level like in CUDA.

For example, I can do

    for (int tid = 0; tid < n; tid += num_threads) { C[tid] = A[tid] * B[tid] + D[tid]; }
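
Spelled out as a complete kernel (a hypothetical example; in the usual grid-stride idiom tid starts from the global thread index rather than 0), that's:

    // The loop above as a complete grid-stride CUDA kernel (hypothetical
    // example, not from any particular codebase). Each thread starts at its
    // global index and hops forward by the total thread count.
    __global__ void fma_kernel(const float* A, const float* B, const float* D,
                               float* C, int n) {
        int num_threads = gridDim.x * blockDim.x;
        for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += num_threads)
            C[tid] = A[tid] * B[tid] + D[tid];
    }
    // launch: fma_kernel<<<num_blocks, threads_per_block>>>(A, B, D, C, n);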

In SIMD, yes, I can stride the array 32 elements (or, in RVV, vl elements) at a time, but generally speaking, as new architectures come along, I need to rewrite that loop for the wider add and multiply instructions, increase the lane width, etc. But in CUDA or other GPU SIMT strategies, I just need to bump the compiler and maybe change one num_threads variable, and it will be vectorizing correctly.
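
For contrast, here's roughly what that same loop looks like in AVX2 intrinsics. This is only a sketch that assumes fp32 data and n divisible by 8 (a real version needs a scalar tail loop), and porting it to AVX-512, NEON, or RVV means touching every line:

    // Roughly the same loop in AVX2 intrinsics. Sketch only: assumes fp32 data
    // and n divisible by 8 (a real version needs a scalar tail loop), and
    // porting it to AVX-512, NEON, or RVV means touching every line.
    #include <immintrin.h>

    void fma_avx2(const float* A, const float* B, const float* D, float* C, int n) {
        for (int i = 0; i < n; i += 8) {  // 8 fp32 lanes per 256-bit register
            __m256 a = _mm256_loadu_ps(A + i);
            __m256 b = _mm256_loadu_ps(B + i);
            __m256 d = _mm256_loadu_ps(D + i);
            _mm256_storeu_ps(C + i, _mm256_fmadd_ps(a, b, d));  // a*b + d
        }
    }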

Even with things like RVV, which I am actually pushing my SIMD machine to move toward, these problems exist, because it's really hard to write length-agnostic code in SIMD intrinsics. That said, there is a major benefit in terms of performance per watt. All that SIMT flexibility costs power; that's why Nvidia GPUs can burn a hole through the floor, while the majority of phones have a series of vector SIMD machines constantly computing matrix and FFT operations with your pocket becoming only slightly warmer.



