Top researchers leave Intel to build startup with 'the biggest, baddest CPU'

zackmorris · 2025-06-06T18:36:40 1749235000

I hope they design, build and sell a true 256-1024+ multicore CPU with local memories that appears as an ordinary desktop computer with a unified memory space for under $1000.

I've written about it at length and I'm sure that anyone who's seen my comments is sick of me sounding like a broken record. But there's truly a vast realm of uncharted territory there. I believe that transputers and reprogrammable logic chips like FPGAs failed because we didn't have languages like Erlang/Go and GNU Octave/MATLAB to orchestrate a large number of processes or handle SIMD/MIMD simultaneously. Modern techniques like passing by value via copy-on-write (used by UNIX forking, PHP arrays and Clojure state) were suppressed when mainstream imperative languages using pointers and references captured the market. And it's really hard to beat Amdahl's law when we're worried about side effects. I think that anxiety is what inspired Rust, but there are so many easier ways of avoiding those problems in the first place.

jiggawatts · 2025-06-06T22:32:22 1749249142

Check out the Azure HBv5 servers.

High bandwidth memory on-package with 352 AMD Zen 4 cores!

With 7 TB/s memory bandwidth, it’s basically an x86 GPU.

This is the future of high performance computing. It used to be available only for supercomputers but it’s trickling down to cloud VMs you can rent for reasonable money. Eventually it’ll be standard for workstations under your desk.

nullc · 2025-06-07T05:57:16 1749275836

it's kind of concerning that it's only available as a hosted product. Not good news for anyone that needs to run on-prem for confidentiality or availability reasons.

jiggawatts · 2025-06-08T01:34:52 1749346492

I’m guessing the unit price is about $100K or higher, and it’s a niche product typically only used for HPC. If you’re a big enough customer, I’m sure you could buy these but I suspect the minimum order size is measured in the thousands of hosts.

zackmorris · 2025-06-09T17:40:26 1749490826

Thanks to you and the other commenters!

-

I just want to leave this breadcrumb showing possible markets and applications for high-performance computing (HPC), specifically regarding SpiNNaker which is simulating neural nets (NNs) as processes communicating via spike trains rather than matrices performing gradient descent:

https://news.ycombinator.com/item?id=44201812 (Sandia turns on brain-like storage-free supercomputer)

https://blocksandfiles.com/2025/06/06/sandia-turns-on-brain-... (working implementation of 175,000 cores)

https://www.theregister.com/2017/10/19/steve_furber_arm_brai... (towards 1 million+ cores)

https://www.youtube.com/watch?v=z1_gE_ugEgE (518,400 cores as of 2016)

https://arxiv.org/pdf/1911.02385 (towards 10 million+ cores)

https://docs.hpc.gwdg.de/services/neuromorphic-computing/spi... (HPC programming model)

I'd use a similar approach but probably add custom memory controllers that calculate hashes for a unified content-addressable memory, so that arbitrary network topologies can be used. That way the computer could be expanded as necessary and run over the internet without modification. I'd also write something like a microkernel to expose the cores and memory as a unified desktop computing environment, then write the Python HPC programming model over that and make it optional. Then users could orchestrate the bare metal however they wish with containers, forked processes, etc.

-

A possible threat to the HPC market would be to emulate MIMD under SIMD by breaking ordinary imperative machine code up into parallelizable immutable (functional) sections bordered by IO handled by some kind of monadic or one-shot logic that prepares inputs and obtains outputs between the functional portions. That way individual neurons, agents for genetic algorithms, etc could be written in C-style or Lisp-style code that's transpiled to run on SIMD GPUs. This is an open problem that I'm having trouble finding published papers for:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4611137 (has PDF preview and download)

Without code examples, I'd estimate MIMD->SIMD performance to be between 1-2 orders of magnitude faster than a single-threaded CPU and 1-2 orders of magnitude slower than a GPU. Similar to scripting languages vs native code. My spidey sense is picking up so many code smells around this approach though that I suspect it may never be viable.

-

I'd compare the current complexities around LLMs running on SIMD GPUs to trying to implement business logic as a spaghetti of state machines instead of coroutines running conditional logic and higher-order methods via message passing. Loosely that means that LLMs will have trouble evolving and programming their own learning models. Whereas HPC doesn't have those limitations, because potentially every neuron can learn and evolve on its own like in the real world.

So a possible bridge between MIMD and SIMD would be to transpile CPU machine code coroutines to GPU shader state machines:

https://news.ycombinator.com/item?id=18704547

https://eli.thegreenplace.net/2009/08/29/co-routines-as-an-a...

In the end, they're equivalent. But a multi-page LLM specification could be reduced down to a bunch of one-liners because we can reason about coroutines at a higher level of abstraction than state machines.

zozbot234 · 2025-06-06T18:43:53 1749235433

If you have 256-1024+ multicore CPUs they will probably have a fake unified memory space that's really a lot more like NUMA underneath. Not too different from how GPU compute works under the hood. And it would let you write seamless parallel code by just using Rust.

david-gpu · 2025-06-07T20:07:58 1749326878

The challenges that arise when you have a massively parallel system are well understood by now. It is hard to keep all processing units doing something useful rather than waiting for memory or other processing units.

Once you follow the logical steps to increase utilization/efficiency you end up with something like a GPU, and that comes with the programming challenges that we have today.

In other words, it's not like CPU architects didn't think of that. Instead, there are good reasons for the status quo.

anthk · 2025-06-06T23:04:44 1749251084

Forth can.

johnklos · 2025-06-06T18:07:01 1749233221

One of the biggest problems with CPUs is legacy. Tie yourself to any legacy, and now you're spending millions of transistors to make sure some way that made sense ages ago still works.

Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight core Ryzen 9700X has around 12 billion. The difference in clock speed is roughly 80 times, and the difference in number of transistors is 1,250 times.

These are wild generalizations, but let's ask ourselves: If a Ryzen takes 1,250 times the transistor for one core, does one core run 1,250 times (even taking hyperthreading in to account) faster than an i80486 at the same clock? 500 times? 100 times?

It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.

So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions in to the chip?

Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.

AnthonyMouse · 2025-06-06T18:41:34 1749235294

Modern CPUs don't actually execute the legacy instructions, they execute core-native instructions and have a piece of silicon dedicated to translating the legacy instructions into them. That piece of silicon isn't that big. Modern CPUs use more transistors because transistors are a lot cheaper now, e.g. the i486 had 8KiB of cache, the Ryzen 9700X has >40MiB. The extra transistors don't make it linearly faster but they make it faster enough to be worth it when transistors are cheap.

Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.

Sohcahtoa82 · 2025-06-06T18:36:41 1749235001

> If a Ryzen takes 1,250 times the transistor for one core, does one core run 1,250 times (even taking hyperthreading in to account) faster than an i80486 at the same clock? 500 times? 100 times?

Would be interesting to see a benchmark on this.

If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU will perform out-of-order execution with some instructions even run in parallel, even in single-core and single-threaded execution, not to mention superior branch prediction and more cache.

If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.

> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.

I doubt you'd get significantly more performance, though you'd likely gain power efficiency.

Half of what you described in your hypothetical instruction set are already implemented in ARM.

ahartmetz · 2025-06-07T11:31:53 1749295913

A Ryzen is muuuuch more than 10-15x faster than a 486, and AVX et al do diddly squat for a lot of general-purpose code.

Clock speed is about 50x and IPC, let's say, 5-20x. So it's roughly 500x faster.

Sohcahtoa82 · 2025-06-08T05:33:56 1749360836

I meant a comparison on a clock-for-clock level. In other words, imagine either the 486 running at the clock speed of a Ryzen, or the Ryzen running at the clock speed of the 486. In other other words, compare ONLY IPC.

The line I was commenting on said:

> If a Ryzen takes 1,250 times the transistor for one core, does one core run 1,250 times (even taking hyperthreading in to account) faster than an i80486 at the same clock?

Emphasis added by me.

layla5alive · 2025-06-06T22:39:44 1749249584

In terms of FLOPS, Ryzen is ~1,000,000 times faster than a 486.

For serial branchy code, it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code and that you can't linearly improve serial execution with architecture and transistor counts (you can sublinearly improve it), but rather with Denard scaling.

It is worth noting, though, that purely via Denard scaling, Ryzen is already >100x faster, though! And via architecture (those transistors) it is several multiples beyond that.

In general compute, if you could clock it down at 33 or 66MHz, a Ryzen would be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in a single serial program that a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.

The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.

Szpadel · 2025-06-06T20:12:43 1749240763

that's exactly why Intel proposed x86S

that's basically x86 without 16 and 32 bit support, no real mode etc.

CPU starts initialized in 64bit without all that legacy crap.

that's IMO great idea. I think every few decades we need to stop and think again about what works best and take fresh start or drop some legacy unused features.

risc v have only mandatory basic set of instructions, as little as possible to be Turing complete and everything else is extension that can be (theoretically) removed in the future.

this also could be used to remove legacy parts without disrupting architecture

kvemkon · 2025-06-06T18:37:48 1749235068

Would be interesting to compare transistor count without L3 (and perhaps L2) cache.

16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.

It is weird, that the best consumer GPU can 4 TFLOPS. Some years ago GPUs were an order of magnitude and more faster than CPUs. Today GPUs are likely to be artificially limited.

kvemkon · 2025-06-06T18:49:22 1749235762

E.g. AMD Radeon PRO VII with 13.23 billion transistors achieves 6.5 TFLOPS FP64 in 2020 [1].

[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575

zozbot234 · 2025-06-06T19:00:15 1749236415

> 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.

These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even if memory bandwidth is not an issue you'll have to worry about thermals. Theoretical CPU compute ceiling is almost never the real bottleneck. GPU's have a very different architecture with higher memory bandwidth and running their chips a lot slower and cooler (lower clock frequency) so they can reach much higher numbers in practical scenarios.

kvemkon · 2025-06-06T20:02:43 1749240163

Sure, not for BLAS Level 1 and 2 operations. But not even for Level 3?

layla5alive · 2025-06-06T22:19:14 1749248354

Huh, consumer GPUs are doing Petaflops of floating point. FP64 isn't a useful comparison because FP64 is nerfed on consumer GPUs.

kvemkon · 2025-06-06T22:32:50 1749249170

Even recent nVidia 5090 has 104.75 TFLOPS FP32.

It's useful comparison in terms of achievable performance per transistor count.

saati · 2025-06-06T20:20:35 1749241235

But history showed exactly the opposite, if you don't have an already existing software ecosystem you are dead, the transistors for implementing x86 peculiarities are very much worth it if people in the market want x86.

colechristensen · 2025-06-06T18:28:33 1749234513

GPUs scaled wide with a similar number of transistors to a 486 and just lots more cores, thousands to tens of thousands of cores averaging out to maybe 5 million transistors per core.

CPUs scaled tall with specialized instruction to make the single thread go faster, no the amount done per transistor does not scale anywhere near linearly, very many of the transistors are dark on any given cycle compared to a much simpler core that will have much higher utilization.

zozbot234 · 2025-06-06T18:30:46 1749234646

> Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions in to the chip?

I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.

For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.

AtlasBarfed · 2025-06-06T23:58:08 1749254288

If you have a very large CPU count, then I think you can dedicate a CPU to only process a given designated privacy/security focused execution thread. Especially for a specially designed syscall, perhaps

That kind of takes the specter meltdown thing out of the way to some degree I would think, although privilege elevation can happen in the darndest places.

But maybe I'm being too optimistic

hnaccount_rng · 2025-06-07T06:20:58 1749277258

Isn't the problem the "labelling" of "privacy-sensitive" in the first place?

dist-epoch · 2025-06-06T18:35:07 1749234907

If you look at a Zen5 die shots half of the space is taken by L3 cache.

And from each individual core:

- 25% per core L1/L2 cache

- 25% vector stuff (SSE, AVX, ...)

- from the remaining 50% only about 20% is doing instruction decoding

https://www.techpowerup.com/img/AFnVIoGFWSCE6YXO.jpg

zozbot234 · 2025-06-06T18:39:32 1749235172

The real issue with complex insn decoding is that it's hard to make the decode stage wider and at some point this will limit the usefulness of a bigger chip. For instance, AArch64 chips tend to have wider decode than their close x86_64 equivalents.

epx · 2025-06-06T18:40:35 1749235235

Aren't 99,99999% of these transistors used in cache?

nomel · 2025-06-06T21:19:30 1749244770

Look up "CPU die diagram". You'll see the physical layout of the CPU with annotated blocks.

Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...

So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)

PopePompus · 2025-06-06T18:53:16 1749235996

Gosh no. Often a majority of the transistors are used in cache, but not 99%.

Const-me · 2025-06-07T17:15:18 1749316518

CPUs can’t do that, but legacy is irrelevant. They just don’t have enough parallelism to leverage all these extra transistors. Let’s compare the 486 with a modern GPU.

Intel 80486 with 1.2M transistors delivered 0.128 flops / cycle.

nVidia 4070 Ti Super with 45.9B transistors delivers 16896 flops / cycle.

As you see, each transistor became 3.45 times more efficient at delivering these FLOPs per cycle.

johnklos · 2025-06-06T18:26:14 1749234374

> and the difference in number of transistors is 1,250 times

I should've written per core.

amelius · 2025-06-06T18:39:30 1749235170

> and now you're spending millions of transistors

and spending millions on patent lawsuits ...

smegger001 · 2025-06-07T01:26:38 1749259598

correct me if i am wrong but isn't that what was tried with the Intel Itanium processors line, only the smarter compilers and assemblers never quiet got there.

what makes it more likely to work this time?

PhilipRoman · 2025-06-07T12:23:32 1749299012

Optimizing compiler technology was still in the stone age (arguably still is) when Itanium was released. LLVM had just been born and GCC didn't start using SSA until 2005. Egraphs were unheard of in context of compiler optimization.

That said, yesterday I saw gcc generate 5 KB of mov instructions because it couldn't gracefully handle a particular vector size so I wouldn't get my hopes up...

joshstrange · 2025-06-07T12:13:54 1749298434

I tire of “Employees from Y company leave to start their own” and even “Ex-Y employees launch new W”.

How many times do we have to see these stories play out to realize it doesn’t matter where they came from. These big companies employee a lot of people of varying skill, having it on your resume means almost nothing IMHO.

Just look at the Humane pin full of “ex-Apple employees”, how’d that work out? And that’s only one small example.

I hope IO (OpenAi/Jony Ive) fails so spectacularly so that we have an even better example to point to and we can dispel the idea that if you did something impressive early in your career or worked for an impressive company, it doesn’t mean you will continue to do so.

nchmy · 2025-06-07T12:23:48 1749299028

I immediately redflag anyone who advertises themselves as "ex-company". It shows a lack of character, judgment, and, probably, actual results/contributions. Likewise, it shows that they're probably not a particularly independent thinker - they're just following the herd of people who describe themselves like that (whose ven diagram surely overlaps considerably with people who describe themselves as "creatives" - as if a car mechanic working on a rusty bolt or a kindergarten teacher, or anyone else, is not creative.

Moreover, if the ex company was so wonderful and they were so integral to it, why aren't they still there? If they did something truly important, why not just advertise that (and I'm putting aside here qualms about overt advertising rather than something more subtle, authentic, organic).

627467 · 2025-06-08T11:58:35 1749383915

Yeah... "Combined 100 years of experience" and in previous article [0] it was "combined 80+ years" for the same people... What happened there? Accelerated aging?

[0] https://news.ycombinator.com/item?id=41353155

lofaszvanitt · 2025-06-07T15:27:29 1749310049

There are other forces at play. Same happens with video games. People can't see how many "variables" must be tuned in order to become notified and successful.

kleiba · 2025-06-06T17:32:22 1749231142

Good luck not infringing on any patents!

And that's not sarcasm, I'm serious.

neuroelectron · 2025-06-06T18:13:46 1749233626

Intel restructures into patent troll, hiring reverse engineers and investing in chip sanding and epoxy acids.

Ocha · 2025-06-06T15:00:43 1749222043

https://archive.ph/BSKSq

Foobar8568 · 2025-06-06T16:10:58 1749226258

Bring back the architecture madness era of the 80s/90s.

jmclnx · 2025-06-06T15:17:55 1749223075

>AheadComputing is betting on an open architecture called RISC-V

I wish them success, plus I hope they do not do what Intel did with its add-ons.

Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.

https://en.wikipedia.org/wiki/Intel_Management_Engine

https://en.wikipedia.org/wiki/Intel_Active_Management_Techno...

constantcrying · 2025-06-06T18:49:59 1749235799

>Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.

The architecture is independent of additional silicon with separate functions. The "only" thing which makes RISC-V open are that the specifications are freely available and freely usable.

Intel ME is, by design, separate from the actual CPU. Whether the CPU uses x86 or RISC-V is essentially irrelevant.

aesbetic · 2025-06-06T16:14:20 1749226460

This is more a bad look for Intel than anything truly exciting since they refuse to produce any details lol

esafak · 2025-06-06T15:12:18 1749222738

Can't they make a GPU instead? Please save us!

AlotOfReading · 2025-06-06T15:34:09 1749224049

A GPU is a very different beast that relies much more heavily on having a gigantic team of software developers supporting it. A CPU is (comparatively) straightforward. You fab and validate a world class design, make sure compiler support is good enough, upstream some drivers and kernel support, and make sure the standard documentation/debugging/optimization tools are all functional. This is incredibly difficult, but achievable because these are all standardized and well understood interface points.

With GPUs you have all these challenges while also building a massively complicated set of custom compilers and interfaces on the software side, while at the same time trying to keep broken user software written against some other company's interface not only functional, but performant.

esafak · 2025-06-06T15:41:02 1749224462

It's not the GPU I want per se but its ability to run ML tasks. If you can do that with your CPU fine!

AlotOfReading · 2025-06-06T16:30:35 1749227435

Echoing the other comment, this isn't easier. I was on a team that did it. The ML team was overheard by media complaining that we were preventing them from achieving their goals because we had taken 2 years to build something that didn't beat the latest hardware from Nvidia, let alone keep pace with how fast their demands had grown.

mdaniel · 2025-06-06T20:25:53 1749241553

I don't need it to beat the latest from nvidia, just be affordable, available, and have user servicable ram slots so "48gb" isn't such an ooo-ahh amount of memory

I couldn't find any buy it now links but 512gb sticks don't seem to be fantasies, either: https://news.samsung.com/global/samsung-develops-industrys-f...

mdaniel · 2025-06-07T01:50:40 1749261040

Now that I'm back at my computer, can search harder and it seems one can legitimately buy 256GB sticks at approximately USD$2000 a pop <https://www.ebay.com/itm/267177294719> or 128GB for $790 <https://www.ebay.com/itm/205354052535>

Since it seems A100s top out at 80GB, and appear to start at $10,000 I'd say it's a steal

Yes, I'm acutely aware that bandwidth matters, but my mental model is the rest of that sentence is "up to a point," since those "self hosted LLM" threads are filled to the brim with people measuring tokens-per-minute or even running inference on CPU

I'm not hardware adjacent enough to try such a stunt, but there was also recently a submission of a BSD-3-Clause implementation of Google's TPU <https://news.ycombinator.com/item?id=44111452>

AlotOfReading · 2025-06-07T20:37:56 1749328676

What would you want from on-gpu slots that you don't get from existing mechanisms to give your GPU access to system memory?

mdaniel · 2025-06-08T01:01:14 1749344474

prelude: I realized that I typed out a ton of words but in the end engineering is all about tradeoffs, so, fine if there's a way I can teach some existing GPU, or some existing PCIe TPU, to access system RAM over an existing PCIe slot, that sounds like a fine step forward. I just don't have a lot of experience in that setup to know if only certain video cards allow that or what

Bearing in mind the aforementioned "I'm not a hardware guy," my mental model of any system RAM access for GPUs is:

  1. copy weights from SSD to RAM
  2. trigger GPU with that RAM location
  3. GPU copies weights over PCIe bus to do calculation
  4. GPU copies activations over PCIe bus back to some place in RAM
  5. goto 3

If my understanding is correct, this PCIe (even at 16 lanes) is still shared with everything else on the motherboard that is also using PCIe, to say nothing of the actual protocol handshaking since it's a common bus and thus needs contention management. I would presume doing such a stunt would at bare minimum need to contend with other SSD traffic and the actual graphical part of the GPU's job[1][2]

Contrast this with memory socket(s) on the "GPU's mainboard" where it is, what, 3mm of trace wires away from ripping the data back and forth between its RAM and its processors, only choosing to PCIe the result out to RAM. It can have its own PCIe to speak to other sibling GPGPU setups for doing multi-device inference[3]

I would entertain people saying "but what a waste having 128GB of RAM only usable for GPGPU tasks" but if all these folks are right in claiming that it's the end of software engineering as we know it, I would guess it's not going to be that idle

1: I wish I had actually made a bigger deal out of wanting a GPGPU since for this purpose I don't care at all what DirectX or Vulkan whatever it runs

2: furthermore, if the "just use system RAM" was such a hot idea, I don't think it would be 2025 and we still have graphics cards with only 8GB of RAM on them. I'm not considering the Apple architecture because they already solder RAM and mark it up so much that normal people can't afford a sane system anyway, so I give no shits how awesome their unified architecture is

3: I also should have drew more attention to the inference need, since AIUI things like the TPUs I have on my desk aren't (able to do|good at) training jobs but that's where my expertise grinds to a halt because I have no idea why that is or how to fix it

AlotOfReading · 2025-06-08T04:11:48 1749355908

Oh, it's not a good idea at all from a performance perspective to use system memory because it's slow as heck. The important thing is that you can do it. Some way of allowing the GPU to page in data from system RAM (or even storage) on an as-needed basis has been supported by Nvidia since at least Tesla generation.

There's actually a multitude of different ways now that each have their own performance tradeoffs like direct DMA from the Nvidia card, data copied via CPU, GPU direct storage, and so on. You seem to understand the gist though, so these are mainly implementation details. Sometimes there's weird limitations with one method like limited to Quadro, or only up to a fixed percentage of system memory.

The short answer is that all of them suck to different degrees and you don't want to use them if possible. They're enabled by default for virtually all systems because they significantly simplify CUDA programming. DDR is much less suitable than GDDR for feeding a bandwidth hungry monster like a GPU, PCI introduces high latency and further constructions, and any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth.

There's also some additional downsides to accessing system RAM that we don't need to get into, like sometimes losing the benefits of caching and getting full cost memory accesses every time.

mdaniel · 2025-06-08T17:08:32 1749402512

That's interesting, thanks for making me aware. I'll try to dig up some reading material, but in some sense this is going the opposite of how I want the world to work because nvidia is already a supply chain bottleneck and so therefore saying "the solution to this supply and demand is more CUDA" doesn't get me where I want to go

> any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth

I am afraid what I'm about to say doubles down on my inexperience, but: I could have sworn that series of problems is what DMA was designed to solve: peripherals do their own handshaking without requiring the CPU's involvement (aside from the "accounting" bits of marking regions as in-use). And thus if a GPGPU comes already owning its own RAM, it most certainly does not need to ask the CPU to do jack squat to talk to its own RAM because there's no one else who could possibly be using it

I was looking for an example of things that carried their own RAM and found this, which strictly speaking is what I searched for but is mostly just funny so I hope others get a chuckle too: a SCSI ram disk <https://micha.freeshell.org/ramdisk/RAM_disk.jpg>

AlotOfReading · 2025-06-08T17:23:16 1749403396

Sorry if that was confusing. I was trying to communicate a generality about multiple very different means of accessing the memory: the way we currently build GPUs is a local maximum for performance. Changing anything, even putting dedicated memory on sockets, has a dramatic and negative impact on performance. The latest board I've worked on saw the layout team working overtime to place the memory practically on top of the chip and they were upset it couldn't be closer.

Also, other systems have similar technologies, I'm just mentioning Nvidia as an example.

kvemkon · 2025-06-06T22:49:41 1749250181

And now, 4 years later, I still can choose only among micron and hynix for consumer DDR5 DIMM. No samsung or nanya which I could order right now.

While micron (crucial) 64GB DDR5 (SO-)DIMMs are available since few months.

mort96 · 2025-06-06T15:48:44 1749224924

Well that's even more difficult because not only do you need drivers for the widespread graphics libraries Vulkan, OpenGL and Direct3D, but you also need to deal with the GPGPU mess. Most software won't ever support your compute-focused GPU because you won't support CUDA.

Bolwin · 2025-06-06T17:47:37 1749232057

I mean you most certainly can. Pretty much every ml library has cpu support

esafak · 2025-06-06T18:31:50 1749234710

Not theoretically, but practically, viably.

Asraelite · 2025-06-06T15:47:58 1749224878

> make sure compiler support is good enough

Do compilers optimize for specific RISC-V CPUs, not just profiles/extensions? Same for drivers and kernel support.

My understanding was that if it's RISC-V compliant, no extra work is needed for existing software to run on it.

Arnavion · 2025-06-06T16:59:26 1749229166

You want to optimize for specific chips because different chips have different capabilities that are not captured by just what extensions they support.

A simple example is that the CPU might support running two specific instructions better if they were adjacent than if they were separated by other instructions ( https://en.wikichip.org/wiki/macro-operation_fusion ). So the optimizer can try to put those instructions next to each other. LLVM has target features for this, like "lui-addi-fusion" for CPUs that will fuse a `lui; addi` sequence into a single immediate load.

A more complex example is keeping track of the CPU's internal state. The optimizer models the state of the CPU's functional units (integer, address generation, etc) so that it has an idea of which units will be in use at what time. If the optimizer has to allocate multiple instructions that will use some combination of those units, it can try to lay them out in an order that will minimize stalling on busy units while leaving other units unused.

That information also tells the optimizer about the latency of each instruction, so when it has a choice between multiple ways to compute the same operation it can choose the one that works better on this CPU.

See also: https://myhsu.xyz/llvm-sched-model-1/ https://myhsu.xyz/llvm-sched-model-1.5/

If you don't do this your code will still run on your CPU. It just won't necessarily be as optimal as it could be.

Bolwin · 2025-06-06T17:46:48 1749232008

Wonder if we could generalize this so you can just give the optimizer a file containing all this info, without needing to explicitly add support for each cpu

frankchn · 2025-06-06T18:39:50 1749235190

These configuration files exist (https://llvm.org/docs/TableGen/, https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...) but it is very complicated because the processors themselves are very complicated.

AlotOfReading · 2025-06-06T16:11:38 1749226298

The major compilers optimize for microarchitecture, yes. Here's the tablegen scheduling definition behind LLVM's -mtune=sifive-670 flag as an example: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

It's not that things won't run, but this is necessary for compilers to generate well optimized code.

speedgoose · 2025-06-06T15:53:06 1749225186

I hope to see dedicated GPU coprocessors disappear sooner rather than later, just like arithmetic coprocessors did.

wtallis · 2025-06-06T16:13:32 1749226412

Arithmetic co-processors didn't disappear so much as they moved onto the main CPU die. There were performance advantages to having the FPU on the CPU, and there were no longer significant cost advantages to having the FPU be separate and optional.

For GPUs today and in the foreseeable future, there are still good reasons for them to remain discrete, in some market segments. Low-power laptops have already moved entirely to integrated GPUs, and entry-level gaming laptops are moving in that direction. Desktops have widely varying GPU needs ranging from the minimal iGPUs that all desktop CPUs now already have, up to GPUs that dwarf the CPU in die and package size and power budget. Servers have needs ranging from one to several GPUs per CPU. There's no one right answer for how much GPU to integrate with the CPU.

otabdeveloper4 · 2025-06-06T18:24:40 1749234280

By "GPU" they probably mean "matrix multiplication coprocessor for AI tasks", not actually a graphics processor.

wtallis · 2025-06-06T19:16:56 1749237416

That doesn't really change anything. The use cases for a GPU in any given market segment don't change depending on whether you call it a GPU.

And for low-power consumer devices like laptops, "matrix multiplication coprocessor for AI tasks" is at least as likely to mean NPU as GPU, and NPUs are always integrated rather than discrete.

otabdeveloper4 · 2025-06-07T15:07:13 1749308833

Yes it does change something.

A GPU needs to run $GAME from $CURRENT_YEAR at 60 fps despite the ten million SLoC of shit code and legacy cruft in $GAME. That's where the huge expense for the GPU manufacturer lies.

Matrix multiplication is a solved probelm and we need to implement it just once in hardware. At some point matrix multiplication will be ubiquitous like floating-point is now.

wtallis · 2025-06-07T23:28:35 1749338915

You're completely ignoring that there are several distinct market segments that want hardware to do AI/ML. Matrix multiplication is not something you can implement in hardware just once.

NVIDIA's biggest weakness right now is that none of their GPUs are appropriate for any system with a lower power budget than a gaming laptop. There's a whole ecosystem of NPUs in phone and laptop SoCs targeting different tradeoffs in size, cost, and power than any of NVIDIA's offerings. These accelerators represent the biggest threat NVIDIA's CUDA monopoly has ever faced. The only response NVIDIA has at the moment is to start working with MediaTek to build laptop chips with NVIDIA GPU IP and start competing against pretty much the entire PC ecosystem.

At the same time, all the various low-power NPU architectures have differing limitations owing to their diverse histories, and approximately none of them currently shipping were designed from the beginning with LLMs in mind. On the timescale of hardware design cycles, AI is still a moving target.

So far, every laptop or phone SoC that has shipped with both an NPU and a GPU has demonstrated that there are some AI workloads where the NPU offers drastically better power efficiency. Putting a small-enough NVIDIA GPU IP block onto a laptop or phone SoC probably won't be able to break that trend.

In the datacenter space, there are also tradeoffs that mean you can't make a one-size-fits-all chip that's optimal for both training and inference.

In the face of all the above complexity, the question of whether a GPU-like architecture retains any actual graphics-specific hardware is a silly question. NVIDIA and AMD have both demonstrated that they can easily delete that stuff from their architectures to get more TFLOPs for general compute workloads using the same amount of silicon.

touisteur · 2025-06-06T19:35:07 1749238507

Wondering how you'd classify Gaudi, tenstorrent-stuff, groq, or lightmatter's photonic thing.

Calling something a GPU tends to make people ask for (good, performant) support for opengl, Vulkan, direct3d... which seem like a huge waste of effort if you want to be an "AI-coprocessor".

wtallis · 2025-06-06T19:57:33 1749239853

> Wondering how you'd classify Gaudi, tenstorrent-stuff, groq, or lightmatter's photonic thing.

Completely irrelevant to consumer hardware, in basically the same way as NVIDIA's Hopper (a data center GPU that doesn't do graphics). They're ML accelerators that for the foreseeable future will mostly remain discrete components and not be integrated onto Xeon/EPYC server CPUs. We've seen a handful of products where a small amount of CPU gets grafted onto a large GPU/accelerator to remove the need for a separate host CPU, but that's definitely not on track to kill off discrete accelerators in the datacenter space.

> Calling something a GPU tends to make people ask for (good, performant) support for opengl, Vulkan, direct3d... which seem like a huge waste of effort if you want to be an "AI-coprocessor".

This is not a problem outside the consumer hardware market.

otabdeveloper4 · 2025-06-07T15:21:10 1749309670

Consumer hardware and AI inference are joined at the hip right now due to perverse historical reasons.

AI inference's big bottleneck right now is RAM and memory bandwidth, not so much compute per se.

If we redid AI inference from scratch without consumer gaming considerations then it probably wouldn't be a coprocessor at all.

saltcured · 2025-06-06T18:34:53 1749234893

Aspects of this has been happening for a long time, as SIMD extensions and as multi-core packaging.

But, there is much more to discrete GPUs than vector instructions or parallel cores. It's very different memory and cache systems with very different synchronization tradeoffs. It's like an embedded computer hanging off your PCI bus, and this computer does not have the same stable architecture as your general purpose CPU running the host OS.

In some ways, the whole modern graphics stack is a sort of integration and commoditization of the supercomputers of decades ago. What used to be special vector machines and clusters full of regular CPUs and RAM has moved into massive chips.

But as other posters said, there is still a lot more abstraction in the graphics/numeric programming models and a lot more compiler and runtime tools to hide the platform. Unless one of these hidden platforms "wins" in the market, it's hard for me to imagine general purpose OS and apps being able to handle the massive differences between particular GPU systems.

It would easily be like prior decades where multicore wasn't taking off because most apps couldn't really use it. Or where special things like the "cell processor" in the playstation required very dedicated development to use effectively. The heterogeneity of system architectures makes it hard for general purpose reuse and hard to "port" software that wasn't written with the platform in mind.

rjsw · 2025-06-06T16:27:15 1749227235

That was one of the ideas behind Larrabee [1]. You can run Mesa on the CPU today using the llvmpipe backend.

https://en.wikipedia.org/wiki/Larrabee_(microarchitecture)

pstuart · 2025-06-06T17:30:23 1749231023

If Intel were smart (cough), they'd fund lots of skunkworks startups like this that could move quickly and freely, but then be "guided home" into intel once mature enough.

cjbgkagh · 2025-06-06T17:56:58 1749232618

That creates a split between those who get to work on skunk works and those stuck on legacy. It’s very possible to end up with a google like situation where no-one wants to keep the lights on for old projects as doing so would be career suicide. There have been some attempts at other companies at requiring people to have a stake in multiple projects in different stages of the lifecycle but I’ve never seen a stable version of this, as individuals benefit from bending the rules.

pstuart · 2025-06-06T19:30:23 1749238223

Those are valid problems, however, they are not insurmountable.

There's plenty of people who would be fine doing unexciting dead end work if they were compensated well enough (pay, work-life balance, acknowledgement of value, etc).

This is ye olde Creative Destruction dilemma. There's too much inertia and politics internally to make these projects succeed in house. But if a startup was owned by the org and they mapped out a path of how to absorb it after it takes off they then reap the rewards rather than watch yet another competitor eat their lunch.

cjbgkagh · 2025-06-06T21:23:35 1749245015

A spin-out to reacquire. I've seen a lot of outsourcing innovation via startups with much the same effects as skunk works. People at the main company become demoralized that the only way to get anything done is to leave the company, why solve a problem internally when you can do it externally for a whole more money and recognition. The causes brain drain to the point that the execs at the main company become suspicious of anyone who choses to remain long term. It even gets to the point that even after you're acquired it's better to leave and do it over again because the execs will forget you were acquired and start confusing you with their lifers.

The only way I've seen anyone deal with this issue successfully is with rather small companies which don't have nearly as much of the whole agency cost of management to deal with.

mbreese · 2025-06-06T20:07:54 1749240474

Or fund a lot of CPU startups that were tied back to Intel for manufacturing/foundry work. Sure, they could be funding their next big CPU competitor, but they'd still be able to capture the revenue from actually producing the chips.

ngneer · 2025-06-06T19:00:31 1749236431

I wonder if there is any relation to the cancelled Royal and Beast Lake projects.

https://www.notebookcheck.net/Intel-CEO-abruptly-trashed-Roy...

rajnathani · 2025-06-09T10:43:09 1749465789

Side: Just like I mentioned in another HN comment [0] (and got 5-6 upvotes), I wish that HN titles could be expanded to have more necessary information when possible, which in this case is the name of the startup "AheadComputing", and if we're fortunate to even have RISC-V somehow mentioned in the title.

[0] https://news.ycombinator.com/item?id=44105572

saulpw · 2025-06-06T15:49:12 1749224952

The traitorous four.

asplake · 2025-06-06T17:29:02 1749230942

> The traitorous eight was a group of eight employees who left Shockley Semiconductor Laboratory in 1957 to found Fairchild Semiconductor.

https://en.wikipedia.org/wiki/Traitorous_eight

saulpw · 2025-06-06T18:51:59 1749235919

Thanks, I guess that particular history and analogy would not be known universally :)

ahartmetz · 2025-06-07T11:37:00 1749296220

Only two forks left before they'd need to start with half a person.

phendrenad2 · 2025-06-06T20:24:44 1749241484

Previous discussion 9 months ago: https://news.ycombinator.com/item?id=41353155

mixmastamyk · 2025-06-06T15:34:55 1749224095

I was hoping they’d work with existing RV folks rather than starting another one of a dozen smaller attempts. Article says however that Keller from Tenstorrent will be on their board. Good I suppose, but hard to know the ramifications. Why not merge their companies and investments in one direction?

energy123 · 2025-06-06T17:34:21 1749231261

I like the retro-ish and out of trend name they've chosen: AheadComputing.

laughingcurve · 2025-06-06T18:10:11 1749233411

Together Compute

SFCompute

And so on … definitely not out of trend

badc0ffee · 2025-06-07T15:21:00 1749309660

Computing sounds retro, but Compute does not.

logicchains · 2025-06-06T17:21:31 1749230491

I wonder if it'll be ready before the Mill CPU?

guywithahat · 2025-06-06T16:22:48 1749226968

As someone who knows almost nothing about CPU architecture, I've always wondered if there could be a new instruction set, better suited to today's needs. I realize it would require a monumental software effort but most of these instruction sets are decades old. RISC-V is newer but my understanding is it's still based around ARM, just without royalties (and thus isn't bringing many new ideas to the table per say)

jcranmer · 2025-06-06T16:40:57 1749228057

> RISC-V is newer but my understanding is it's still based around ARM, just without royalties (and thus isn't bringing many new ideas to the table per say)

RISC-V is the fifth version of a series of academic chip designs at Berkeley (hence it's name).

In terms of design philosophy, it's probably closest to MIPS of the major architectures; I'll point out that some of its early whitepapers are explicitly calling out ARM and x86 as the kind of architectural weirdos to avoid emulating.

dehrmann · 2025-06-06T16:47:40 1749228460

> I'll point out that some of its early whitepapers are explicitly calling out ARM and x86 as the kind of architectural weirdos to avoid emulating.

Says every new system without legacy concerns.

guywithahat · 2025-06-06T16:52:15 1749228735

Theoretically wouldn't MIPS be worse, since it was designed to help students understand CPU architectures (and not to be performant)?

Also I don't meet to come off confrontational, I genuinely don't know

jcranmer · 2025-06-06T17:09:53 1749229793

The reason why I say RISC-V is probably most influenced by MIPS is because RISC-V places a rather heavy emphasis on being a "pure" RISC design. (Also, RISC-V was designed by a university team, not industry!) Some of the core criticisms of the RISC-V ISA is on it carrying on some of these trends even when experience has suggested that doing otherwise would be better (e.g., RISC-V uses load-linked/store-conditional instead of compare-and-swap).

Given that the core motivation of RISC was to be a maximally performant design for architectures, the authors of RISC-V would disagree with you that their approach is compromising performance.

BitwiseFool · 2025-06-06T17:05:37 1749229537

MIPS was a genuine attempt at creating a new commercially viable architecture. Some of the design goals of MIPS made it conducive towards teaching, namely its relative simplicity and lack of legacy cruft. It was never intended to be an academic only ISA. Although I'm certain the owners hoped that learning MIPS in college would lead to wider industry adoption. That did not happen.

Interestingly, I recently completed a masters-level computer architecture course and we used MIPS. However, starting next semester the class will use RISC-V instead.

zozbot234 · 2025-06-06T18:10:32 1749233432

MIPS has a few weird features such as delay slots, that RISC-V sensibly dispenses with. There's been also quite a bit of convergent evolution in the meantime, such that AArch64 is significantly closer to MIPS and RISC-V compared to ARM32. Though it's still using condition codes where MIPS and RISC-V just have conditional branch instructions.

anthk · 2025-06-06T23:11:53 1749251513

MIPS was used in the PSX and the N64 among the SGI workstations of its day, and the PSP too. Pretty powerful per cycle.

dehrmann · 2025-06-06T16:46:29 1749228389

I'm far from an expert here, but these days, it's better to view the instruction set as a frontend to the actual implementation. You can see this with Intel's E/P cores; the instructions are the same, but the chips are optimized differently.

There actually have been changes for "today's needs," and they're usually things like AES acceleration. ARM tried to run Java natively with Jazelle, but it's still best to think of it as a frontend, and the fact that Android is mostly Java and ARM, but this feature got dropped says a lot.

The fact that there haven't been that many changes shows they got the fundamental operations and architecture styles right. What's lacking today is where GPUs step in: massively wide SIMD.

inkyoto · 2025-06-07T03:45:14 1749267914

There are not really any newer instruction sets as we are locked into the von Neumann architecture and, until we move away from it, we will continue to move data between memory and CPU registers, or registers ↭ registers etc, which means that we will continue to add, shift, test conditions of arithmetic operations – same instructions across pretty much any CPU architecture relevant today.

So we have:

  CISC – which is still used outside the x86 bubble;

  RISC – which is widely used;

  Hybrid RISC/CISC designs – x86 excluding, that would be the IBM z/Architecture (i.e. mainframes);

  EPIC/VLIW – which has been largely unsuccessful outside DSP's and a few niches.

They all deal with registers, movements and testing the conditions, though, and one can't say that an ISA 123 that effectively does the same thing as an ISA 456 is older or newer. SIMD instructions have been the latest addition, and they also follow the same well known mental and compute models.

Radically different designs, such as Intel APX 432, Smalltalk, Java CPU's, have not received any meaningful acceptance, and it seems that the idea of a CPU architecture that is tied to a higher level compute model has been eschewed in perpetuity. Java CPU's were the last massively hyped up attempt to change it, and that was 30 years ago.

What other viable alternatives outside the von Neumann architecture are available to us? I am not sure.

GregarianChild · 2025-06-07T09:30:29 1749288629

Modern GPU instructions are often VLIW and the compiler has to do a lot to schedule them. For example, Nvidia's Volta (from 2017) uses 128-bit to encode each instruction. According to [1], the 128 bits in a word are used as follows:

• at least 91 bits are used to encode the instruction

• at least 23 bits are used to encode control information associated to multiple instructions

• the remaining 14 bits appeared to be unused

AMD GPUs are similar, I believe. VLIW is good for instruction density. VLIW was unsuccessful in CPUs like Itanium because the compiler was expected to handle (unpredictable) memory access latency. This is not possible, even today, for largely sequential workloads. But GPUs typically run highly parallel workload (e.g. MatMul), and the dynamic scheduler can just 'swap out' threads that wait for memory loads. Your GPU will also perform terribly on highly sequential workloads.

[1] Z. Jia, M. Maggioni, B. Staiger, D. P. Scarpazza, Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. https://arxiv.org/abs/1804.06826

inkyoto · 2025-06-08T10:19:22 1749377962

Personally, I have a soft spot for VLIW/EPIC architectures, and I really wish they were more successful in the mainstream computing.

I didn't consider GPU's precisely for the reason you mentioned – because of their unsuitability to run sequential workloads, which is most applications that end users run, even though nearly every modern computing contraption in existence has them today.

One, most assuredly, radical departure from the von Neumann architecture that I completely forgot about is the dataflow CPU architecture, which is vastly different from what we have been using in the last 60+ years. Even though there have been no productionised general purpose dataflow CPU's, it has been successfully implemented for niche applications, mostly in the networking. So, circling back to the original point raised, dataflow CPU instructions would certainly qualify for a new design.

GregarianChild · 2025-06-08T14:51:34 1749394294

The reason that VLIW/EPIC architectures have not been successful that for mainstream workloads is the combination of

• the "memory wall",

• the static unpredictability of memory access, and

• the lack of sufficient parallelism for masking latency.

Those make dynamically scheduling instructions is just much more efficient.

Dataflow has been tried many many many times for general-purposed workloads. And every time it failed for general-purposed workloads. In the early 2020s I was part of an expensive team doing a blank-slate dataflow architecture for a large semi company: the project got cancelled b/c the performance figures were weak relative to the complexity of micro-architecture, which was high (hence expensive verification and high area). As one of my colleagues on that team says: "Everybody wants to work on dataflow until he works on dataflow." Regarding history of dataflow architectures, [1] is from 1975, so half a century old this year.

[1] J. Dennis, A Preliminary Architecture for a Basic Data-Flow Processor https://courses.cs.washington.edu/courses/cse548/11au/Dennis...

scroot · 2025-06-08T02:10:54 1749348654

Have you heard about ufork? I find it very promising

https://github.com/organix/uFork

inkyoto · 2025-06-08T11:02:35 1749380555

Nope, not until now. It seems to be a much more modern take on the idea of an object oriented CPU architecture.

Yet, there is something about object oriented ISA's that has made CPU designers eschew them consistently. Ranging from the Intel iAPX-432, to the Japanese Smalltalk Katana CPU, to jHISC, to another, unrelated, Katana CPU by the University of Texas and the University of Illinois, none of them have ever yielded a mainstream OO CPU. Perhaps, modern computing is not very object oriented after all.

[0] https://github.com/organix/uFork/blob/main/docs/asm.md

ksec · 2025-06-06T16:46:21 1749228381

>I've always wondered if there could be a new instruction set, better suited to today's needs.

AArach64 is pretty much a completely new ISA built from ground up.

https://www.realworldtech.com/arm64/5/

ItCouldBeWorse · 2025-06-06T16:35:50 1749227750

I think the ideal would be something like a Xilinx offering, tailoring the CPU- regarding cache, parallelism and in hardware execution of hotloop components, depending on the task.

Your CPU changes with every app, tab and program you open. Changing from one core, to n-core plus AI-GPU and back. This idea, that you have to write it all in stone, always seemed wild to me.

dehrmann · 2025-06-06T16:49:40 1749228580

I'm fuzzy on how FPGAs actually work, but they're heavier weight than you think, so I don't think you'd necessarily get the wins you're imagining.

FuriouslyAdrift · 2025-06-06T18:15:20 1749233720

You should definitely look into AMD's Instict, Xynq, and Versal lines, then.

dmitrygr · 2025-06-07T03:28:31 1749266911

> As someone who knows almost nothing about CPU architecture, I've always wondered if there could be a new instruction set, better suited to today's needs.

It exists, and was specifically designed to go wide since clock speeds have limits, bit ILP can be scaled almost infinitely if you are willing to put enough transistors into it. aarch64

bluesounddirect · 2025-06-06T22:45:59 1749249959

https://archive.ph/BSKSq

ahartmetz · 2025-06-06T15:18:22 1749223102

I don't know, RISC-V doesn't seem to be very disruptive at this point? And what's the deal with specialized chips that the article mentions? Today, the "biggest, baddest" CPUs - or at least CPU cores - are the general-purpose (PC and, somehow, Apple mobile / tablet) ones. The opposite of specialized.

Are they going to make one with 16384 cores for AI / graphics or are they going to make one with 8 / 16 / 32 cores that can each execute like 20 instructions per cycle?

jasoneckert · 2025-06-06T15:20:51 1749223251

Most of the work that goes into chip design isn't related to the ISA per se. So, it's entirely plausible that some talented chip engineers could design something that implements RISC-V in a way that is quite powerful, much like how Apple did with ARM.

The biggest roadblock would be lack of support on the software side.

lugu · 2025-06-06T15:50:51 1749225051

On the software side, I think the biggest blocker is affordable UEFI laptop for developers. A risc-v startup aiming to disrupt the cloud should include this in their master plan.

ahartmetz · 2025-06-06T15:34:10 1749224050

Yeah sure, but the question remains if it's going to be a huge amount of small cores or a moderate amount of huge cores.

What it can't be is something like the Mill if they implement the RISC-V ISA.

leetrout · 2025-06-06T17:26:23 1749230783

For those that don't know about the Mill see https://millcomputing.com/

I came to this thread looking for a comment about this. I've been patiently following along for over a decade now and I'm not optimistic anything will come from the project :(

ahartmetz · 2025-06-06T20:18:27 1749241107

Yeah, I guess not at this point, but the presentations were very interesting to watch. According to the yearly(!) updates on their website, they are still going but not really close to finishing a product. Hm.

pabs3 · 2025-06-09T07:06:42 1749452802

Last thing I read said they need investment to be able to get much further.

mixmastamyk · 2025-06-06T15:39:48 1749224388

Article implies a CPU focus at first, though is a bit vague. Title is clear however.

trollbridge · 2025-06-06T17:09:03 1749229743

Doesn't RISC-V have a fairly reasonable ecosystem by now?

inkyoto · 2025-06-07T04:22:13 1749270133

It has been catching up but is still inadequate, at least from the compiler optimisation perspective.

The lack of high-performance RISC-V designs means that C/C++ compilers produce all-around good but generic code that can run on most RISC-V CPU's, from microcontrollers to a few commercially available desktops or laptops, but it can't exploit high-performance CPU design features of a specific CPU (e.g. exploit instruction timings or specific instruction sequences recommended for each generation). The real issue is that the high-performant RISC-V designs are yet to emerge.

Producing a highly performant CPU is only one part of the job, and the next part requires compiler support, which can't exist unless the vendor publishes extensive documentation that explains how to get the most out of it.

glookler · 2025-06-06T15:36:43 1749224203

If RISC-V started with supercomputing and worked down that would be a change from how an architecture disruption usually works in the industry.

constantcrying · 2025-06-06T15:34:58 1749224098

The article is so bad. Why do they refuse to say anything about what these companies are actually trying to make. RISC-V Chips exist, does the journalist just not know? Does the company refuse to say what they are doing?

muricula · 2025-06-06T16:28:37 1749227317

The article is written for a different audience than you might be used to. oregonlive is the website for the newspaper The Oregonian, which is the largest newspaper in the state of Oregon. Intel has many of its largest fabs in Oregon and is a big employer there. The local news is writing about a hip new startup for a non-technical audience who know what Intel is and why it's important, but need to be reminded what a CPU actually is.

Ericson2314 · 2025-06-06T19:22:14 1749237734

TBH this is a bad sign about job sprawl.

The fact that California housing pushed Intel to Oregon probably helped lead to its failures. Every time a company relocates to get cost of living (and thus payroll) costs down by relocating to a place with fewer potential employees and fewer competing employers, modernity slams on the breaks.

muricula · 2025-06-06T21:55:11 1749246911

That might have been true in the early 2000s when they were growing the Hillsborough Oregon campus but most new fabs are opening in Arizona for taxation and political reasons. I don't have the numbers to back it up, but based on articles about Intel layoffs I believe that Intel has been shedding jobs in Oregon for a while now.

This wiki page has a list of Intel fab starts, you can see them being constructed in Oregon until 2013, and after that all new construction moved elsewhere. https://en.wikipedia.org/wiki/List_of_Intel_manufacturing_si...

I can imagine this slow disinvestment in Oregon would only encourage some architects to quit an found a RISC-V startup.

Ericson2314 · 2025-06-06T22:15:53 1749248153

I am saying that all this stuff should have never left the bay area, and the bay area should have millions more people than it does today.

Arizona is also a mistake --- a far worse place for high tech than Oregon!. It is a desert real estate ponzi scheme with no top-tier schools, no history of top-tier high-skill intellectual job markets. In general the sun belt (including LA) is the land of stupid.

The electoral college is always winning out over the best economic geography, and it sucks.

Ericson2314 · 2025-06-06T19:24:54 1749237894

https://www.aheadcomputing.com/post/everyone-deserves-a-bett... sheesh, even the company's own writing is kinda folksy too.

pragma_x · 2025-06-06T15:47:40 1749224860

It reads like they're trying to drum up investment. This is why the focus is on the pedigree of the founders, since they don't have a product to speak of yet.

whobre · 2025-06-07T16:34:30 1749314070

They should have named it Zilog…

1970-01-01 · 2025-06-06T15:38:08 1749224288

Staring at current AI chip demand levels and choosing to go with RISC chips is the boldest move you could make. Good luck. The competition with the big boys will be relentless. I expect them to be bought if they actually make a dent in the market.

neuroelectron · 2025-06-06T18:08:55 1749233335

Tldr: RISC-V ASICs