Clocking a 6502 simulator to 15GHz (scarybeastsecurity.blogspot.com)
176 points by scarybeast on April 13, 2020 | 74 comments



I was told by Leonard Tramiel (who was my manager at Atari for a while) that the world record for a production 6502 was 25 MHz. This was demonstrated one Friday evening, some time after the beer fridge had been opened in one of the labs.

I don't know if they applied any kind of external cooling, or what the benchmark was. Probably it was "keep cranking up the clock until pins stop wiggling or smoke comes out." Not very scientific, but quite entertaining.


One of my old companies (InformASic) developed a VPN solution for serial communication. The product LinkShield (later renamed and spun off to form CrypTango) was implemented as a small ASIC. The main CPU was a 6502 clone with memory protection. We clocked it at 33 MHz, but usually ran them at 25 MHz in the products. That 6502 clone was cycle correct, i.e. the number of cycles required for an instruction was the same as for the original MOS 6502.

Nowadays you can quite easily do a 6502 implementation in an FPGA running at 100 MHz, especially if you allow the design to use more cycles for some instructions.

Sadly the product never took off and the companies folded. I have some chips somewhere. Googling at least revealed a picture of the product:

https://www.google.com/imgres?imgurl=https%3A%2F%2Ffarm3.sta...


From the Commodore book, about the early 80s:

"We actually made a couple of really hot processors for a chess tournament for somebody. He literally water-cooled it, and he ran it at something like eight megahertz. It was just ridiculous how fast he ran it."

Earlier it was explained that some processors coming off the production line could run faster than others, and they could test for it to pick the best ones for such purposes. They didn't end up increasing the clock speed for released computers, as other components could not keep up.


There's an FPGA 65C02 core running at ~73 MHz. https://github.com/MorrisMA/MAM65C02-Processor-Core


18 MHz, actually -- the FPGA clock speed is 73 MHz, but it executes the equivalent of one 6502 clock cycle in four of its clocks.

That being said, this was implemented on a budget-line FPGA from 2006 (XC3S50A - a small Xilinx Spartan-3A). A modern performance-line FPGA would probably hit a couple hundred MHz easily.


It's still surprising how little IPC has improved since the seventies, relative to the level of perceived performance.


It seems to me that computational throughput has improved exponentially, but latency for input tasks etc. has in fact worsened in many cases.


Per TFA, IPC has improved by about a factor of 5, assuming a 3 GHz machine runs the BBC Micro at the equivalent performance of 15 GHz.


Reality for (SIMD) integer math is probably closer to 1000, floating point... probably 100000.


You're wrong :)

IPC for the 6502 or Z80 (4x "faster" clock, but 3-6 clocks per machine cycle) was well below one: those processors took several clock cycles per instruction.

Even a measly 386/486 was much faster than that.

Enter the Pentium with the ability to execute 2 instructions in parallel.

IPC has been the big gainer recently as well.


It is still only 20-30 fold at most, on average. Much more of the gain came from the thousands-fold clock speed increase and much wider execution paths.


The perceived performance for an average desktop has been pretty stagnant since, I don't know, at least the mid-nineties.


There was a huge perceived improvement when SSDs first appeared.


Software expands to fill the CPU available.


AKA Wirth’s Law[1]: “Software is getting slower more rapidly than hardware is becoming faster.”

[1]: https://en.m.wikipedia.org/wiki/Wirth%27s_law


Software is being produced faster though.


Didn't cars also follow that pattern?


A new stock 65C02 from WDC can do 20 MHz. So this FPGA version at 18 MHz doesn't sound any better. Though I'm sure on a modern FPGA one can do more than that.


Can anyone with an electronics background explain why it's so hard to clock a 6502 higher than a handful of MHz, when modern chips can do 1000x that? Is it just larger transistor scale leading to excess capacitance / slower switching?


One reason is transistor size and switching speed. The technology of the 6502 could probably go to 50 MHz? 100 MHz? Not sure. Would it be equivalent to the 74HC TTL line? Again, not sure.

But the main (basic) reason is that the internal logic blocks don't account for processing and arrival times beyond the speed at which they need to operate. What's simultaneous at 1 MHz might not be so simultaneous at 10 MHz or 100 MHz.

Another (advanced) reason why overclocking it might be hard is EM interference inside and outside the chip.


The 6502 has very limited pipelining, and every CPU cycle is tied to a memory access with no support for wait states or stalls. At 1 MHz it can work with really slow memory (roughly 500 ns), but at 10 MHz it needs ~60 ns, and at 20 MHz something like ~20 ns. The architecture simply wasn't designed for anything above single-digit MHz clock speeds.


"until pins stop wiggling"?


The pins don't physically wiggle. "pins wiggling" is a common metaphor for "the voltage level on a pin is changing".

As a signal driver is toggled at increasing frequencies ('cranking up the clock'), the signal amplitude (voltage difference between the 'high' and 'low' period) starts to drop. At a high enough frequency, the signal will be indistinguishable from noise and 'stops wiggling'.


> At a high enough frequency, the signal will be indistinguishable from noise and 'stops wiggling'.

It's not that the signal will be indistinguishable from noise, but that the CPU will stop working correctly, so its outputs will stop toggling (or will toggle in unexpected ways).


I had the impression that they might be talking about what a digital I/O pin looks like on an oscilloscope.

Scopes trigger on the first rising edge after the last horizontal scan, so the display starts at H, then drops to L after however long the CPU held that pin high. That creates  ̄ ̄l_ traces on the screen superimposed on one another, with the X position of the drop "wiggling" ( ̄lll_) depending on how many consecutive H bits happened to be sent.

When the CPU halts, the pin would flatline at H or L and you’ll know.


I was actually thinking about catastrophic failure, as in something on the chip burning out and making the clock stop.

My guess is that they were probably cooling it with beer (or at least cold beer bottle bottoms) to get that last critical MHz, before drinking the beer.


Wiggling a pin, i.e. toggling it. Basically means "activity on the pins".

https://www.cypress.com/blog/technical/more-pdl-examples-wig...


Euphemism for when you don't see output pins transition state on a measuring instrument, e.g. oscope.


I think it should be “start wiggling” as that makes more sense.


Wiggling pins is what a computer does when it is working.


Maybe the pins got so hot they melted into the board?


I love reading the stories about overclocking attempts (actually that scene doesn't seem to be much of a thing anymore?), people bolting pipes on top of CPUs and filling them with liquid nitrogen, nearly supercooling the CPUs and breaking speed records.


It's an interesting article, but...

Better title: Clocking a 6502 Simulator to 15 GHz. There are multiple efforts to recreate the physical 6502 CPU on modern hardware, this is not one of them and should not be confused with that.


I was a bit confused and expected to see some elaborate liquid cooling nonsense to get the poor chip up to 15 GHz.


Ok, we've put a simulator in the title above.


I realize I'm late to the party, but I've really been enjoying Ben Eater's series on building a simple computer with a 6502.

https://www.youtube.com/playlist?list=PLowKtXNTBypFbtuVMUVXN...


Yes, it's awesome.


After stumbling on Ben Eater’s “Hello world from scratch” [1] I went out and bought the CPU, some parts and breadboards. The chip is only a few dollars. It is highly recommended if you want to dive down into computers and digital logic from first principles. Also great fun to get a break from all the screens and layers upon layers of software that I have to deal with daily.

1. https://youtu.be/LnzuMJLZRdU


Kind of plugging my own project here, but I too am a software developer that found great joy from breaking away from all the layers of abstraction and working directly with the hardware. I created a portable game system with the 6502:

http://www.dodolabs.io/


This is awesome!


For more "very fast simple CPU" architecture, see 50,000,000,000 Instructions Per Second: Design and Implementation of a 256-Core BrainFuck Computer: https://people.csail.mit.edu/wjun/papers/sigtbd16.pdf


Huh... I hadn't considered it before, but Bender's brain could actually be a 6502, just being run at an insanely high clock speed. A few petahertz should be able to handle the AI involved, no?

Planck time is like 10^-43 seconds, so there's lots of room to divvy up a second for more processing power given advanced technologies...


If the hardware is advanced enough to do that, the AI software is similarly advanced, and a basic 6502 can produce a Bender like AI without breaking a sweat!


More advanced software would in all likelihood require significant calculations, rendering a basic 6502 useless.


Yeah... Maybe we'd need to bump up the clock to exahertz to make up for the loss of precision and constant memory access. The top supercomputer is already at 148 petaflops, so we'd need some more headroom for general AI.

Of course, petahertz (10^15 cycles per second) is already the speed at which an electron circles around a hydrogen atom, so we may not be able to use electricity any more...


It would be interesting to compare this project to simply converting 6502 assembly into LLVM IR, and letting Clang's optimization passes work their magic.

Obviously self modifying code would be hard to handle, but every other case ought to work, and the auto-vectorization ought to do amazing things to some loop-heavy code.


I was wondering about this. Does LLVM offer facilities expressive enough to model things like self-modification and arbitrary interruptability?

I agree it'd be wonderful to see auto-vectorization! Obviously, 6502 code does things 1 byte at a time so even adding 32-bit integers is painful. Auto-upgrade of those loops to 32-bit variants would be amazing.


Self modifying code was the norm. I hated the loss of optimizations you could do when going to a system with non-self-modifying code. I ended up just giving up on assembly.
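
A typical example of the kind of optimization being mourned, sketched from memory (labels and the subroutine are made up; the termination test is omitted):

  loop:  lda $8000      ; the operand bytes double as the data pointer
         jsr use_byte   ; do something with the byte (hypothetical subroutine)
         inc loop+1     ; self-modify: bump the low byte of the LDA address
         bne loop
         inc loop+2     ; carry into the high byte
         jmp loop       ; real code would test for the end address here

Patching the absolute address in place makes each load a 4-cycle lda abs instead of a 5+ cycle lda (zp),y and frees the Y register; a static translator has to notice that write into the instruction stream to stay correct.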


I remember thinking, "oh? this was made by people who can't code!?" then I quit entirely.

Before, stuff would reduce to nothing after use. Will this condition continue to change? Will it stay true? (remove the check) Will it stay false? (remove that chunk of code) And then remove this.

Sure, it was fun. More important: People wrote things that were truly impressive. Writing something that worked was only the beginning.

Until compilers know which buttons are used most frequently they can't fully optimize. Who knows, maybe one day Windows will give me a start menu and allow me to reboot when the application freezes up? Maybe one day text input will have some priority? The bare basics, basically?


A 6502 backend for LLVM has been attempted a couple of times [1][2], but the fact that the 6502 only has three registers imposes severe limitations w.r.t. LLVM's calling conventions.

[1] https://github.com/c64scene-ar/llvm-6502

[2] https://github.com/beholdnec/llvm-m6502


Other way around - the idea is to create an LLVM frontend for 6502 machine code, and then transpile that code to run on a modern CPU architecture.


That sounds like an interesting project!

You would probably want to add some tricks directly there, maybe register renaming (I don't know if LLVM does "variable renaming", let's put it this way)


That's the approach taken by https://github.com/libcpu/libcpu. It supports 6502 and several other CPUs.


I would think you’d need to write a couple of new optimization passes, for example to detect multiplications (in various variants, such as “signed 8 bit to 16 bit, multiplier is 13” or “unsigned 16 bit times 8 bit”) and convert them to LLVM multiplication instructions.

There also is the trick where “BIT” instructions are used to give a function multiple entry points, and that BIT instruction can also be a LDA# (https://retrocomputing.stackexchange.com/a/11132)

I’m not sure that can “simply” be converted to LLVM IR.
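
For reference, the trick from that link looks roughly like this (hypothetical labels; $2C is the opcode of BIT absolute, and its two operand bytes swallow the following LDA #):

  SetFlag:   lda #$01     ; first entry point: A = 1
             .byte $2C    ; BIT abs opcode; the next two bytes become its operand
  ClearFlag: lda #$00     ; second entry point: A = 0 (skipped when entered via SetFlag)
             sta flag     ; shared tail ("flag" is a hypothetical variable)
             rts

Entered at SetFlag, the LDA #$00 bytes are consumed as the operand of a (mostly harmless, but flag-clobbering) BIT, so recovering clean control flow from overlapping instruction streams like this in IR is indeed not trivial.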


Is this what you're (roughly) referring to?:

  https://andrewkelley.me/post/jamulator.html
Although the programmer's target is an entire system, the conclusion may still apply:

  > There is a constant struggle between correctness and optimized code. Nearly all
  > optimizations must be tossed out the window in the interest of correctness


This but at runtime.


I’m not sure: LLVM’s optimizer is fairly slow and there’s a lot of bad things going on that will constantly cause deoptimizations. Maybe for some of the hottest paths?


GEOS would have been much more responsive...


To solve the FF page wrapping problem, I wonder if it would work to double-map each 6502 page to x64 host pages side by side. I assume the word read at FF would straddle the two mapped pages, effectively reading the second byte at 00. You'd have to map to host page boundaries of course, and probably offset all reads/writes to the end of the host page at $3F00.


I believe VMware without hardware support for virtualisation also falls back to "binary translation" and similarly gets tripped up by SMC. I don't recall the details right now, but one of the ways to detect it was to modify an instruction in an obscure way that the developers had forgotten about.


I want to see a 6502-alike clocked to 15GHz with an exotic semiconductor such as GaAs, InP, SiGe, etc.


Probably wouldn't be able to see it with current fabrication technologies and historic transistor counts.


Exotic materials, other than maybe SiGe, use fabrication techniques less advanced than Si, and the transistor counts are much lower.

The Department of Defense funded an SBIR grant in the late 1990s to produce an InP-based microprocessor; given the limits of the time it would have been closer to a 6502 than a Pentium. There has not been word of such a thing since, which leads me to conclude that the topic is classified.

The worst limitation a 6502-era chip has is that it has no instruction cache, so instruction reads are fighting with data for memory bandwidth. You might even consider a Harvard architecture where the instructions go on a different bus. Without an I-cache there is no point in pipelining, but there is a lot of pressure to implement CISCy instructions such as the string copy operation from the 8086 line.

The other issue is that there is no DRAM replacement with exotic materials, and all the difficulties with interconnect latency get a lot worse than they already are. It's more clear how to make SRAM, so having somewhere between 64K to 1Mbytes of SRAM on die seems likely for an exotic material CPU.

Of course, armchair CPU designers are more likely to make progress with transport-triggered architectures and FPGAs in 2020.


Now someone needs to write an x86 emulator in 6502 asm and boot Windows.


The Mega65 runs a 6502 at 50 MHz, compatible with the Commodore 65 plus a C64 mode. http://mega65.org


I'm not 100% sure where the 15 GHz equivalent speed calculation comes from.


The author showed it at the end of the article. It's the "effective speed" reported by some benchmark programs (including calling subroutines, running for loops, iterating on a string, etc). These are simple and trivial programs and can be highly optimized in a simulator on modern x86_64. Real-world programs, like games, are slower, as acknowledged in the article.


A lot of BBC BASIC programs, doing real work (e.g. Mandelbrot drawing etc.), should have a shot at 10 GHz. Games are slower because they are hammering hardware registers external to the JIT (sound, graphics, keyboard polling, timing, etc.)

My laptop is an ancient 5th gen i5 with 2 keys having fallen off, so games are down in the 2 GHz - 3 GHz range for me. (Perhaps the missing keys make all the difference.)


I understand that some people look suspiciously at the 15 GHz mark, especially considering this was run on a 4.5 GHz processor. What I understand is that these benchmarks compare how long the workload would've taken on a stock 1 MHz 6502, and calculate the "clock speed" obtained as a ratio. So if I'm getting my result 10,000 times faster than a standard 6502, it means I'm at 10 GHz.

I also understand that this is possible because the emulator is running on a superscalar processor. Not sure if multicore has anything to do with it (the post specifically mentions the high performance of the single-core case for the processor used). Still, considering that processors back in the 6502 era had just one execution port, and superscalars these days have a lot (I think 8? I really lost track of what's usual these days), the figure makes sense all right, and without involving any kind of multithreading.

Kudos to the authors of the emulator for having a super-optimized system that can effectively and efficiently emulate its target!


I like the framing here, that of seeing this as a showcase of modern superscalar improvements. And yes, it's about single core performance only.

What is particularly interesting to me is how thoroughly superscalar "wins". Because of complexities with 6502 -> x64 mapping, and handling self-modifying code in particular, some of the most common 6502 instructions explode to multiple x64 instructions. Despite that huge extra instruction load, the translation still manages to run at much greater speed than a 1:1 instruction ratio.

Modern processors do not run on electrons. They run on unicorn tears and magic.


Note that there is also a speedup from dynamic optimization.


> I also understand that this is possible because the emulator is running on a superscalar processor.

It's also possible because the minimal architecture of the 6502 makes it inherently inefficient. With only three 8-bit registers -- which can't even be used interchangeably! -- and a non-addressable stack, a lot of CPU time on the 6502 is spent shuffling data around. Consider adding two 32-bit numbers, for example. On a 6502, this is a minimum of 38 cycles (clc + (lda, adc, sta) x4); an x86 can complete the same operation in one cycle, potentially in parallel with other operations.
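
Spelled out, that sequence looks like this (a, b and r are hypothetical four-byte little-endian values in zero page, the cheapest addressing mode; per-instruction cycle counts in the comments):

  clc          ; 2 cycles
  lda a+0      ; 3
  adc b+0      ; 3
  sta r+0      ; 3
  lda a+1      ; 3
  adc b+1      ; 3
  sta r+1      ; 3
  lda a+2      ; 3
  adc b+2      ; 3
  sta r+2      ; 3
  lda a+3      ; 3
  adc b+3      ; 3
  sta r+3      ; 3   -> 2 + 4 * 9 = 38 cycles total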


X86 ... not x64


The architecture never had a good name. AMD originally called it "x86-64" (but not AFAIK "AMD64", even though lots of other people did), but "x86_64" is most common in the open source world (I guess because the underscore makes it legal as a C symbol). "x64" is what Sun and Microsoft decided to use. Intel has called it "ia32e", "EM64T" and "Intel 64" at various times.

I think this article gets a pass.



