Clocking a 6502 simulator to 15GHz (scarybeastsecurity.blogspot.com)
176 points by scarybeast on April 13, 2020 | 74 comments



I was told by Leonard Tramiel (who was my manager at Atari for a while) that the world record for a production 6502 was 25 MHz. This was demonstrated one Friday evening, some time after the beer fridge had been opened in one of the labs.

I don't know if they applied any kind of external cooling, or what the benchmark was. Probably it was "keep cranking up the clock until pins stop wiggling or smoke comes out." Not very scientific, but quite entertaining.


One of my old companies (InformASic) developed a VPN solution for serial communication. The product LinkShield (later renamed and spun off to form CrypTango) was implemented as a small ASIC. The main CPU was a 6502 clone with memory protection. We clocked it at 33 MHz, but usually ran them at 25 MHz in the products. That 6502 clone was cycle correct, i.e. the number of cycles required for an instruction was the same as for the original MOS 6502.

Nowadays you can quite easily do a 6502 implementation in an FPGA running at 100 MHz, especially if you allow the design to use more cycles for some instructions.

Sadly the product never took off and the companies folded. I have some chips somewhere. Googling at least revealed a picture of the product:

https://www.google.com/imgres?imgurl=https%3A%2F%2Ffarm3.sta...


From the Commodore book, about the early 80s:

"We actually made a couple of really hot processors for a chess tournament for somebody. He literally water-cooled it, and he ran it at something like eight megahertz. It was just ridiculous how fast he ran it."

Earlier it was explained that some processors coming off the production line could run faster than others, and they could test for it to pick the best ones for such purposes. They didn't end up increasing the clock speed for released computers, as other components could not keep up.


There's an FPGA 65C02 core running at ~73 MHz. https://github.com/MorrisMA/MAM65C02-Processor-Core


18 MHz, actually -- the FPGA clock speed is 73 MHz, but it executes the equivalent of one 6502 clock cycle in four of its clocks.

That being said, this was implemented on a budget-line FPGA from 2006 (XC3S50A - a small Xilinx Spartan-3A). A modern performance-line FPGA would probably hit a couple hundred MHz easily.


It's still surprising how little IPC has improved since the seventies, relative to the level of perceived performance.


It seems to me that computational throughput has improved exponentially, but latency for input tasks etc. has in fact worsened in many cases.


Per TFA, IPC has improved by about a factor of 5, assuming a 3 GHz machine runs the BBC Micro at the equivalent performance of 15 GHz.


Reality for (SIMD) integer math is probably closer to 1000, floating point... probably 100000.


You're wrong :)

IPC for the 6502 or Z80 (4x "faster" clock, but 3-6 clocks per machine cycle) was well below one: those processors took several clock cycles per instruction.

Even a measly 386/486 was much faster than that.

Enter the Pentium with the ability to execute 2 instructions in parallel.

IPC has been the big gainer recently as well.


It is still only 20-30 fold at most, on average. Much more of the gain came from the thousands-fold clock speed increase and much wider execution paths.


The perceived performance for an average desktop has been pretty stagnant since, I don't know, at least the mid-nineties.


There was a huge perceived improvement when SSDs first appeared.


Software expands to fill the CPU available.


AKA Wirth’s Law[1]: “Software is getting slower more rapidly than hardware is becoming faster.”

[1]: https://en.m.wikipedia.org/wiki/Wirth%27s_law


Software is being produced faster though.


Didn't cars also follow that pattern?


A new stock 65C02 from WDC can do 20 MHz. So this FPGA version at 18 MHz doesn't sound any better. Though I'm sure on a modern FPGA one can do more than that.


Can anyone with an electronics background explain why it's so hard to clock a 6502 higher than a handful of MHz, when modern chips can do 1000x that? Is it just larger transistor scale leading to excess capacitance / slower switching?


One reason is transistor size and switching speed. The technology of the 6502 could probably go to 50 MHz? 100 MHz? Not sure. Would it be equivalent to the 74HC TTL line? Again, not sure.

But the main (basic) reason is that the internal logic blocks don't account for processing and arrival times beyond the speed at which they need to operate. What's simultaneous at 1 MHz might not be so simultaneous at 10 MHz or 100 MHz.

Another (advanced) reason why overclocking it might be hard is EM interference inside and outside the chip.


The 6502 has very limited pipelining, and every CPU cycle is tied to a memory access with no support for wait states or stalls. At 1 MHz it can work with really slow memory (roughly 500 ns), but at 10 MHz it needs ~60 ns, and at 20 MHz something like ~20 ns. The architecture simply wasn't designed for anything above single-digit MHz clock speeds.


"until pins stop wiggling"?


The pins don't physically wiggle. "pins wiggling" is a common metaphor for "the voltage level on a pin is changing".

As a signal driver is toggled at increasing frequencies ('cranking up the clock'), the signal amplitude (voltage difference between the 'high' and 'low' period) starts to drop. At a high enough frequency, the signal will be indistinguishable from noise and 'stops wiggling'.


> At a high enough frequency, the signal will be indistinguishable from noise and 'stops wiggling'.

It's not that the signal will be indistinguishable from noise, but that the CPU will stop working correctly, so its outputs will stop toggling (or will toggle in unexpected ways).


I had the impression that they might be talking about what a digital I/O pin looks like on an oscilloscope.

Scopes trigger on the first rising edge after the last horizontal scan, so the display starts at H, then drops to L after however long the CPU held that pin high. That creates  ̄ ̄l_ traces on the screen superimposed on one another, with the X position of the drop "wiggling" ( ̄lll_) depending on how many consecutive H bits happened to be sent.

When the CPU halts, the pin would flatline at H or L and you’ll know.


I was actually thinking about catastrophic failure, as in something on the chip burning out and making the clock stop.

My guess is that they were probably cooling it with beer (or at least cold beer bottle bottoms) to get that last critical MHz, before drinking the beer.


Wiggling a pin, i.e. toggling it. Basically means "activity on the pins".

https://www.cypress.com/blog/technical/more-pdl-examples-wig...


Euphemism for when you don't see output pins transition state on a measuring instrument, e.g. oscope.


I think it should be “start wiggling” as that makes more sense.


Wiggling pins is what a computer does when it is working.


Maybe the pins got so hot they melted into the board?


I love reading the stories about overclocking attempts (actually that scene doesn't seem to be much of a thing anymore?), people bolting pipes on top of CPUs and filling them with liquid nitrogen, nearly supercooling the CPUs and breaking speed records.


It's an interesting article, but...

Better title: Clocking a 6502 Simulator to 15 GHz. There are multiple efforts to recreate the physical 6502 CPU on modern hardware, this is not one of them and should not be confused with that.


I was a bit confused and expected to see some elaborate liquid cooling nonsense to get the poor chip up to 15 GHz.


Ok, we've put a simulator in the title above.


I realize I'm late to the party, but I've really been enjoying Ben Eater's series on building a simple computer with a 6502.

https://www.youtube.com/playlist?list=PLowKtXNTBypFbtuVMUVXN...


Yes, it's awesome.


After stumbling on Ben Eater’s “Hello world from scratch” [1] I went out and bought the CPU, some parts and breadboards. The chip is only a few dollars. It is highly recommended if you want to dive down into computers and digital logic from first principles. Also great fun to get a break from all the screens and layers upon layers of software that I have to deal with daily.

1. https://youtu.be/LnzuMJLZRdU


Kind of plugging my own project here, but I too am a software developer that found great joy from breaking away from all the layers of abstraction and working directly with the hardware. I created a portable game system with the 6502:

http://www.dodolabs.io/


This is awesome!


For more "very fast simple CPU" architecture, see 50,000,000,000 Instructions Per Second: Design and Implementation of a 256-Core BrainFuck Computer: https://people.csail.mit.edu/wjun/papers/sigtbd16.pdf


Huh... I hadn't considered it before, but Bender's brain could actually be a 6502, just being run at an insanely high clock speed. A few petahertz should be able to handle the AI involved, no?

Planck time is like 10^-43 seconds, so there's lots of room to divvy up a second for more processing power given advanced technologies...


If the hardware is advanced enough to do that, the AI software is similarly advanced, and a basic 6502 can produce a Bender like AI without breaking a sweat!


More advanced software would in all likelihood require significant calculations, rendering a basic 6502 useless.


Yeah... Maybe we'd need to bump up the clock to exahertz to make up for the loss of precision and constant memory access. The top supercomputer is already at 148 petaflops, so we'd need some more headroom for general AI.

Of course, petahertz (10^15 cycles per second) is already the speed at which an electron circles around a hydrogen atom, so we may not be able to use electricity any more...


It would be interesting to compare this project to simply converting 6502 assembly into LLVM IR, and letting Clang's optimization passes work their magic.

Obviously self modifying code would be hard to handle, but every other case ought to work, and the auto-vectorization ought to do amazing things to some loop-heavy code.


I was wondering about this. Does LLVM offer facilities expressive enough to model things like self-modification and arbitrary interruptability?

I agree it'd be wonderful to see auto-vectorization! Obviously, 6502 code does things 1 byte at a time so even adding 32-bit integers is painful. Auto-upgrade of those loops to 32-bit variants would be amazing.


Self modifying code was the norm. I hated the loss of optimizations you could do when going to a system with non-self-modifying code. I ended up just giving up on assembly.
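
A typical example of the kind of optimization being mourned, sketched from memory (labels and the subroutine are made up; the termination test is omitted):

  loop:  lda $8000      ; the operand bytes double as the data pointer
         jsr use_byte   ; do something with the byte (hypothetical subroutine)
         inc loop+1     ; self-modify: bump the low byte of the LDA address
         bne loop
         inc loop+2     ; carry into the high byte
         jmp loop       ; real code would test for the end address here

Patching the absolute address in place makes each load a 4-cycle lda abs instead of a 5+ cycle lda (zp),y and frees the Y register; a static translator has to notice that write into the instruction stream to stay correct.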


I remember thinking, "oh? this was made by people who can't code!?" then I quit entirely.

Before, stuff would reduce to nothing after use. Will this condition continue to change? Will it stay true? (remove the check) Will it stay false? (remove that chunk of code) And then remove this.

Sure, it was fun. More important: People wrote things that were truly impressive. Writing something that worked was only the beginning.

Until compilers know which buttons are used most frequently they can't fully optimize. Who knows, maybe one day Windows will give me a start menu and allow me to reboot when the application freezes up? Maybe one day text input will have some priority? The bare basics, basically?


A 6502 backend for LLVM has been attempted a couple of times [1][2], but the fact that the 6502 only has three registers imposes severe limitations w.r.t. LLVM's calling conventions.

[1] https://github.com/c64scene-ar/llvm-6502

[2] https://github.com/beholdnec/llvm-m6502


Other way around - the idea is to create an LLVM frontend for 6502 machine code, and then transpile that code to run on a modern CPU architecture.


That sounds like an interesting project!

You would probably want to add some tricks directly there, maybe register renaming (I don't know if LLVM does "variable renaming", let's put it this way)


That's the approach taken by https://github.com/libcpu/libcpu. It supports 6502 and several other CPUs.


I would think you’d need to write a couple of new optimization passes, for example to detect multiplications (in various variants, such as “signed 8 bit to 16 bit, multiplier is 13” or “unsigned 16 bit times 8 bit”) and convert them to LLVM multiplication instructions.

There also is the trick where “BIT” instructions are used to give a function multiple entry points, and that BIT instruction can also be a LDA# (https://retrocomputing.stackexchange.com/a/11132)

I’m not sure that can “simply” be converted to LLVM IR.
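
For reference, the trick from that link looks roughly like this (hypothetical labels; $2C is the opcode of BIT absolute, and its two operand bytes swallow the following LDA #):

  SetFlag:   lda #$01     ; first entry point: A = 1
             .byte $2C    ; BIT abs opcode; the next two bytes become its operand
  ClearFlag: lda #$00     ; second entry point: A = 0 (skipped when entered via SetFlag)
             sta flag     ; shared tail ("flag" is a hypothetical variable)
             rts

Entered at SetFlag, the LDA #$00 bytes are consumed as the operand of a (mostly harmless, but flag-clobbering) BIT, so recovering clean control flow from overlapping instruction streams like this in IR is indeed not trivial.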


Is this what you're (roughly) referring to?:

  https://andrewkelley.me/post/jamulator.html
Although the programmer's target is an entire system, the conclusion may still apply:

  > There is a constant struggle between correctness and optimized code. Nearly all
  > optimizations must be tossed out the window in the interest of correctness


This but at runtime.


I’m not sure: LLVM’s optimizer is fairly slow and there’s a lot of bad things going on that will constantly cause deoptimizations. Maybe for some of the hottest paths?


GEOS would have been much more responsive...


To solve the FF page wrapping problem, I wonder if it would work to double-map each 6502 page to x64 host pages side by side. I assume the word read at FF would straddle the two mapped pages, effectively reading the second byte at 00. You'd have to map to host page boundaries of course, and probably offset all reads/writes to the end of the host page at $3F00.


I believe VMware without hardware support for virtualisation also falls back to "binary translation" and similarly gets tripped up by SMC. I don't recall the details right now, but one of the ways to detect it was to modify an instruction in an obscure way that the developers had forgotten about.


I want to see a 6502-alike clocked to 15GHz with an exotic semiconductor such as GaAs, InP, SiGe, etc.


Probably wouldn't be able to see it with current fabrication technologies and historic transistor counts.


Exotic materials, other than maybe SiGe, use fabrication techniques less advanced than Si, and the transistor counts are much lower.

The Department of Defense funded an SBIR grant in the late 1990s to produce an InP-based microprocessor; given the limits of the time it would have been closer to a 6502 than a Pentium. There has not been word of such a thing since, which leads me to conclude that the topic is classified.

The worst limitation a 6502-era chip has is that it has no instruction cache, so instruction reads are fighting with data for memory bandwidth. You might even consider a Harvard architecture where the instructions go on a different bus. Without an I-cache there is no point in pipelining, but there is a lot of pressure to implement CISCy instructions such as the string copy operation from the 8086 line.

The other issue is that there is no DRAM replacement with exotic materials, and all the difficulties with interconnect latency get a lot worse than they already are. It's more clear how to make SRAM, so having somewhere between 64K to 1Mbytes of SRAM on die seems likely for an exotic material CPU.

Of course, armchair CPU designers are more likely to make progress with transport-triggered architectures and FPGAs in 2020.


Now someone needs to write an x86 emulator in 6502 asm and boot Windows.


The Mega65 runs a 6502 at 50 MHz, compatible with the Commodore 65 plus a C64 mode. http://mega65.org


I'm not 100% sure where the 15 GHz equivalent speed calculation comes from.


The author showed it at the end of the article. It's the "effective speed" reported by some benchmark programs (including calling subroutines, running for loops, iterating on a string, etc). These are simple and trivial programs and can be highly optimized in a simulator on modern x86_64. Real-world programs, like games, are slower, as acknowledged in the article.


A lot of BBC BASIC programs, doing real work (e.g. Mandelbrot drawing etc.), should have a shot at 10 GHz. Games are slower because they are hammering hardware registers external to the JIT (sound, graphics, keyboard polling, timing, etc.)

My laptop is an ancient 5th gen i5 with 2 keys having fallen off, so games are down in the 2 GHz - 3 GHz range for me. (Perhaps the missing keys make all the difference.)


I understand that some people look suspiciously at the 15 GHz mark, especially considering this was run on a 4.5 GHz processor. What I understand is that these benchmarks compare how long the workload would've taken on a stock 1 MHz 6502, and calculate the "clock speed" obtained as a ratio. So if I'm getting my result 10,000 times faster than a standard 6502, it means I'm at 10 GHz.

I also understand that this is possible because the emulator is running on a superscalar processor. Not sure if multicore has anything to do with it (the post specifically mentions the high performance of the single-core case for the processor used). Still, considering that processors back in the 6502 era had just one execution port, and superscalars these days have a lot (I think 8? I really lost track of what's usual these days), the figure makes sense all right, and without involving any kind of multithreading.

Kudos to the authors of the emulator for having a super-optimized system that can effectively and efficiently emulate its target!


I like the framing here, that of seeing this as a showcase of modern superscalar improvements. And yes, it's about single core performance only.

What is particularly interesting to me is how thoroughly superscalar "wins". Because of complexities with 6502 -> x64 mapping, and handling self-modifying code in particular, some of the most common 6502 instructions explode to multiple x64 instructions. Despite that huge extra instruction load, the translation still manages to run at much greater speed than a 1:1 instruction ratio.

Modern processors do not run on electrons. They run on unicorn tears and magic.


Note that there is also a speedup from dynamic optimization.


> I also understand that this is possible because the emulator is running on a superscalar processor.

It's also possible because the minimal architecture of the 6502 makes it inherently inefficient. With only three 8-bit registers -- which can't even be used interchangeably! -- and a non-addressable stack, a lot of CPU time on the 6502 is spent shuffling data around. Consider adding two 32-bit numbers, for example. On a 6502, this is a minimum of 38 cycles (clc + (lda, adc, sta) x4); an x86 can complete the same operation in one cycle, potentially in parallel with other operations.
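
Spelled out, that sequence looks like this (a, b and r are hypothetical four-byte little-endian values in zero page, the cheapest addressing mode; per-instruction cycle counts in the comments):

  clc          ; 2 cycles
  lda a+0      ; 3
  adc b+0      ; 3
  sta r+0      ; 3
  lda a+1      ; 3
  adc b+1      ; 3
  sta r+1      ; 3
  lda a+2      ; 3
  adc b+2      ; 3
  sta r+2      ; 3
  lda a+3      ; 3
  adc b+3      ; 3
  sta r+3      ; 3   -> 2 + 4 * 9 = 38 cycles total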


X86 ... not x64


The architecture never had a good name. AMD originally called it "x86-64" (but not AFAIK "AMD64", even though lots of other people did), but "x86_64" is most common in the open source world (I guess because the underscore makes it legal as a C symbol). "x64" is what Sun and Microsoft decided to use. Intel has called it "ia32e", "EM64T" and "Intel 64" at various times.

I think this article gets a pass.



