I've honestly been waiting for this for YEARS.
I've had a vision for "the future of computing": FPGAs that reconfigure themselves (<- these exist) to become whatever macro-level hardware assets your computer needs. Running tons of SHA-256 hashes per second? The CPU/OS/OpenSSL (or whatever) detects this condition and switches from the hand-coded software path to an IP core that ships with the CPU. The CPU flashes the FPGA to become SHA-256 "cores" and now you're getting 4096x the output with less heat. (CPUs are designed for one thing, "few, large, complex cases," while FPGAs are perfect for "many, parallel simple cases," even more than a GPU.) Now you shut down your hashing and switch to video encoding, or Doom 2019, and your CPU reflashes the fabric (Altera specialized in PARTIALLY reconfigurable FPGAs, so you don't have to nuke the entire FPGA, only sections) and adds cores for video, or physics, or "shader units".
This would be hard for a single person, but any large company could handle building it. You could even do it with off-the-shelf FPGAs. The biggest problems are 1) bandwidth/latency: the "macro" function has to be big enough to be worth the latency hit of asking the FPGA instead of computing it internally (Intel's on-CPU FPGA would give insanely fast access), and 2) how do you get people to use it? The simplest approach is, of course, only supporting people who explicitly request it. But you can also take libraries that perform common, encapsulated macro functionality, like OpenGL, a physics library, or OpenSSL, where people don't care about the inner code ("how it gets done") but only about the result. Asking for floats to be multiplied would be bad. Asking for a cross product would be much better. Asking for a SHA-512 digest would be super.
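To make the "macro granularity" point concrete, here's a rough sketch of what the dispatch inside a library like OpenSSL could look like. Everything FPGA-related in it (the fpga_sha256 stub, the "has the core been flashed yet" flag, the call-count heuristic) is invented for illustration, not a real OpenSSL or vendor API; the point is just that the public interface stays at whole-digest granularity, so you pay the trip to the FPGA once per hash instead of once per primitive operation.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical flag the OS/driver would flip once a SHA-256 core has been
    // flashed into the FPGA fabric. Invented for this sketch.
    static std::atomic<bool> g_fpga_sha256_ready{false};

    // Stub standing in for the flashed IP core (in reality an MMIO/DMA transaction).
    static void fpga_sha256(const uint8_t* data, size_t len, uint8_t out[32]) {
        (void)data; (void)len;
        std::memset(out, 0, 32);  // placeholder: a real core would return the digest
    }

    // Stub standing in for the existing software path (the library's C/asm code).
    static void soft_sha256(const uint8_t* data, size_t len, uint8_t out[32]) {
        (void)data; (void)len;
        std::memset(out, 0, 32);  // placeholder
    }

    // The library's public entry point stays at "macro" granularity: one call = one
    // whole digest, so the latency of reaching the FPGA is amortized over the whole
    // message instead of being paid per multiply or add.
    void sha256(const uint8_t* data, size_t len, uint8_t out[32]) {
        static std::atomic<uint64_t> calls{0};
        // Toy heuristic: after enough hot calls, assume the OS decided to flash the core.
        if (++calls > 100000)
            g_fpga_sha256_ready.store(true, std::memory_order_relaxed);
        if (g_fpga_sha256_ready.load(std::memory_order_relaxed))
            fpga_sha256(data, len, out);
        else
            soft_sha256(data, len, out);
    }

Callers keep calling sha256() exactly as before; whether the work lands in software or in the flashed core is the library's business.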
And the benefit here is that you don't have to hardcode that functionality into the CPU. The FPGA can have NEW or improved IP cores downloaded with Windows Update every week.
Back when I was in college, I actually bought a Lattice dev kit with a PCI Express card, dual gigabit Ethernet, DDR3, and a near top-of-the-line FPGA on the board, and it cost a mere $100. Unfortunately, I was more of a software guy, so I really got in over my head (plus health issues set me back and have never let up since), and I never got a working prototype built.
But it's still there! A huge opportunity waiting to be seized that could really become another tier of "the standard PC," in the same way we think of SSDs as "almost RAM" scratchpads, or GPUs as "CPUs for massive amounts of simple decisions." Well, an FPGA is the "GPU of GPUs": even simpler decisions, insanely fast and parallel even at "low" (by CPU standards) clock rates of 400 MHz.
Here's an older (2009) project/research article that inspired me, called the 512 FPGA cube:
http://cc.doc.ic.ac.uk/projects/prj_cube/Welcome.html
http://cc.doc.ic.ac.uk/projects/prj_cube/spl09cube.pdf
And here's a direct link to the data table comparing a single FPGA, the FPGA cube, and a Xeon (and a cluster of Xeons) doing the same work:
https://i.imgur.com/byjmEDG.png
Those are massive differences in both power efficiency and compute rate: 72,000 watts of Xeons to match the speed of a single 832-watt cube. That's roughly 87x, nearly two orders of magnitude!
I mean, imagine a world where they bothered to make FPGAs you could plug into Ethernet and have them configure themselves from a simple programming tool that was "easy" for normal programmers to exploit, instead of requiring an intense understanding of logic gates, propagation delay, and so on. A tool that wasn't "as fast as" a dedicated engineer, but 90% (or even 70%) as fast at zero cost and effort. All of a sudden you could run tons of programs and macro-sized functions as if you had personally stamped them into a printed circuit yourself, but without spending millions on development.
I'm honestly not sure why this hasn't already happened. I can't be the only "smart" person who came up with this idea. And the research (and the practice with bitcoin miners) all points to a huge opportunity waiting to be exploited if someone could lower the knowledge barrier to entry so you can basically "push a button" and unleash an FPGA at a problem. Imagine LAPACK and BLAS with FPGA support.
It just doesn't deliver on performance or energy consumption. That was always going to be the case (you are adding a layer of abstraction at the silicon level), so for anything that is measured in "operations / second" or "operations / joule," FPGAs will always lose out. By now the industry has learned that the key is to tailor algorithms to what we can do fast (vector + branching on CPU, everything branchless on GPU), not to shoehorn silicon into algorithms.
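As a small illustration of tailoring the algorithm to the silicon (my example, not from any real codebase): summing the positive elements of an array. The branchy form is fine on a CPU with a decent branch predictor; the branchless form is the shape you want on a GPU, where divergent branches within a warp serialize.

    #include <vector>

    // CPU-friendly: rely on the branch predictor.
    float sum_positive_branchy(const std::vector<float>& v) {
        float s = 0.0f;
        for (float x : v)
            if (x > 0.0f) s += x;
        return s;
    }

    // GPU-friendly shape: no data-dependent branch, every lane does the same work.
    float sum_positive_branchless(const std::vector<float>& v) {
        float s = 0.0f;
        for (float x : v)
            s += (x > 0.0f) ? x : 0.0f;  // typically a conditional select, not a jump
        return s;
    }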
So what can an FPGA do? Fast, low-latency, high-bandwidth interaction with peripherals. The irony is that for this to work out, you pretty much want your peripheral connected directly to the FPGA... which takes away all the fun of the reconfigurable stuff, because you can't reroute your PCB. So 99% of deployed FPGAs end up running the same configuration forever, and companies with the necessary scale pour the design into ASICs.
FPGAs solve the niche problem of interacting with very fast, massively parallel data buses and systems (think CCD sensors, ADC sampling, ...) that a linear-execution, Turing-style processor isn't suitable for. And pretty much only for applications where you don't have the volume to convince a chip manufacturer to put your peripheral into silicon.
> As CPUs are designed for one thing, "few, large, complex cases," while FPGAs are perfect for "many, parallel simple cases," even more than a GPU.
FPGAs are quite good at parallel simple cases, that is correct, but they would lose to GPUs in performance/watt in most cases.
Where FPGAs really shine is in parallel complex, non-uniform cases, especially cases that don’t map well to the classic CPU instructions, but can easily be performed with small latency on FPGAs.
FPGAs own low latency computation (less than 1 microsecond) because GPUs really need 3-20 microseconds to initialize after a kernel launch. This is why they're used instead of GPUs at the front line of high frequency trading. When I was at a hedge fund, I tried in vain to get Nvidia to do something about this based on the unofficial work of another former Nvidia employee implying this could be improved dramatically.
All that said, these are golden years to be a low-level programmer who understands parallel algorithms whether you work in Tech or at a hedge fund because there just aren't that many of us.
But the real problem with FPGAs is that even if they find another lucrative application where they excel relative to GPUs, Nvidia can simply dedicate transistors in their next GPU family to erasing that advantage, as they did with the 8-bit and 16-bit MAD instructions in Pascal and with the tensor cores in Volta. Too bad they don't care about latency; otherwise I believe they could push FPGAs out of HFT within a year or two once someone started using GPUs there and started winning.
Especially if the level of parallelism isn't too large, or if the memory bandwidth requirement of each case is low. The memory bandwidth of FPGAs is typically comically small compared to GPUs whenever they have to go to off-chip memory, and internal memory is usually limited to a couple of megabits.
0) Trying to do automatic parallelization is something we've been working on for 50 years, and we still haven't solved it to any practical degree. You can't, at this point, just slap a #pragma on C/C++ code to say "run this on some non-CPU architecture" and expect to get good performance.
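For reference, this is roughly what "slapping a #pragma on it" buys you today: a single OpenMP annotation gets you passable multicore CPU parallelism on a SAXPY loop, but there is no comparable annotation that reliably turns the same loop into an efficient FPGA pipeline; HLS tools still want explicit unrolling, pipelining, and interface decisions from the programmer.

    #include <cstddef>
    #include <vector>

    // y = a*x + y, parallelized across CPU cores with a single annotation.
    // Compile with e.g. -fopenmp; without OpenMP support the pragma is simply ignored.
    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(x.size()); ++i) {
            y[i] = a * x[i] + y[i];
        }
    }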
This (or something very close to it) was described in a 1993 (!) paper called "Processor Reconfiguration Through Instruction Set Metamorphosis," or PRISM.
Perhaps you might find it interesting. Have fun :)
I've been thinking of something similar recently - to extend your idea slightly, why couldn't the flashing of the FPGA happen per tick? Thus at every tick your FPGA could become entirely different hardware, tailored to whatever task is required at that tick.
Well, flashing is pretty slow compared to modern computers. You have to load the configuration from flash (no pun intended) memory, and Altera are the only ones (AFAIK) that support changing only subsections. (Which could be great, because you can reflash half or a quarter of your FPGA with new logic units while the rest keeps running.)
Also, you'd have to know what you'd need... before you need it. Which is kind of impossible. By the time you know you need tons of integer units, you probably could have already started working on them. That is, if you need to switch rapidly, your workload is probably completed pretty rapidly to begin with.
However, I don't think they need to reconfigure that fast. Once every second would be enough to keep up with most workloads; most "heavy duty" workloads aren't changing that rapidly. You load a video game, and it's a video game for the hours you play it. You load a web server, and you're going to be doing SSL.
If you need much finer control, it'd probably be better to treat the problem at a much higher level ("I need more SSL keys / sec" instead of "I need more integer adds to make SSL keys") or add another FPGA (one for each use case or set of use cases, a la one for web server keys, one for some other major web server feature).
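To make that "higher level" framing concrete, here's a toy sketch (all names invented, not any real FPGA-manager API): watch a macro-level rate like TLS handshakes per second and decide, at most once a second, whether to ask for another SHA core, instead of reasoning about individual integer adds. The thresholds are made up.

    #include <chrono>
    #include <cstdint>

    // Invented placeholder hooks; a real system would talk to an FPGA manager/driver.
    void request_extra_sha_core() {}
    void release_sha_core() {}

    class ReconfigPolicy {
    public:
        // Call this once per completed TLS handshake.
        void on_handshake() { ++count_; maybe_decide(); }

    private:
        void maybe_decide() {
            using namespace std::chrono;
            auto now = steady_clock::now();
            if (now - window_start_ < seconds(1)) return;  // decide at most once per second
            double per_sec = count_ / duration<double>(now - window_start_).count();
            if (per_sec > 50000.0)      request_extra_sha_core();  // thresholds are illustrative
            else if (per_sec < 5000.0)  release_sha_core();
            count_ = 0;
            window_start_ = now;
        }

        std::uint64_t count_ = 0;
        std::chrono::steady_clock::time_point window_start_ = std::chrono::steady_clock::now();
    };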
Of course, I'm no expert in the field. I'm just a guy with an idea and some experience / research into FPGAs as reconfigurable logic units.
Yes, it's the predictive aspect of it that I've struggled with also! I understand there are technical limitations at present; I was just trying to extend the idea beyond what's currently possible to see if it might be interesting.
> CPUs are designed for one thing "few, large, complex cases" while FPGAs are perfect for "many, parallel simple cases" even more than a GPU
<offtopic>Hmm, where have I heard something like this before ... ah, yes, the brain - CPUs/FPGAs are like reason and instinct, because reason deals with "few, large, complex cases" and instinct has "many, parallel simple cases". The brain has its own CPU/FPGA divide.</>
Wow, that is a solid plan. I propose you (or whoever builds it) call it Nitro or something to that effect. Speed how you need it. Imagine an open library of FPGA profiles for popular apps. Build one for Photoshop and you have yourself a nice biz.