Launch HN: Tensil (YC S19) – Open-Source ML Accelerators
96 points by tdba on March 11, 2022 | 87 comments
Hello HN! I'm Tom, co-founder at Tensil (https://www.tensil.ai/). We design free and open source machine learning accelerators that anyone can use.

A machine learning inference accelerator is a specialized chip that can run the operations used in ML models very quickly and efficiently. It can be either an ASIC or an FPGA, with ASICs giving better performance and FPGAs being more flexible.

Custom accelerators offer dramatically better performance per watt than existing GPU and CPU options. Massive companies like Google and Facebook use them to make training and inference cheaper. However, everyone else has been left out: small and mid-sized companies, students and academics, hobbyists and tinkerers currently have no chance of getting custom ML hardware. We aim to change that, starting with ML inference on embedded and edge FPGA platforms. Our dream is that our accelerators help people make new applications possible that simply weren't feasible before.

We believe that advances in AI go hand in hand with advances in computing hardware. As a couple of software and ML engineers hoping to live in a world alongside intelligent machines, we wanted to know why those hardware advances were taking so long! We taught ourselves digital design and gradually realized that the next generation of hardware will need to be finely customized to enable state of the art ML models at the edge, that is, running on your devices and not in the cloud. In the CPU world, the RISC-V RocketChip implementation has proven the value of customizable compute hardware. The problem was that no-one was building that kind of capability for ML acceleration. We started Tensil to build customizable ML accelerators and see what kind of applications people can create with them.

Tensil is a set of tools for running ML models on custom accelerator architectures. It includes an RTL generator, a model compiler, and a set of drivers. It enables you to create a custom accelerator, compile an ML model targeted at it, and then deploy and run that compiled model. To see how to do this and get it running on an FPGA platform, check out our tutorial at https://www.tensil.ai/docs/tutorials/resnet20-ultra96v2/.

We developed an accelerator generator in Chisel and then wrote a parameterizable graph compiler in Scala. (Fun fact: unlike in software, formal verification is actually a totally viable way to test digital circuits and we have made great use of this technique.) The accelerator generator takes in the desired architecture parameters and produces an instance of the accelerator which can be synthesized using standard EDA tools. The compiler implements ML models using the accelerator’s instruction set and can target any possible instance of the accelerator.

Currently, the accelerator architecture is based around a systolic array, similar to well-known ML ASICs. You can view the architecture spec in our documentation. The compiler performs a wide variety of tasks but is optimized for convolutional neural networks. There are also drivers for each supported platform, currently limited to FPGAs running bare-metal or with a host OS.
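For intuition, here is a minimal NumPy sketch of the output-stationary systolic-array idea: a grid of multiply-accumulate cells, one per output element, each holding a partial sum while slices of the inputs stream through. This is a conceptual illustration only, not Tensil's actual RTL or instruction scheduling.

    import numpy as np

    def systolic_matmul(a, b):
        # Conceptual output-stationary array: cell (i, j) accumulates
        # a[i, k] * b[k, j] as the k-th slices stream through.
        n, k = a.shape
        k2, m = b.shape
        assert k == k2
        acc = np.zeros((n, m))        # one accumulator register per cell
        for step in range(k):         # one slice per "cycle"
            acc += np.outer(a[:, step], b[step, :])
        return acc

    a = np.random.rand(4, 8)
    b = np.random.rand(8, 4)
    assert np.allclose(systolic_matmul(a, b), a @ b)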

When you tell the driver to run your ML model, it sets up the input data and then streams the compiled model into the accelerator. The accelerator independently accesses host memory during execution. When the accelerator is done, the driver is notified and looks for the output in the pre-assigned area of host memory.
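To make that flow concrete, here is a hypothetical host-side sketch in Python. The module, class, and file names (tensil_driver, Driver, load_model, run, accel.bit, resnet20.tmodel) are invented for illustration and are not Tensil's actual driver API; see the tutorial linked above for the real interface.

    import numpy as np
    import tensil_driver  # hypothetical module name, not the real Tensil driver API

    # Program the FPGA with the generated accelerator bitstream, then point
    # the driver at the artifacts produced by the Tensil compiler.
    driver = tensil_driver.Driver(bitstream="accel.bit")
    model = driver.load_model("resnet20.tmodel")

    # The driver places the input in host memory, streams the compiled model
    # into the accelerator, and reads the output from the pre-assigned region
    # of host memory once the accelerator signals completion.
    x = np.random.rand(1, 32, 32, 3).astype(np.float32)
    y = model.run(x)
    print(y.argmax())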

How are we different from other accelerator options? There are many ML ASICs out there but they are all locked into a single architecture, whereas we have customization at the core of our technology. This offers the potential for a better trade-off between performance/price/watts/accuracy. Compared with other FPGA options, Xilinx DPU is great but it’s closed source and can be difficult to work with if your model is in any way customized. By going open source, we aim to support the widest possible range of models. FINN is a very cool project but requires big changes to your model in order to work, and also typically requires large FPGAs which are unsuitable for edge deployments. We work out of the box with any model (no need to quantize), and on small edge FPGAs. For embedded systems, tflite/tfmicro are great for deploying very small ML models on extremely constrained edge devices, but they are limited in terms of the performance and accuracy that can be achieved. Our tools allow you to work with full size state of the art models at high accuracy and speed.

Currently we're focused on the edge and embedded ML inference use case. If you run ML models using any of the major frameworks (TensorFlow/Keras, PyTorch, etc.) on small, embedded or edge devices then Tensil is a good fit for you right now. If you primarily run inference in the data center or need lots of training acceleration, reach out to us and we can walk you through our roadmap. For now we are focused on CNN inference on edge FPGA platforms, but our aim is to support all model architectures on a wide variety of fabrics for both training and inference.

The core technology will always be free and open source, but we plan to offer a “pro” version with extra enterprise features under a dual license arrangement, similar to Gitlab. We are also working on a cloud service for running our tools in a hosted setup, in which you’ll be able to run a search across all possible Tensil architectures to automatically find the best FPGA for your model.

If you're interested to learn more, check out our docs (https://www.tensil.ai/docs), our Github repo (https://github.com/tensil-ai/tensil) and join our Discord (https://discord.gg/TSw34H3PXr). And feel free to reach out any time (email in profile).

We’re here to enable you to develop amazing new ML based applications, so we’d love to hear your experiences of working with ML compute hardware, whether it be CPU, GPU, or some other specialized platform. Have you had to make major changes to your ML models to get them to run on the available hardware? Are there any cool features or UX improvements that you wish hardware makers would add? Are there features that you’d like to add to your own applications but don’t know how you’d get them to work on an edge device? Looking forward to your comments!




Wow! This looks amazingly impressive. Super-duper good. If the software is as impressive as the website (haven't tried it out yet!), you'll make tons of $$$ on this product. If I were a VC with millions I'd be begging you to take some. I wonder if you plan to support CGRAs and LSTMs? Also, what about quantized and compressed models? Model compression is a sore point and AFAIK there aren't any good tools that let you make tradeoffs between accuracy and compute efficiency.


Thank you for the kind words! Just to clarify, the core technology here is free and open source, anyone can use it right now for free. We do have commercialization plans in addition - we may explore things like additional paid features for enterprise use or paid tiers of extra support.

Regarding LSTMs, yes. We're aiming to support all machine learning model architectures: do you have any particular models you're interested in that we should be prototyping with?

For CGRAs, we don't have any immediate plans to explicitly support them. What kind of use case do you have in mind? Generally, any platform that can implement a blob of generated RTL should be something we can work with quite easily.


I am guessing some kind of vivado replacement is in the works? As it stands now your product seems to be fully offline+CLI - pretty please just tell me how you can get revenue from that?!

That said, I am going to get a PYNQ-Z2 just to try this out! Btw, quick glance at the tutorial says Z1.. can I assume Z2 would be barely an inconvenience?


Yes, Pynq Z2 should work just as well (it's the exact same FPGA, just a slightly different board). We've been testing with Z1 which is why I recommended it in the tutorial.

For commercialization, the core technology will always be free and open source, but we plan to offer a “pro” version with extra enterprise features under a dual license arrangement, similar to Gitlab. We are also working on a cloud service for running our tools in a hosted setup, in which you’ll be able to run a search across all possible Tensil architectures to automatically find the best FPGA for your model. I'd love to hear your feedback on these plans!


Just saw your edit re: model compression. One thing that Tensil can do is help you avoid the need to quantize or compress your model entirely! For example, we've found that using a 16-bit fixed point numeric data type preserves almost all the model accuracy while not sacrificing performance thanks to the huge amount of parallelism available on FPGA.
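To make that concrete, here is a tiny NumPy sketch of round-tripping weights through a signed 16-bit fixed-point format (8 fractional bits here, chosen purely for illustration) and checking the rounding error introduced:

    import numpy as np

    def to_fixed(x, frac_bits=8, total_bits=16):
        # Round to the nearest multiple of 2^-frac_bits and clip to the signed range.
        scale = 1 << frac_bits
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
        return np.clip(np.round(x * scale), lo, hi).astype(np.int32)

    def from_fixed(q, frac_bits=8):
        return q.astype(np.float32) / (1 << frac_bits)

    w = 0.5 * np.random.randn(1000).astype(np.float32)   # stand-in for model weights
    err = np.abs(from_fixed(to_fixed(w)) - w)
    print("max rounding error:", err.max())               # about 2^-9 unless values clip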

The broader point is that Tensil is extremely flexible, so you can try out lots of different accelerator configurations to find the one that works best for your ML model. Think of it as optimizing the hardware first, then the software if needed.

We're actually working on a tool to manage and automate this hardware architecture search - watch this space!
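As a rough illustration of that kind of search, here is a toy Python sketch that sweeps systolic array sizes and ranks them with a naive cost model (ideal MACs per cycle = array size squared). The clock frequencies and MAC count are made-up, illustrative numbers; real results depend on utilization, memory bandwidth, and the compiler's schedule.

    # Candidate array sizes and made-up achievable clock rates per size (MHz).
    array_sizes = [8, 16, 32]
    clock_mhz = {8: 250, 16: 200, 32: 150}
    model_macs = 40e6  # rough order of magnitude for a small CNN like ResNet-20

    def toy_latency_ms(size):
        # Ideal case: size^2 multiply-accumulates retired per cycle.
        cycles = model_macs / (size ** 2)
        return cycles / (clock_mhz[size] * 1e3)

    for size in array_sizes:
        print(f"{size}x{size} array: ~{toy_latency_ms(size):.2f} ms (ideal, toy model)")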


Wow! After repeatedly and unsuccessfully trying to get an overview of NN accelerators today, I just found this on the HN homepage. It looks very promising, and to me this seems like a very logical approach in terms of efficiency (besides analog computers).

I would also be very interested in some benchmarks comparing the generated hardware with things like Google Coral or Nvidia Jetson.

I am sure this will be a success.


Glad this helped clarify things for you! The tricky thing about benchmarks is that one of the key benefits of Tensil is the flexibility to find a trade-off between performance, accuracy, cost and power usage that works for you. Benchmarks that only consider performance or performance per watt can be a bit narrow from that point of view. That said, this is a good idea and we'll add some comparisons that we think make sense to the docs!


I wanted to add something about the Xilinx DPU, and you touched on the subject, but I was quite unhappy with the soft-IP thing. It embeds the instructions for all kinds of networks, taking a lot of gates for unused features, it's not very customizable, and perf for anything other than vanilla conv2d stuff quickly drops off. Buying an Alveo board to get such low inference perf was a gut punch.

FINN seems far better there. At least you get millions of inferences/sec on simple quantized CNN1Ds.

The XRT API is simple and relatively OK, too. Stream data, execute inference, fetch results, mostly sync, so you have to wrap a lot of threading around it, but the basics are there.


Yep, this is something we've heard before. If you're really familiar with the Xilinx ecosystem, one way we've described Tensil is that it is the "Microblaze for ML" - easy to use, lots of flexibility and customizability, with performance good enough for most applications. The DPU and FINN would then be the more specialized tool for situations where you need specific features they are optimized for.


Ha, now you've made me curious. Let's see how everything progresses then. Thanks for the earnestness on these comment threads.


You're very welcome! Stay in touch - I've listed some contact methods here and there in the thread, and we'd love to hear from you again.


Thank you!


One more thing to keep an eye on in the NN accelerator world is Tenstorrent. That thing looks amazing, but it's mostly for datacenter and 'heavy' edge (PCIe board, at least 75W, so to measure against Alveo U50/U55 and Tesla T4/A30, and up to 300W, so A40/A100).


Thanks, we'll take a look!


How does this compare to Coral's USB Accelerator [1], which apparently uses Google's TPU? I'm guessing Tensil is better for companies that are already either working with an FPGA or producing custom silicon, but the Coral product might be easier to get started with when prototyping on something like a Raspberry Pi.

[1]: https://coral.ai/products/accelerator


Coral is a great project, especially if you are using a completely vanilla off-the-shelf model. However if you've ever tried compiling a custom ML model for it, you know how finicky it can be. There are lots of ways that you can accidentally make it impossible for Coral to run your model, and it can be difficult to figure out what went wrong.

With Tensil, you circumvent that problem by changing the hardware to make it work for your model. If you have modified an off-the-shelf model or trained your own from scratch, Tensil might be a better option from the point of view of ease of use and even performance.


Ah, thanks for that clarification. I see that your tutorial is using the Avnet Ultra96 V2 dev board. Do you have anything that would work with a Raspberry Pi? Maybe some kind of FPGA addon board? Or do you feel that the Raspberry Pi isn't a good starting point for developing a real commercial product?


This is a great idea, we're looking at boards that could be used in combination with a Raspberry Pi. The reason we haven't investigated this so far is that most of the dev boards we've tested with have an ARM core embedded in the FPGA fabric, so the additional CPU the Raspberry Pi would provide wasn't necessary.


Looks like pynq-z2 has header pins which connect with raspberry pi. https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html


Heh very similar experience with myriad-x there. Going off the beaten path is a pain, especially since the low-level is now so hidden...


Absolutely, the UX for compiler tools often leaves a lot to be desired. This is something we want to fix!


This is hard, very hard stuff. Between MLIR, the XLA world, most HLS things (generalist tools leave a lot of perf on the table and you often end up in VHDL/asm anyway, while specialised stuff is often too restricted...) and the Vivado 'let's write HDL/RTL like C' approach, many have broken their teeth.

I wish you good luck there, but you're taking on a huge task. You have all my congrats for going open source, and I think it's now pretty much the only way forward. FINN is OSS and I'm very happy to have an OSS alternative. If only old Altera would go full OSS on new AI+FPGA stuff, maybe we'd see great cross-pollination.

Anyway, if Intel FPGA people aren't watching this, I can assure you they'll be looking soon.


Thank you - we'd love to see more OSS support from FPGA vendors too and we'll be watching closely for any developments there.


Apart from the discontinued Intel sticks, are there pytorch compatible USB accelerators on the market?


I'm not completely sure if the answer is no but I have had difficulty finding anything like that.


So, Tensil looks really cool. One of the constraints listed in the docs though is that it only supports convolutional networks at the moment.

What does the timeline look like for supporting some of the more popular transformer/attention-based architectures?


We're working on our roadmap right now and prioritizing support based on user interest. If there's a particular model or set of models you're interested in accelerating, I'd love to hear about it!

If there's a lot of interest in transformers, we'd aim to offer support in the next couple of months.


A lot of SOTA work seems to be gravitating towards transformer-based models. Obviously I can't speak for the entire field, but you can just go take a look at the most popular HuggingFace repos and see what I mean. They started out focused on language, but because transformers have become so popular, they're expanding into the audio and vision domains quickly. Their 'transformers' library is, outside of research, most people's go-to high-level framework, as it largely abstracts away a lot of the boilerplate that writing in pure TF, PyTorch, or JAX requires.

See:

https://huggingface.co/spaces

https://github.com/huggingface/transformers


Agreed, this is the way things seem to be trending. We'll definitely add support for transformers in the near future, the question is only whether there are other things we should work on first, especially with respect to the edge and embedded domain where smaller conv models still dominate. Thank you for the links!


Wow, congratulations on the launch! I agree whole-heartedly that custom accelerators will fuel the next era of AI/ML advances.

I'm the founder of PrintNanny https://printnanny.ai/, which seems to fit the current use case for Tensil. My model's architecture is a "classic" CNN feature extractor, SSD box/region proposals, with a final non-max suppression op. I currently run a uint8 quantized TensorFlow Lite model on Raspberry Pi, without additional acceleration - but I'm very familiar with the hassle of using partially-closed source accelerators like Coral's Edge TPU. Excited to read through the graph compiler!

I joined your Discord, looking forward to tracking Tensil's progress.

I confess I'm curious how you currently make money, or intend to. How much time are you giving yourselves to figure out a sustainable financial model?

For what it's worth, I'm also ex-Red Hat and thoroughly understand the advantages of paying for high-quality support. I also want to re-iterate that I think that the accessibility of open-source ASIC/FPGA tools will define the future of AI/ML. This is important work that will change the world - I'm excited to see someone tackling it!


Awesome, I think your use case would make a lot of sense for Tensil. Looking forward to chatting more!

The core technology will always be free and open source, so to commercialize Tensil we're planning to offer a "pro" version which would operate under a paid license and provide features specifically needed by enterprise users. We're also working on a web service that will let you run Tensil's tools in a hosted fashion, with one major feature being the ability to search across a large number of potential architectures to find the best FPGA for your needs. Extra paid support and other services will also be in the mix.


What kind of FPGAs can this reasonably run on? Is that model dependent? Could a small model run on an ICE40 FPGA? I looked over the doc but I can't find anything concrete.


It depends on the model, yes. Here are some examples in the benchmarks section of our docs: https://www.tensil.ai/docs/reference/benchmarks/

We haven't specifically tested on any ICE40 FPGAs yet - if this is something that you'd really like to see, let me know! Taking a look at the lineup, the ICE40 LP8K and LP4K would be suitable for running a very small version of the Tensil accelerator. You'd want to run a small model in order to get reasonable performance.

Generally speaking, FPGAs with some kind of DSP (digital signal processing) capability will work best, since they can most efficiently implement the multiply-accumulate operations needed.
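As a back-of-the-envelope sizing exercise, the number of parallel multiply-accumulates is roughly bounded by the number of hardware multipliers (DSP blocks) on the part. Here is a sketch; the DSP counts below are from memory and should be treated as assumptions to verify against the vendor datasheets.

    import math

    # Approximate DSP/multiplier counts; check datasheets before relying on them.
    dsp_blocks = {
        "iCE40 UP5K": 8,                 # SB_MAC16 multipliers
        "Zynq-7020 (Pynq-Z1/Z2)": 220,   # DSP48E1 slices
        "ZU3EG (Ultra96-V2)": 360,       # DSP48E2 slices
    }

    for part, dsps in dsp_blocks.items():
        # Largest square systolic array if each MAC cell consumes one multiplier.
        n = math.isqrt(dsps)
        print(f"{part}: up to ~{n}x{n} array ({n * n} of {dsps} DSPs)")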


I think iCE40 LP/HX series are the biggest ones, but the iCE40UP5K is also neat: it has hardware multipliers unlike the LP/HX, and a relatively large 1 megabit RAM on-chip. Unfortunately, I think the UP family is relatively slow (as in propagation delay/max clock frequency).


Thanks for pointing this out! The UP5K does look promising.


I'm curious if you have any benchmark (or anecdotal evidence) on the relative perf&power efficiency of using the DSP blocks of the FPGA boards or not?


I don't have hard numbers at hand, but I'd estimate something like an order of magnitude improvement for using DSP for multiplication vs not. If they're available on the fabric, you'll definitely want to use them! If this is an experiment you want to run, I'd be very happy to help you figure out how to do it.


Cool! Yeah I would be interested in that. I would actually have some use cases for edge compute if it can fit into tiny FPGAs like the ICE40.


Here is an example of deploying a basic ML application to an ICE40 using a custom Keras to NN generator https://github.com/edge-analytics/fpga-sleep-tracker

So it definitely can be done with some careful attention to the limited number of multipliers on the device. I’ll be curious to check out how Tensil does in terms of mapping with highly resource constrained FPGAs. Regardless, Tensil looks like a very cool tool.


Wow, awesome project! This is exactly the kind of thing we had in mind when we built Tensil. I'd be very curious to hear what happens if you make a v2 perhaps using Tensil for comparison.


That's excellent - feel free to join our Discord if you'd like to brainstorm ideas or get help choosing models and boards https://discord.gg/TSw34H3PXr


Very cool work, congratulations on the launch! Can you comment on how you see the trend of edge computing evolve in the future for SBCs? In terms of perf per watt, could FPGAs compete against a coral-style TPU? What if we had open Mali GPU or NPU APIs to program against the chips already present on SBCs? I'm just a hobbyist so I know very little of what people actually deploy in industrial settings - which would be your target customers.


Cheers, and great question! FPGAs are pretty amazing devices, but one thing that's been holding them back is how difficult they have been to work with. Typically to actually make use of an FPGA you'd need to have an FPGA expert and an embedded software engineer on your team, along with all the requisite tools and materials.

That has started to change dramatically in the last decade, with open source FPGA toolchains like yosys, runtimes like the PYNQ framework and RTL generator tools like Tensil being developed. When you put these things together, working with FPGAs starts to become as easy as using any other compute platform. For that reason, I think there are lots of applications involving FPGAs that will soon be invented to take advantage of this trend. One could speculate that the reason Intel and AMD are buying up FPGA vendors is because they see the potential there.

As far as head-to-head comparisons go, as long as you're running the workload it was designed for in the environment it was designed for, an ASIC will always be the best possible perf per watt. The question is what happens when you go outside those bounds. Can you take your model, swap out a layer, and have it run just as fast on your Coral or NPU? Probably not, at least right now. But with Tensil, you can re-run your architecture search to find the best accelerator, and take advantage of it right away.


Nice work with this. I was wondering, are all computations other than convolution performed on the FPGA as well - such as pooling, padding, inter-layer quantization operations (rescaling & offset additions)? If not, does the FPGA offload unsupported operations to the host before continuing? Does the FPGA need to transfer intermediate layer IO data back and forth between the host during GEMM if the data become too large to fit on the FPGA SRAM? Thanks


Great questions! With Tensil, all computations are performed on the FPGA. In addition to matrix multiplication, Tensil supports a SIMD instruction with various operations. ML activations, average and max pooling, normalization, and image resizing all use the SIMD instruction. Some ML operations, such as padding, are achieved by changing the memory layout. Tensil uses the DRAM0 and DRAM1 memory pools (usually in DDR memory) to interact with the host, reading model inputs and weights and writing outputs. It also uses these pools to offload intermediate results between layers, and between tiles within a layer, when the FPGA does not have sufficient BRAM, which is common on lower-end devices. The Tensil compiler takes care of finding the most efficient memory scheduling for a given on-FPGA memory size.
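To illustrate the tiling idea, here is a simplified NumPy sketch of computing a large matrix multiply block by block, with only a few small tiles "resident" at a time and partial results accumulated in a buffer that stands in for the DRAM pools. The tile size and loop order are illustrative, not the compiler's actual schedule.

    import numpy as np

    def tiled_matmul(a, b, tile=64):
        # Accumulate a @ b in (tile x tile) blocks; only three small blocks
        # need to be on-chip at any time, everything else stays in "DRAM".
        n, k = a.shape
        _, m = b.shape
        c = np.zeros((n, m), dtype=np.float32)   # stand-in for an output pool in DRAM
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                for p in range(0, k, tile):
                    c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
        return c

    a = np.random.rand(256, 192).astype(np.float32)
    b = np.random.rand(192, 128).astype(np.float32)
    assert np.allclose(tiled_matmul(a, b), a @ b, rtol=1e-4)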


Okay thanks, so are the DRAM0 & DRAM1 memory pools located in the host DDR memory, or are they part of separate DRAM hardware on the FPGA board (kind of like how GPUs have their own separate DDR DRAM)? I definitely want to dive deeper into the source code of this project at some point and see how the compiler and everything works.

Edit: Sorry I think you already clarified that the DRAM0 & DRAM1 memory pools are located on the host


Something like an Alveo PCIe card has onboard HBM/DDR4 memory large enough for the Tensil DRAM pools, so that would be similar to how a GPU operates, but it could also reach into host memory via PCIe if needed. Embedded applications with Zynq-7000 and UltraScale+ have ARM processors on the same chip as the FPGA and (usually) DDR as separate chips on the same PCB. In that case, the Tensil DRAM pools are just contiguous blocks in the memory shared with the CPU. We will be publishing documentation on the compiler design soon, so stay tuned!


Hi, maybe you've addressed this somewhere and I haven't read fully (sorry) but how does it compare to FINN from Xilinx?


FINN is a very cool project, but usually requires big changes to your model in order to work, e.g. quantizing down to 1 or 2 bit weights. It also works best on large FPGAs which are unsuitable for edge deployments. Tensil works out of the box with any model (no need to quantize / compress) and on small edge FPGAs.


Thanks for your answers.

True that, to get crazy perf with FINN one needs to quantize like crazy (at least that's the default strategy, though it might change if/when it can synthesize to use DSP slices or the shiny Versal weird cores). Now I'll have to take a look at Tensil. How would it scale on large FPGAs though? Would you leave the floor planning to a seasoned VHDL person? Does Tensil handle it (generating parallel pipelines, maxing out performance using all resources on chip)? Say, for someone doing 1D CNNs or 1D VAEs with (tens of) millions of inferences/second on a continuous stream (low batch size)? :-)

I'm not sure what Intel proposes nowadays on that front, with the abandonment of OpenVINO for FPGA. No idea how one could use the Stratix 10 NX with its 'AI cores' with actual neural networks. Tensil might be a gateway for all this (I sadly don't have much hope for FINN becoming cross-platform...).


So far we've been focused on edge devices like the Zynq, Artix and Zynq UltraScale+ families. Tensil certainly works on larger devices, but it's not as optimized there as we'd like it to be. If that's interesting to you, I'd love to talk and understand your use case in more depth.

The Intel FPGA side is interesting, as you say there are fewer projects targeting their technologies for ML use cases. We haven't tested support for their boards yet, but there is nothing in our generated RTL that is exclusive to Xilinx. The only thing we'd need to add is new drivers for their platforms.


Would love to take a look at this. We just launched our FPGA-based cloud platform last year and currently we offer all of the Alveo series and some Intel as well. vmaccel.com


VMAccel looks very interesting! Send me an email and we can explore how to collaborate.


(This comment was originally posted at https://news.ycombinator.com/item?id=30615605, where the question made more sense, but I've moved it into the new thread because it's interesting.)


What physical connection is required between the FPGA and host? For example, do they communicate through a PCIe connection?


In our current demos, the Tensil logic talks to the host through a couple of AXI and AXI Stream interfaces. There are AXI adapters for many other protocols, including PCIe, that should be able to support many different kinds of connectivity. Here's a link to our docs explaining the host<->Tensil connection: https://www.tensil.ai/docs/howto/integrate/#2-connect-the-ax...


What sort of latency can you get on edge devices? Are there cases that processing can be done with 3ms or 10ms latency?


Definitely! You can see some of our benchmarks here, and we'll be expanding this list soon https://www.tensil.ai/docs/reference/benchmarks/


What’s the difference? https://hailo.ai/


Generally the comparison between Tensil and any fixed ASIC is going to run along similar lines, which we explain in this comment regarding the Coral accelerator: https://news.ycombinator.com/item?id=30643520#30645318

The big difference is that while those fixed ASICs offer great performance on the set of models they were optimized for, there can be big limitations on their ability to implement other more custom models efficiently. Tensil offers the flexibility to solve that problem.


I'm an ML engineer but I know nothing about the inference part. Are there really so many kinds of devices that optimizing inference for a particular device is a thing? I thought almost everyone serves from GPUs/TPUs, and hence there are only two major device types. What am I missing here?


There are four big categories of ML accelerators. You're already familiar with CPUs and GPUs; then there are FPGAs, which offer better performance and efficiency while remaining flexible. Finally, there are ASICs (of which the TPU is an example), which offer the best performance and efficiency but retain very little flexibility, meaning that if your ML model doesn't work well on an ASIC, your only option is to change your model.

We chose to focus on FPGAs first because with them we can maximize the usefulness of Tensil's flexibility. For example, if you want to change your Tensil architecture, you just re-run the tools and reprogram the FPGA. This wouldn't be possible with an ASIC. That said, we'll be looking for opportunities to offer an ASIC version of our flow so that we can bring that option online for more users.


I saw somewhere that 95% of all ML inference is still done on CPUs.


It's true that inference is still very often done on CPU, or even on microcontrollers. In our view, this is in large part because many applications lack good options for inference accelerator hardware. This is what we aim to change!


So, in your opinion, why would those CPU users want to migrate to an FPGA and your software rather than to Nvidia T4 or Tegra and CUDA?


It depends on the application. For some use cases, moving to a GPU makes total sense. However, if you have power constraints, form factor constraints, performance constraints or simply want to be in control of your own hardware, using an FPGA with Tensil may be a better option.


Is this going to be ad-supported in the future? Like, in an IDE?

Assuming you guys are not a nonprofit. :)

Just curious how money can be made from what seems like an FOSS CLI offline product.. maybe maintenance subscriptions somehow, then?


I replied to your other comment here about commercialization https://news.ycombinator.com/item?id=30652150 I hope it was helpful!


Hah, not really.. I had already read your canned reply by that time. But I guess it was a combo of my lack of imagination and you not wanting to be constrained / not having fleshed out the details, etc. Now, thinking about it, I can imagine a cloud sandboxed IDE interface, like repl.it. The tricky part: how do you interface with an edge/client device? Maybe your compiler emits (special) "wasm" or something.. (you could ship a Docker image, but that's still another moving part; here's where the GitLab-like hosting comes in?) ..pretty sure my wanking here won't help you that much though lol


Great project! How does the performance compare with conventional CPU/GPU based inference? Those devices are usually a lot higher power (and bigger/more expensive), but obviously do not benefit from specialization.


Thanks! The general answer is that it depends on your model and on which FPGA platform we're talking about, but in a head-to-head benchmark test you'll find results in the ballpark of 2-10x CPU and 0.5-2x GPU. As you point out, the power and cost are big differentiators. The other thing to consider is (as another commenter mentioned) that usually inference on CPU or GPU will require you to do some model quantization or compression, which can degrade model accuracy. Tensil can give you a way around that dilemma, so that you can have great performance without sacrificing accuracy.


Hi, I'm curious what you mean about model quantization being necessary on CPU and GPU? It's not necessary by default, as OpenVINO, TVM, and TensorRT can run single-precision inference on most classic models quite fast. If you're reaching for very low power or ultimate perf, yeah, you can downgrade to fp16 (well... mixed precision) with NVIDIA tensor cores or AVX-512 FP16, or to bf16 in some Intel VNNI configs. Going to integer will give you more throughput too, but it's not necessary. Even the Myriad X is supposed to handle some kind of fp16 with the SHAVE cores.

The only time I had to reach for quantized (integer) networks to do anything at all was inferencing on FPGAs. Are you targeting DSP slices by default, or implementing full IEEE 754 floating point by default?

Are you saying that with Tensil you can run single precision non-quantized models with up to 2x gpu perf?

I probably misunderstood your last sentence, sorry.

Genuinely curious!


Sorry if this was unclear - in a datacenter use case you are right, but for an edge deployment, you will usually need to quantize, prune or compress your ML model to get it working as fast as you'd like on a sufficiently small CPU/GPU. Compared with running your ML model unchanged on those platforms, Tensil can run with the performance ranges listed above. You can also quantize and use Tensil too!


It'd be great if you could add benchmark numbers for this comparing CPU/GPU on inference / sec and inference / watt.


Will do - as I mentioned in another comment, it can be a bit subtle to find an apples-to-apples comparison, but we'll soon add some cross-platform comparisons that we think are reasonable.


Please compare against https://NN-512.com


Sure, we'll check it out!


I tried to look for it but didn't find how much better your compiled model is when compared to tensorflow/pytorch natively run on the device. Do you have this somewhere?


If you have a device with known performance in mind, you can compare against our benchmarks listed here https://www.tensil.ai/docs/reference/benchmarks/

We'll be expanding this list and adding more comparisons to other platforms in the near future.


how does this compare to apache TVM?


Great question - TVM / OctoML are a great option if you have an off-the-shelf ML model and off-the-shelf hardware. Tensil is different in that you can actually customize the accelerator hardware itself, allowing you to get the best trade-off of performance / accuracy / power usage / cost given your particular ML workload. This is especially useful if you want to avoid degrading the accuracy of your models (e.g. through quantization) to achieve performance targets.


That makes sense. So is this only for edge compute use cases, or can I use tensil on an FPGA I have running in my data centre?


You absolutely can use it in a data centre. You can even tape out an ASIC using these designs! Currently we've done most of our prototyping with edge FPGA platforms but if you want to try other platforms we'd love to help you get started. You can email me at tom@tensil.ai or use the contact methods on the website.


How do you guys plan to make $$$?

(Sorry in advance for helping me catch the elephant in the room!)


Congrats Tom! Can’t wait to have a use-case for this (soon!)


All: these guys did a Show HN yesterday at https://news.ycombinator.com/item?id=30615605 (there was a scheduling mixup on my part). I mention it here to (a) explain the dupe, for anyone who saw that thread; but also (b) to tell everyone that the discussion there was unusually high-quality, so you might want to check out those comments first.

Actually, maybe we should just merge that thread into this one. I'll double check if that makes sense.

Edit: ok, I've moved the comments in here now. Some of the times are messed up, but I think it makes more sense for the comments to be in one place so readers don't have to go back and forth. Sorry for any confusion!


Sweet!



