
Moonshine author here: The warning is from the Keras library and is benign. If you didn't get any other output, it was probably because the model thought there was no speech (not saying there really was no speech). We uploaded an ONNX version that is considerably faster than the Torch/JAX/TF versions and is usable with less package bloat. I hope you'll give it another shot.


This is for speech to text, so generating text, not audio. And on a $120-$170 device, this transcribes at 30x real time. The code does run on lower-end Rockchip processors, costing ~$30, although only at 10x real-time speed.


Sorry for the confusing phrasing about STT vs TTS. I'm not familiar with cases where you would use something like this 'at the edge' instead of, say, a laptop. I was thinking maybe some sort of offline setup with a microphone -- but in that case the audio is just real-time. Do you have some use cases in mind?

1/4 of the price for 1/3 of the speed is a good deal! Presumably still faster than faster-whisper on the same hardware?


This enables a true natural-language voice interface to any device/appliance that currently has a touch pad or a bunch of buttons. Yes, faster than faster-whisper, but that's really an apples-to-oranges comparison, because useful-transformers uses the NPU on the Rockchip processors, so it works only on those, whereas faster-whisper works fast on most/all platforms.


The tiny.en Whisper model transcribes speech at 30x real-time speeds on an Orange Pi 5.


I met Bram in the Google Mountain View office. We chatted for over two hours. He was full of humility and curiosity and more interested in what I was working on (I was working on DistBelief back then). Hats off to a life full of impact and a legacy that will continue to impact programmers all over the world. RIP.


> How is this programmed?

Full disclosure: I am a Cerebras employee.

There is extensive support for TensorFlow. A wide range of models expressed in TensorFlow will be accelerated transparently.


He was asking about the implications for yields. Do you route around bad dies/cores, and what are the implications for programming and performance?

For everyone else: normally a wafer is divided into dies, each of which (loosely) is a chip. Yield is the percentage of good parts, and it's very unlikely that an entire wafer is good. Gene Amdahl estimated that 99.99% yield is needed for successful wafer-scale integration:

https://en.wikipedia.org/wiki/Wafer-scale_integration


> For example, the typical 300mm wafer from TSMC may contain “a modest hundred number of flaws,” said Feldman. Cerebras gave its Swarm interconnect redundant links to route around defective tiles and allocated “a little over 1% [of the tiles] as spares.”

https://www.eetimes.com/document.asp?doc_id=1335043&page_num...


Looking at the whitepaper, I'm a little surprised how little RAM there is for such an enormous chip. Is the overall paradigm here that you still have relatively small minibatches during training, but each minibatch is now vastly faster?


IIRC they use batch size = 1 and each core only knows about one layer. Which is to say this thing has to be trained very differently from normal SGD (but requires very little memory). There is also the issue that they rely on sparseness, which you get with ReLU activations, but if, for example, language models move to GELU activations, they will be somewhat screwed.


It's because it's SRAM, not DRAM. Think how much L3 cache your processor has. A few MB probably. That's what this chip's memory is equivalent to.


We have up to 160 GB of SRAM on our WSI. The rest of the transistors can be a few million cores or reconfigurable Morphle Logic (an open-hardware kind of FPGA).

Our startup has been working on a full Wafer Scale Integration since 2008. We are searching for cofounders. Merik at metamorphresearch dot org


“full utilization at any batch size, including batch size 1”

https://www.cerebras.net/


That doesn't really mean anything. It (and any other chip) had better be able to run at least batch size 1, and lots of people claim to have great utilization... It doesn't tell me if the limited memory is part of a deliberate tradeoff akin to a throughput/latency tradeoff, or some intrinsic problem with the speedups coming from other design decisions like the sparsity multipliers, or what.


Most of the chip is already SRAM; I'm not really sure what else you would expect?

18 GiB × 8 bits/byte × 6 transistors/bit ≈ 0.93 trillion transistors


Well, it could be... not SRAM? It's not the only kind of RAM, and the choice to use SRAM is certainly not an obvious one. It could make sense as part of a specific paradigm, but that is not explained, and hence why I am asking. It may be perfectly obvious to you, but it's not to me.


You basically have the option between SRAM, HBM (DRAM), and something new. You can imagine the risks with using new memory tech on a chip like this.

The issue with HBM is that it's much slower, much more power hungry (per access, not per byte), and not local (so there are routing problems). You can't scale that to this much compute.


But HBM and other RAMs are, presumably, vastly cheaper otherwise. (You can keep explaining that, but unless you work for Cerebras and haven't thought to mention that, talking about how SRAM is faster is not actually an answer to my question about what paradigm Cerebras intends.)


They say they support efficient execution of smaller batches. They cover this somewhat in their Hot Chips talk, e.g. “One instance of NN, don't have to increase batch size to get cluster scale perf” from the AnandTech coverage.

If this doesn't answer your question, I'm stuck as to what you're asking about. They use SRAM because it's the only tried and true option that works. Lots of SRAM means efficient execution of small batch sizes. If your problem fits, good, this chip works for you, and probably easily outperforms a cluster of 50 GPUs. If your problem doesn't, presumably you should just use something else.


Do you support MATLAB or GNU Octave? I'm looking for the level of abstraction below TensorFlow because I find pure matrix math to be more approachable. Admittedly, I'm not super experienced with TF, so maybe it can encapsulate them.

Also, do you have a runtime to run the chip as a single 400,000-core CPU with some kind of memory-mapped I/O, so that a single 32- or 64-bit address space writes through to the RAM, routed through virtual memory? I'm hoping to build a powerful Erlang/Elixir or Go machine so I can experiment with other learning algorithms in real time, outside the constraints of SIMD-optimized approaches like neural nets. Another option would be 400,000 virtual machines in a cluster, each running a lightweight Unix/Linux (maybe Debian or something like that). Here is some background on what I'm hoping for:

https://news.ycombinator.com/item?id=20601699

See my other comments for more. I've been looking for a parallel machine like this since I learned about FPGAs in the late '90s, but so far have not had much success finding any.


So why are you not publishing benchmarks against nvidia?


Cerebras is an MLPerf member, so they will publish MLPerf numbers some day and then we will talk.


They have probably run the benchmark (I'd guess many times, and not only against Nvidia), yet it is not in the white paper.

I was an SE at a hardware company, and it is the first thing you do as a product manager.


What is an SE?


Software engineer.


How do you achieve this? TensorFlow does not support OpenCL.


I'm sure they wrote a new backend for TensorFlow that targets their API. Since the hardware is only for ML, it wouldn't make sense for them to bother trying to implement OpenCL.


Here is the C++11 way of specifying alignment:

    alignas(T) char buffer[sizeof(T)];
Or better still, just use the unrestricted union feature:

    union {
      T t;
    };
Just use &t in place of buffer. (Note that if T has a non-trivial default constructor or destructor, the anonymous form above won't compile; give the union a name and an empty user-provided constructor and destructor so that t is left unconstructed.)
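
To make the usage concrete, here is a minimal sketch (my own illustration, with a stand-in Widget type) of how the aligned buffer is typically used: construct the object with placement new and destroy it with an explicit destructor call, since nothing is constructed or destroyed automatically:

    #include <new>
    #include <string>

    struct Widget {
        explicit Widget(std::string n) : name(std::move(n)) {}
        std::string name;
    };

    int main() {
        alignas(Widget) char buffer[sizeof(Widget)];  // raw, correctly aligned storage

        Widget* w = new (buffer) Widget("example");   // construct in place
        // ... use *w ...
        w->~Widget();                                 // destroy explicitly; no delete
        return 0;
    }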


Thanks for the tip --- alignas slipped my mind. The compiler I use for most of my work doesn't yet support alignas (or unrestricted unions). I believe a VC++-specific equivalent for the former trick would be __declspec(align(__alignof(T))).


Yeah, you are right, alignas is still unsupported in many compilers; clang 3.2 supports it. However, unrestricted unions have been supported in many compilers for a long time now. I would use that in this case, since you already know the type 'T'.


I don't think std::function does any dynamic allocation when you initialize it with simple functions. When you initialize it with function objects or lambda expressions, most, if not all, implementations use the small object optimization to avoid dynamic allocation when possible.


With function objects the GNU std::function does do dynamic allocation. Even for objects as small as an empty struct.
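
If you want to check this on your own toolchain, here's a minimal sketch (my own, not taken from libstdc++ internals) that counts calls to the global operator new while constructing a few std::function objects; the exact counts you see will depend on the standard library and version:

    #include <cstdio>
    #include <cstdlib>
    #include <functional>
    #include <new>

    static std::size_t g_allocs = 0;

    // Count every heap allocation made through the global operator new.
    void* operator new(std::size_t n) {
        ++g_allocs;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void* p) noexcept { std::free(p); }

    int free_fn(int x) { return x + 1; }

    int main() {
        std::size_t before = g_allocs;
        std::function<int(int)> f1 = free_fn;                      // plain function pointer
        std::printf("function pointer  : %zu allocation(s)\n", g_allocs - before);

        before = g_allocs;
        std::function<int(int)> f2 = [](int x) { return x + 1; };  // captureless lambda
        std::printf("captureless lambda: %zu allocation(s)\n", g_allocs - before);

        char big[64] = {};
        before = g_allocs;
        std::function<int(int)> f3 = [big](int x) { return x + big[0]; };  // 64-byte capture
        std::printf("64-byte capture   : %zu allocation(s)\n", g_allocs - before);
        return 0;
    }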


You are partly right. The frontend of the compiler that translates the CUDA dialect of C++ to LLVM IR is not open source.


The title is not 100% accurate. CUDA has come to mean the tool chain and ecosystem for GPGPU programming. Part of the system is the particular dialect of C++ in which a programmer can mix and match CPU and GPU code. It also includes the compiler that translates the GPU part to object code (the ISA is called PTX [1] which is one level removed from the actual GPU's ISA). What NVIDIA has open-sourced is the part that translates LLVM IR to PTX. The greatest benefit of this will be for people who are developing alternate programming models/DSLs for programming GPUs. They can translate their DSL to LLVM IR, which they probably are already doing, and then generate PTX using the open-source compiler.

[1] http://en.wikipedia.org/wiki/Parallel_Thread_Execution


Yeah, I would call this "CUDA compiler backend" rather than "CUDA compiler". Still a nice release, though.


A somewhat similar project: https://github.com/copperhead/copperhead

