Offline voice AI within 512 KB of RAM [video] (youtube.com)
142 points by kenarsa on Dec 26, 2018 | 40 comments



"Any sufficiently advanced technology is indistinguishable from a rigged demo." — James Klass

In particular, voice AI systems are by their nature not very discoverable, which makes it hard to assess them quickly. What the video shows is a system that can recognize some input sentences and speak some output sentences. It does not show the size and flexibility of the command set, and it only gives a vague impression of the speech output quality and recognition accuracy.

Technically, you could reproduce this demo with the system software that shipped on a Mac Quadra 840AV in 1993.


This is a really good point. That is why we partially open-sourced our technology to enable unbiased third-party evaluation. You can run the exact same demo on a Linux box or a Raspberry Pi (any variant) using what's available in the project's GitHub repository here: https://github.com/Picovoice/rhino

We are in the process of open-sourcing a statistically significant benchmark for this tech, but this will happen in 2019.


I would love to try and play with this to recognize non-English speech (Lojban). How can I add/train voice samples for this?


Thanks for your commitment to transparency!


I suspect that within the next two years, as this type of thing (voice recognition, voice synthesis, image recognition) becomes ubiquitous, it will cease to be called AI and the bar will be raised further. Very soon, applications such as Duplex [1] will become the minimum bar for AI.

[1] https://ai.googleblog.com/2018/05/duplex-ai-system-for-natur...


I hope so. But investors are so hot to invest in "the future" that plain Markov chains currently get called "AI"; that will probably last for another quarter or two.

It's not as vaporware-y as blockchain, but I still see some potential for the bubble to pop. At some point people will hopefully realize that not everything that can be done with a neural network needs to be done with one (and tons of data).


It looks like you drew the exact opposite inference from the parent than I did: it's not about vaporware, it's about people's impossible standards for what counts as "artificial intelligence".

The neural network stuff is anything but vaporware; it has delivered incredible results. But people keep coming up with silly dismissals along the lines of it not being "real AI".


> but people keep coming up with silly dismissals along the lines of it not being "real AI".

That is not what I meant at all. For me, it is the nature of the field and has been that way since the days I first learned it (the 1990s). Once a reproducible algorithm or methodology is discovered to solve an AI problem, it generally ceases to be an AI problem.


Yeah, I actually agree with that. I wonder why you think our statements were incompatible, because that is the sort of thing I would have described as a "silly dismissal". The problem was clearly an AI problem right up until it's solved, and then suddenly it's "not real AI" and people talk about AI being overhyped or whatever.

Imagine if people treated programming languages like that. People would get excited about the idea of communicating with a computer, and then when you finally build Python and show it to them, they say "but that's just parsing, what about a real language?" The bar just keeps rising whenever you get close to it. That's the sort of thing I meant by "silly dismissal".


There's an old saying that AI is AI until it's implemented, and then it's just algorithms.


So admittedly I know next to nothing about CNNs and such, but AFAIK, isn't training the difficult, resource-intensive part of CNNs? Once you've figured out the coefficients, I think you can implement the pattern matching with an order of magnitude fewer resources, or less. You don't even need high-precision arithmetic, right?


Yes, and it's a very important property of CNNs. You can do the hard computational work of training in the data center, but get results from inferencing at the edge. In a sense, you're using the trained network as an energy storage system, shifting energy requirements away from power- and cost-constrained devices.

Inferencing is also highly amenable to hardware optimisation, which we're starting to see in the latest flagship mobile SoCs. I expect to see low-cost microcontrollers with inference accelerators within the next couple of years.
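
To make the low-precision point concrete, here is a minimal NumPy sketch of 8-bit post-training quantization (made-up sizes and a naive per-tensor scale, not any particular framework's scheme):

    import numpy as np

    # Float weights learned in the data center (random stand-ins here).
    w = np.random.randn(256, 256).astype(np.float32)
    x = np.random.randn(256).astype(np.float32)

    # Quantize weights and activations to int8 with per-tensor scales.
    w_scale = np.abs(w).max() / 127.0
    x_scale = np.abs(x).max() / 127.0
    w_i8 = np.round(w / w_scale).astype(np.int8)   # 4x smaller than f32
    x_i8 = np.round(x / x_scale).astype(np.int8)

    # Edge inference: integer multiply-accumulate, one rescale at the end.
    acc = w_i8.astype(np.int32) @ x_i8.astype(np.int32)
    y = acc * (w_scale * x_scale)

    # Close to the float result, at a quarter of the weight memory.
    print(np.max(np.abs(y - w @ x)))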


Correct, but most neural networks that do this sort of thing have tens of millions of parameters. Doing this in only 150 KB is extremely memory-efficient, if not exactly high fidelity.


You are absolutely right. Compression (using this term for lack of better terminology) is extremely important for power-efficient applications, as one of the main power draws on a device is external RAM.

We had to come up with a bunch of ideas for fitting our stack into the on-chip RAM (512 KB) while leaving enough for the OS and the actual application.


This is excellent work. Is there a catch? Limited vocabulary versus open domain? Noise or accent sensitive? How large is the model file? What's the accuracy? It's amazing you were able to get anything beyond keyword spotting on such a relatively low power computer.


Thank you. The engine is definitely domain-specific. This applies to the vocabulary and also to inference. For example, if you want to use the tech for smart lighting, you would need a different model/context.

The demo is done with some noise, and there is some reverberation as well. The speakers are also somewhat accented. That being said, we would like to open-source a benchmark for this (similar to our other products). The comment on accuracy is a bit tricky, as it would depend on the parameters you mentioned and the specific task. I will provide more information when we open-source the benchmark in Q1 2019.


Do you have some details on how you have done this?


Yes, we will publish an article about our speech-to-intent engine and add a link to it on our website. This should happen before the new year.

You can find some information about the wake-word engine here: https://medium.com/@alirezakenarsarianhari/yet-another-wake-...


When we get voice-activated doodads that don't have to send my voice to the mothership, I might finally get one.


Still have to shout "GoPro, take a photo" in a crowded location, drawing attention. Yikes, no thanks!

I'll be interested when 'reading' thoughts without phoning home becomes a reality.


That's a great opportunity to use a throat microphone. Unfortunately, GoPros don't have configurable audio input, AFAIK.


How does this compare to approaches like the one presented by MSR at NeurIPS this year, which also does keyword detection on a Pi Zero or, impressively, an M4F?

https://dkdennis.xyz/


I don't know about your specific case, but a year ago I was able to stitch together CMU Sphinx, a Pi Zero, a Bluetooth headset, and Google voice recognition into a working keyword detection + voice-to-text system (for Italian).


Thanks a lot for the link. I'll be sure to look at it in more detail.

Keyword spotting is one of the modules we run in this demo; that's how we detect "Hey Barista". We also run an engine we call "Speech-to-Intent" that infers the user's request from the follow-up command.

One thing I wanted to mention is that there are two challenges when running DNNs on embedded platforms: (1) limited compute power (CPU) and (2) limited memory (RAM). The RPi Zero is definitely bound by (1) but not (2), as you get 512 MB of RAM on it.
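
To give a rough feel for what Speech-to-Intent does, here is a toy text-based sketch (illustrative only — this is not our actual context format or API, and the real engine infers intent directly from audio rather than from a transcript):

    # Toy "context" for a coffee domain: intents plus slot vocabularies.
    CONTEXT = {
        "orderDrink": {
            "size": ["small", "medium", "large"],
            "drink": ["espresso", "latte", "americano"],
        },
    }

    def to_intent(transcript):
        """Map a follow-up command to (intent, slots) within the domain."""
        words = transcript.lower().split()
        for intent, slot_defs in CONTEXT.items():
            slots = {slot: word
                     for slot, values in slot_defs.items()
                     for word in words if word in values}
            if len(slots) == len(slot_defs):  # every slot was filled
                return intent, slots
        return None, {}                       # out of domain: not understood

    print(to_intent("can I have a large latte please"))
    # -> ('orderDrink', {'size': 'large', 'drink': 'latte'})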


See also uTensor (https://github.com/uTensor/uTensor), which scales down TensorFlow-trained models to something you can run on a microcontroller. The project is currently working on hardware acceleration for MCUs with DSP extensions.


Makes me wonder how much AI we could fit into a ZX Spectrum, an old Amiga, or an 80386...


Not much. Even a tiny ARM Cortex-M4, which could live in a hearing aid for a week on a battery, is often only 64 MHz with single-cycle MACs.

The Z80, I believe, took something like 4 cycles per instruction at only a few megahertz, so we would be talking at least 16x slower than an M4. You would need something between a 486 and a Pentium to get close to the M4, and even further beyond that to get to the M7. If I remember correctly, you couldn't even decode MP3s in real time until the faster 90+ MHz 486s.
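
Back-of-envelope, taking those figures roughly at face value (a Spectrum-era Z80 actually clocked about 3.5 MHz):

    # Rough ops/sec comparison; real numbers vary a lot with instruction mix.
    m4_macs_per_sec = 64e6          # 64 MHz, single-cycle multiply-accumulate
    z80_ops_per_sec = 3.5e6 / 4     # ~3.5 MHz, ~4 cycles per instruction
    print(m4_macs_per_sec / z80_ops_per_sec)   # ~73x

    # And the Z80 has no hardware multiply at all, so one software MAC
    # costs dozens of cycles; the real gap is far larger than 73x.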


Maybe it's unimportant, but I once compiled amp on an old SGI Indigo Elan: a 30 MHz MIPS R4k-something...

It would decode MP3s up to 256 kbps, over NFS, at 95 percent CPU utilization.

I was quite surprised.


These things are slow (by modern standards), yeah, but it was possible to get the model this far down ... how far can we go?


There must be a way to prune the NN with some reduction in quality. Probably a means to trade quality away until it fits on an Arduino and can still do something like the video.

There was speech synthesis on the Apple IIe in something like 64 KB of RAM with a 1 MHz CPU.


> There was speech synthesis on the Apple IIe in something like 64 KB of RAM with a 1 MHz CPU.

Speech synthesis is far, far easier than parsing a human voice, especially when it doesn't need to sound realistic (as was the case back then).


The current approach to speech recognition is about as old (maximum-likelihood decoding using WFSTs).

In the old papers on the problem, the vocabulary size was 64K words, because nothing worked for bigger vocabularies.


Pruning is a great way to reduce memory usage. One thing to be careful with is that pruned matrices use irregular memory access, and they might be slower, as we don't have SIMD support for sparse matrix multiplication on generic CPUs (e.g. ARM) yet.


You may decompose the matrix for each layer into two matrices with a bottleneck: e.g., A ≈ BC, where A is an MxN matrix and B and C are MxK and KxN matrices, with K small enough.

This is done using SVD and improves speed without sacrificing memory access patterns.
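
A minimal NumPy sketch of that low-rank factorization (made-up layer sizes):

    import numpy as np

    M, N, K = 512, 512, 64
    A = np.random.randn(M, N).astype(np.float32)   # original layer weights

    # Truncated SVD: keep only the K largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U[:, :K] * s[:K]    # M x K
    C = Vt[:K, :]           # K x N

    # Parameter count drops from M*N to K*(M+N)...
    print(M * N, "->", K * (M + N))   # 262144 -> 65536

    # ...and inference stays two dense matmuls with regular access patterns.
    x = np.random.randn(N).astype(np.float32)
    y_approx = B @ (C @ x)            # approximates A @ x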


Considering the 80386 needed two clock cycles for just an INC or ADD, let alone the 12-27 needed for an IMUL, IDIV, or memory fetch, I'd say the 80386 would be a bad target. While faster than most of its contemporaries, the 386 was a rather poorly designed CPU.


I'd love to tinker with this for my own home automation projects, but I couldn't find any indication of whether it would be available and affordable for such a use case. Any clues as to licensing fees for private/non-profit use? The offline aspect especially makes it very attractive, not only from a privacy standpoint but also from a reliability one.


Hello. This is Alireza. I am the founder of Picovoice.

I totally understand the need to support the maker community. We do have GitHub repositories for the engines demoed here, which allow you to use these technologies to some extent (not the full set of capabilities). I am working with our partners (both SoC and distribution) to come up with a maker-specific product for evaluation and personal use. It will most probably be a HW/SW product (i.e. a board that comes with our software). The product should allow you to use the full set of features on that specific board. I expect this to happen in 2019, and I will disclose the information as I figure things out.


I would love to be able to use this for personal use. I have RSI, and WSR and Vocola 3 work OK-ish for programming, for the most part. But they're locked to my specific system. I would really like something embedded that I could take to any computer, plug in or connect over WiFi, and dictate with decent accuracy.


This device is relatively powerful. I'm able to run Fedora with Wayland on a similar device (i.MX6SL-EVK). DNF is slow the first time, so I need to wait 20 minutes until all the data is parsed, but then it works fine.


The device you are referring to is quite different. I take it this is the board you are using?

https://www.arrow.com/en/reference-designs/imx6slevk-imx-6so...

It has an ARM Cortex-A9 with the NEON extension instead of an ARM Cortex-M7. It is basically a different family of i.MX processors.



