
Great write-up! It is a pleasure to see more people explore this area.

You can make it even leaner and more frugal, if you want.

Here is how we built a voice assistant box for the Bashkir language. It is currently deployed at ~10 kindergartens/schools:

1. Run speech recognition and speech generation on server CPU. You need just 3 cores (AMD/Intel) to get fast enough responses. The same goes for the SBERT embedding models (if your assistant needs to find songs, tales or other resources).
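The resource lookup in point 1 boils down to embedding the request and picking the nearest catalog entry. A toy sketch of that search step; the vectors here are made-up values standing in for real SBERT embeddings, which you would precompute offline:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_resource(query_vec, catalog):
    # catalog: list of (title, embedding) pairs, embeddings precomputed offline.
    return max(catalog, key=lambda item: cosine(query_vec, item[1]))[0]

# Toy embeddings; in practice these come from an SBERT model.
catalog = [
    ("lullaby", [0.9, 0.1, 0.0]),
    ("fairy tale", [0.1, 0.8, 0.2]),
]
print(find_resource([0.85, 0.2, 0.1], catalog))  # lullaby
```

With a few thousand catalog items this brute-force scan is plenty fast on CPU; you only need an ANN index at much larger scales.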

2. Use a SaaS LLM for prototyping (e.g. mistral.ai has Mistral Small and Mistral Medium available via API) or run LLMs on your server via llama.cpp. You'll need more than 3 cores, then.
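One nice property of point 2: the hosted API and a local llama.cpp server both speak the OpenAI-style chat-completions format, so switching between them is mostly a URL change. A minimal sketch that just builds the request (the URL, model name, and prompts are placeholders):

```python
import json
import urllib.request

def build_chat_request(url, model, system, user):
    # Standard chat-completions payload; the same shape works against
    # mistral.ai or a local llama.cpp `llama-server` instance.
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.3,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8080/v1/chat/completions",  # llama-server default
    "mistral-small-latest",
    "You are a kindergarten voice assistant. Answer in one short sentence.",
    "Tell me about hedgehogs.",
)
```

Sending it is then just `urllib.request.urlopen(req)` (plus an `Authorization` header for the hosted API).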

3. Use an ESP32-S3 for the voice box. It is powerful enough to run a wake-word model and connect to the server via web sockets.

4. If you want to shape responses in a specific format, review prompting guides (especially few-shot prompts) and consider constrained generation (e.g. the Microsoft Guidance framework). Normally, though, few-shot samples with a good prompt are enough to get stable responses from many local LLMs.
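To make point 4 concrete, a few-shot prompt just bakes example input/output pairs into the prompt so the model copies the format. A sketch with invented examples (the JSON command format here is purely illustrative):

```python
# Invented examples for illustration; each one shows the exact
# output format the model is expected to copy.
FEW_SHOT = [
    ("play a song about rain", '{"skill": "music", "query": "rain"}'),
    ("tell me a bedtime story", '{"skill": "tales", "query": "bedtime"}'),
]

def build_prompt(instruction):
    lines = ["Convert the request into a JSON command.", ""]
    for request, command in FEW_SHOT:
        lines.append(f"Request: {request}")
        lines.append(f"Command: {command}")
    lines.append(f"Request: {instruction}")
    lines.append("Command:")  # model continues from here
    return "\n".join(lines)

print(build_prompt("sing something about the sun"))
```

Two or three examples are usually enough; the win is that the model sees the target format literally instead of being told about it abstractly.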

NB: We built this for languages that aren't supported by the mainstream models, which involved a bit of fine-tuning and custom training. For mainstream languages like English, things are much easier.

This topic fascinates me (as do personal assistants that learn over time). I'm always glad to answer any questions!



Is there a more detailed write-up somewhere? I have llama.cpp on a server that I use via a web interface, but what would be the next steps to be able to talk to it? How do you actually connect speech recognition and wake-word on one side, to the server, to speech generation on the other side?


I'm not aware of any detailed write-ups. Mostly gathered information bit by bit.

On a high level, here is how it works for us:

0. When the voice assistant device (ESP32) starts, it establishes a web-socket connection to the server.

1. The ESP32 chip constantly runs wake-word detection (there is one provided out of the box by the ESP-IDF framework, by Espressif).

2. Whenever a wake-word is detected (we trained a custom one, but you can use the ones provided by Espressif), the chip starts sending audio packets to the backend via web sockets.

3. The backend collects audio frames until there is silence (using voice activity detection in Python). As soon as the instruction is over, it tells the device to stop listening and:
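The endpointing in step 3 can start as simple energy-based voice activity detection: declare the utterance over after N consecutive quiet frames. A toy sketch (real deployments often use webrtcvad or Silero VAD instead of a raw threshold):

```python
def utterance_finished(energies, threshold=0.01, silence_frames=30):
    # energies: per-frame RMS energy values, newest last.
    # The utterance is over once the last `silence_frames` frames
    # are all below the energy threshold.
    if len(energies) < silence_frames:
        return False
    return all(e < threshold for e in energies[-silence_frames:])

frames = [0.2] * 50 + [0.001] * 30   # speech, then silence
print(utterance_finished(frames))     # True
```

At 16 kHz with 20 ms frames, 30 quiet frames is roughly 0.6 s of silence, which feels about right for short voice commands.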

4. Passes all collected audio segments to speech recognition (Python with a custom wav2vec model). This gives us the text instruction.
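For step 4: wav2vec-style models emit one token per audio frame, which is then decoded with greedy CTC: take the argmax per frame, collapse consecutive repeats, drop blanks. A toy sketch of just that decode step (the frame IDs here are hand-picked, not real model output):

```python
def ctc_greedy_decode(frame_ids, blank=0, vocab=" abcdefghijklmnopqrstuvwxyz"):
    # Collapse consecutive repeats, then drop the blank token (id 0).
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# 8 -> 'h', 9 -> 'i'; repeats and blanks collapse away.
print(ctc_greedy_decode([8, 8, 0, 9, 9, 9]))  # hi
```

Libraries like Hugging Face transformers wrap this whole step for you, but it helps to know the decode is this simple when debugging garbled transcripts.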

5. Given a text instruction, you can trigger llama.cpp locally (or vLLM, if you have a GPU) or call a remote API. It all depends on the system. We have a chain of LLM pipelines and RAG that compose our "business logic" across a bunch of AI skills. What's important: there is a text response in the end.
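The "chain of pipelines across AI skills" in step 5 can start very plain: route the transcribed instruction to a matching skill, and fall back to a general LLM call otherwise. A hypothetical sketch (the skill names and the `llm_answer` stub are invented for illustration):

```python
def llm_answer(text):
    # Stand-in for the real LLM / RAG pipeline call.
    return f"(LLM) answering: {text}"

def route(instruction, skills):
    # skills: dict mapping a trigger keyword to a handler; first hit wins.
    text = instruction.lower()
    for keyword, handler in skills.items():
        if keyword in text:
            return handler(text)
    return llm_answer(text)  # no skill matched: fall back to the LLM

skills = {
    "song": lambda t: "playing a song",
    "story": lambda t: "reading a story",
}
print(route("Play a SONG about rain", skills))  # playing a song
```

Keyword routing sounds crude, but for a kids' device with a handful of skills it is predictable and debuggable; you can swap in an LLM-based intent classifier later without changing the shape of the code.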

6. Passes the text response to a text-to-speech model on the same machine and streams the output back to the edge device.
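The streaming in step 6 just means cutting the synthesized waveform into fixed-size packets as it is produced, so the device can start playback before synthesis finishes. Sketch (the chunk size is an arbitrary choice):

```python
def chunk_audio(pcm_bytes, chunk_size=1024):
    # Yield fixed-size packets suitable for a web-socket send loop;
    # the last packet may be shorter.
    for offset in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[offset:offset + chunk_size]

packets = list(chunk_audio(b"\x00" * 2500, chunk_size=1024))
print([len(p) for p in packets])  # [1024, 1024, 452]
```

In practice you'd pick the chunk size to match the device's audio buffer so the ESP32 never starves or overflows mid-sentence.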

7. The edge device (ESP32) speaks the words or plays an MP3 file from a URL you send it.
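Tying the steps above together, the backend's per-connection loop is roughly this shape (every heavy component is a stand-in lambda here; in reality they are the VAD, wav2vec, LLM pipeline and TTS described above):

```python
def handle_session(frames, vad, asr, brain, tts):
    # frames: iterable of incoming audio packets from the device.
    buffered = []
    for frame in frames:
        buffered.append(frame)
        if vad(buffered):            # silence detected -> utterance is over
            break
    text = asr(buffered)             # speech recognition
    reply = brain(text)              # LLM / RAG pipeline -> text response
    return tts(reply)                # audio to stream back to the device

# Stand-in components, purely for illustration:
audio = handle_session(
    frames=["a1", "a2", "quiet"],
    vad=lambda buf: buf[-1] == "quiet",
    asr=lambda buf: "play a song",
    brain=lambda t: "here is a song",
    tts=lambda r: "<audio:here is a song>",
)
print(audio)  # <audio:here is a song>
```

The real version runs inside an async web-socket handler, but the data flow per utterance is exactly this pipeline.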

Does this help?


Not OP, but amazing work, really really great! esp32-s3 are quite capable chips. Was it hard to train the custom wake-word?


Thanks!

Custom wake-word on a chip is a bit of a pain, so we run two models: one on the chip, and a second, more powerful one on the server that filters out false positives.
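That two-model setup is a cascade: the on-chip model is tuned for recall (cheap, permissive), the server model for precision, and a detection only counts if both fire. A sketch of the server-side confirmation, with made-up thresholds and a stub in place of the real model:

```python
def confirmed_wake(audio, on_chip_score, server_model, threshold=0.8):
    # Stage 1 already ran on the ESP32 and shipped its score along
    # with the audio; stage 2 re-scores the same audio on the server.
    if on_chip_score < 0.5:          # chip stage is deliberately permissive
        return False
    return server_model(audio) >= threshold

print(confirmed_wake(b"...", 0.6, server_model=lambda a: 0.95))  # True
print(confirmed_wake(b"...", 0.6, server_model=lambda a: 0.40))  # False
```

The nice property is that the chip can afford to be sloppy: a false positive only costs one short audio upload, not a spurious assistant response.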


> Does this help?

Yes, thank you! Great description. Will try! ;-)


Just ordered 2 ESP32-S3s. Any recommendations for a microphone? I guess that will still be the hardest part.


Go for an I2S MEMS microphone. Avoid analog microphones as they'll be very noisy and the ADCs on the ESP32 range are pretty rubbish.

You're pretty much limited to PDM microphones nowadays, though there are some PCM ones still knocking around. PCM mics are considerably cheaper.

Audio is well supported on the ESP32 and there are plenty of libraries and sample code out there.


My last experiments were with a Logitech camera as the mic; it worked kinda well but unreliably. Looking forward to the chips I've ordered.


We are using the INMP441. It works well with the ESP-IDF libraries shipped by Espressif.



