
Great write-up! It is a pleasure to see more people explore this area.

You can make it even leaner and more frugal, if you want.

Here is how we built a voice assistant box for the Bashkir language. It is currently deployed at ~10 kindergartens/schools:

1. Run speech recognition and speech generation on server CPU. You need just 3 cores (AMD/Intel) to get fast enough responses. The same goes for the SBERT embedding models (if your assistant needs to find songs, tales or other resources).
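The resource lookup in point 1 boils down to embedding the request and picking the nearest catalog entry. A toy sketch of that search step; the vectors here are made-up values standing in for real SBERT embeddings, which you would precompute offline:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_resource(query_vec, catalog):
    # catalog: list of (title, embedding) pairs, embeddings precomputed offline.
    return max(catalog, key=lambda item: cosine(query_vec, item[1]))[0]

# Toy embeddings; in practice these come from an SBERT model.
catalog = [
    ("lullaby", [0.9, 0.1, 0.0]),
    ("fairy tale", [0.1, 0.8, 0.2]),
]
print(find_resource([0.85, 0.2, 0.1], catalog))  # lullaby
```

With a few thousand catalog items this brute-force scan is plenty fast on CPU; you only need an ANN index at much larger scales.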

2. Use a SaaS LLM for prototyping (e.g. mistral.ai has Mistral Small and Mistral Medium available via API) or run LLMs on your server via llama.cpp. You'll need more than 3 cores, then.
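One nice property of point 2: the hosted API and a local llama.cpp server both speak the OpenAI-style chat-completions format, so switching between them is mostly a URL change. A minimal sketch that just builds the request (the URL, model name, and prompts are placeholders):

```python
import json
import urllib.request

def build_chat_request(url, model, system, user):
    # Standard chat-completions payload; the same shape works against
    # mistral.ai or a local llama.cpp `llama-server` instance.
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.3,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8080/v1/chat/completions",  # llama-server default
    "mistral-small-latest",
    "You are a kindergarten voice assistant. Answer in one short sentence.",
    "Tell me about hedgehogs.",
)
```

Sending it is then just `urllib.request.urlopen(req)` (plus an `Authorization` header for the hosted API).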

3. Use an ESP32-S3 for the voice box. It is powerful enough to run a wake-word model and connect to the server via web sockets.

4. If you want to shape responses in a specific format, review prompting guides (especially few-shot prompts) and consider constrained generation (e.g. the Microsoft Guidance framework). Normally, though, few-shot samples with a good prompt are enough to get stable responses from many local LLMs.
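To make point 4 concrete, a few-shot prompt just bakes example input/output pairs into the prompt so the model copies the format. A sketch with invented examples (the JSON command format here is purely illustrative):

```python
# Invented examples for illustration; each one shows the exact
# output format the model is expected to copy.
FEW_SHOT = [
    ("play a song about rain", '{"skill": "music", "query": "rain"}'),
    ("tell me a bedtime story", '{"skill": "tales", "query": "bedtime"}'),
]

def build_prompt(instruction):
    lines = ["Convert the request into a JSON command.", ""]
    for request, command in FEW_SHOT:
        lines.append(f"Request: {request}")
        lines.append(f"Command: {command}")
    lines.append(f"Request: {instruction}")
    lines.append("Command:")  # model continues from here
    return "\n".join(lines)

print(build_prompt("sing something about the sun"))
```

Two or three examples are usually enough; the win is that the model sees the target format literally instead of being told about it abstractly.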

NB: We built this for languages that aren't supported by the mainstream models, which involved a bit of fine-tuning and custom training. For mainstream languages like English, things are much easier.

This topic fascinates me (as do personal assistants that learn over time). I'm always glad to answer any questions!



Is there a more detailed write-up somewhere? I have llama.cpp on a server that I use via a web interface, but what would be the next steps to be able to talk to it? How do you actually connect speech recognition and wake-word on one side, to the server, to speech generation on the other side?


I'm not aware of any detailed write-ups. Mostly gathered information bit by bit.

On a high level, here is how it works for us:

0. When the voice assistant device (ESP32) starts, it establishes a web-socket connection to the server.

1. The ESP32 chip constantly runs wake-word detection (there is one provided out of the box by the ESP-IDF framework, by Espressif).

2. Whenever a wake-word is detected (we trained a custom one, but you can use the ones provided by Espressif), the chip starts sending audio packets to the backend via web sockets.

3. The backend collects audio frames until there is silence (using voice activity detection in Python). As soon as the instruction is over, it tells the device to stop listening and:
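The endpointing in step 3 can start as simple energy-based voice activity detection: declare the utterance over after N consecutive quiet frames. A toy sketch (real deployments often use webrtcvad or Silero VAD instead of a raw threshold):

```python
def utterance_finished(energies, threshold=0.01, silence_frames=30):
    # energies: per-frame RMS energy values, newest last.
    # The utterance is over once the last `silence_frames` frames
    # are all below the energy threshold.
    if len(energies) < silence_frames:
        return False
    return all(e < threshold for e in energies[-silence_frames:])

frames = [0.2] * 50 + [0.001] * 30   # speech, then silence
print(utterance_finished(frames))     # True
```

At 16 kHz with 20 ms frames, 30 quiet frames is roughly 0.6 s of silence, which feels about right for short voice commands.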

4. Passes all collected audio segments to speech recognition (Python with a custom wav2vec model). This gives us the text instruction.
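For step 4: wav2vec-style models emit one token per audio frame, which is then decoded with greedy CTC: take the argmax per frame, collapse consecutive repeats, drop blanks. A toy sketch of just that decode step (the frame IDs here are hand-picked, not real model output):

```python
def ctc_greedy_decode(frame_ids, blank=0, vocab=" abcdefghijklmnopqrstuvwxyz"):
    # Collapse consecutive repeats, then drop the blank token (id 0).
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# 8 -> 'h', 9 -> 'i'; repeats and blanks collapse away.
print(ctc_greedy_decode([8, 8, 0, 9, 9, 9]))  # hi
```

Libraries like Hugging Face transformers wrap this whole step for you, but it helps to know the decode is this simple when debugging garbled transcripts.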

5. Given a text instruction, you can trigger llama.cpp locally (or vLLM, if you have a GPU) or call a remote API. It all depends on the system. We have a chain of LLM pipelines and RAG that compose our "business logic" across a bunch of AI skills. What's important: there is a text response in the end.
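The "chain of pipelines across AI skills" in step 5 can start very plain: route the transcribed instruction to a matching skill, and fall back to a general LLM call otherwise. A hypothetical sketch (the skill names and the `llm_answer` stub are invented for illustration):

```python
def llm_answer(text):
    # Stand-in for the real LLM / RAG pipeline call.
    return f"(LLM) answering: {text}"

def route(instruction, skills):
    # skills: dict mapping a trigger keyword to a handler; first hit wins.
    text = instruction.lower()
    for keyword, handler in skills.items():
        if keyword in text:
            return handler(text)
    return llm_answer(text)  # no skill matched: fall back to the LLM

skills = {
    "song": lambda t: "playing a song",
    "story": lambda t: "reading a story",
}
print(route("Play a SONG about rain", skills))  # playing a song
```

Keyword routing sounds crude, but for a kids' device with a handful of skills it is predictable and debuggable; you can swap in an LLM-based intent classifier later without changing the shape of the code.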

6. Passes the text response to a text-to-speech model on the same machine and streams the output back to the edge device.
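The streaming in step 6 just means cutting the synthesized waveform into fixed-size packets as it is produced, so the device can start playback before synthesis finishes. Sketch (the chunk size is an arbitrary choice):

```python
def chunk_audio(pcm_bytes, chunk_size=1024):
    # Yield fixed-size packets suitable for a web-socket send loop;
    # the last packet may be shorter.
    for offset in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[offset:offset + chunk_size]

packets = list(chunk_audio(b"\x00" * 2500, chunk_size=1024))
print([len(p) for p in packets])  # [1024, 1024, 452]
```

In practice you'd pick the chunk size to match the device's audio buffer so the ESP32 never starves or overflows mid-sentence.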

7. The edge device (ESP32) speaks the words or plays an MP3 file from a URL you send it.
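Tying the steps above together, the backend's per-connection loop is roughly this shape (every heavy component is a stand-in lambda here; in reality they are the VAD, wav2vec, LLM pipeline and TTS described above):

```python
def handle_session(frames, vad, asr, brain, tts):
    # frames: iterable of incoming audio packets from the device.
    buffered = []
    for frame in frames:
        buffered.append(frame)
        if vad(buffered):            # silence detected -> utterance is over
            break
    text = asr(buffered)             # speech recognition
    reply = brain(text)              # LLM / RAG pipeline -> text response
    return tts(reply)                # audio to stream back to the device

# Stand-in components, purely for illustration:
audio = handle_session(
    frames=["a1", "a2", "quiet"],
    vad=lambda buf: buf[-1] == "quiet",
    asr=lambda buf: "play a song",
    brain=lambda t: "here is a song",
    tts=lambda r: "<audio:here is a song>",
)
print(audio)  # <audio:here is a song>
```

The real version runs inside an async web-socket handler, but the data flow per utterance is exactly this pipeline.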

Does this help?


Not OP, but amazing work, really really great! esp32-s3 are quite capable chips. Was it hard to train the custom wake-word?


Thanks!

Custom wake-word on a chip is a bit of a pain, so we run two models: one on the chip, and a second, more powerful one on the server that filters out false positives.
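That two-model setup is a cascade: the on-chip model is tuned for recall (cheap, permissive), the server model for precision, and a detection only counts if both fire. A sketch of the server-side confirmation, with made-up thresholds and a stub in place of the real model:

```python
def confirmed_wake(audio, on_chip_score, server_model, threshold=0.8):
    # Stage 1 already ran on the ESP32 and shipped its score along
    # with the audio; stage 2 re-scores the same audio on the server.
    if on_chip_score < 0.5:          # chip stage is deliberately permissive
        return False
    return server_model(audio) >= threshold

print(confirmed_wake(b"...", 0.6, server_model=lambda a: 0.95))  # True
print(confirmed_wake(b"...", 0.6, server_model=lambda a: 0.40))  # False
```

The nice property is that the chip can afford to be sloppy: a false positive only costs one short audio upload, not a spurious assistant response.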


> Does this help?

Yes, thank you! Great description. Will try! ;-)


Just ordered 2 ESP32-S3s. Any recommendations for a microphone? I guess that will still be the hardest part.


Go for an I2S MEMS microphone. Avoid analog microphones as they'll be very noisy and the ADCs on the ESP32 range are pretty rubbish.

You're pretty much limited to PDM microphones nowadays, though there are some PCM ones still knocking around. PCM mics are considerably cheaper.

Audio is well supported on the ESP32 and there are plenty of libraries and sample code out there.


My last experiments were with a Logitech camera as the mic; it worked kinda well but unreliably. Looking forward to the chips I've ordered.


We are using the INMP441. It works well with the ESP-IDF libraries shipped by Espressif.



