What an odd board. The ESP32-S3 is an absolute powerhouse by itself. I really don't see why you would add another (probably pricey) MCU to serve as the master.
Somewhat shameless but relevant plug - we use the ESP32-S3 for Willow[0]. The dual core S3 and "high speed" external PSRAM are game changers.
What it is capable of (especially considering the price point) is nothing short of incredible: wake word activation, audio processing (AGC, AEC, etc.), audio streaming, even on-device speech recognition for up to 400 commands with Multinet. All remarkably performant, easily besting Alexa/Echo in terms of interactivity and response time (even when using an inference server across the internet for speech recognition).
Sure, we're down in ESP-IDF land and managing everything we have going on in FreeRTOS is a pain, but that's nothing you wouldn't have on any microcontroller. We're also doing a lot, all things considered, and generally speaking we "just" pin audio tasks (with varying priority) to core 1 while more or less dumping everything else on core 0. Seems to be working well so far!
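For anyone curious, the split looks roughly like this under ESP-IDF. The task names, stack sizes, and priorities below are made up for illustration; xTaskCreatePinnedToCore is the standard FreeRTOS/ESP-IDF call we lean on:

```cpp
// Sketch: audio pinned to core 1, everything else on core 0 (ESP32-S3, ESP-IDF).
// Task bodies, names, stack sizes, and priorities are illustrative only.
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static void audio_task(void *arg)        // placeholder: I2S capture, AEC/AGC, wake word
{
    for (;;) {
        // ...process one frame of audio...
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}

static void housekeeping_task(void *arg) // placeholder: Wi-Fi, HTTP, UI, logging
{
    for (;;) {
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

extern "C" void app_main(void)
{
    // High-priority audio work pinned to core 1...
    xTaskCreatePinnedToCore(audio_task, "audio", 8192, NULL,
                            configMAX_PRIORITIES - 2, NULL, 1);
    // ...and more or less everything else dumped on core 0 at lower priority.
    xTaskCreatePinnedToCore(housekeeping_task, "misc", 4096, NULL,
                            tskIDLE_PRIORITY + 2, NULL, 0);
}
```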
Interesting, this is yet another open source project that relies on a proprietary wake word model. Why are there no open source wake word engines like there are for speech recognition?
Short answer: it’s a tiny niche that is still very difficult, expensive, and time consuming. This isn’t Whisper or an LLM that has applications in practically anything. You’re not going to see OpenAI, Meta, etc. release a microcontroller-optimized wake word implementation because literally no one cares (by comparison). Yet it requires a not entirely dissimilar level of effort. Someone else linked to the Espressif training dataset requirements.
That said, wake word comes up more than I could have ever imagined. Everyone has an opinion and exactly zero of them have any idea what they are talking about (that I’ve encountered). It’s getting very, very old and I’ve all but given up engaging on it because it always ends with the classic “Well couldn’t you just…”.
Yes, in the 15 seconds it took you to write that comment you came up with something no one in the field has ever thought of before.
If you can create a completely open source wake implementation that gets even remotely close to the performance and reliability of those from Espressif, etc (while running on a microcontroller) we would be thrilled to use it. You will have created the first of its kind, and you’ll be famous!
The more likely outcome (as you dismissively said) will be “yet another open source project” that goes in the graveyard of completely unusable open source wake word implementations. There are plenty. As you note - all of them.
In case anyone with practical skills in ML+deep learning (but not in audio or embedded) wants to tackle this project, I am willing to mentor on those things. Microcontroller-level audio ML is my speciality.
One can find me in the Sound of AI Slack. Ping me there and we can create a channel etc.
It will be a multi-month endeavour though. Getting to PoC level is quite quick, but then getting robust performance in nearfield / high SNR cases across diverse background noise is a lot more work. And then tackling low SNR and far-field like with Alexa is yet another level.
Model-architecture-wise the problem is rather well understood; there are several good papers available from ARM etc.
Deployment infrastructure is also pretty good these days, for example with TensorFlow Lite Micro.
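For a flavour of what the TensorFlow Lite Micro side looks like, here is a rough sketch of a keyword-spotting inference pass. Exact header paths and the interpreter constructor vary between TFLM releases, and g_kws_model_data plus the op list are placeholders for whatever model you actually train:

```cpp
// Rough shape of a keyword-spotting inference pass with TensorFlow Lite Micro.
// g_kws_model_data is a placeholder flatbuffer; the registered ops must match
// whatever your trained model actually uses.
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_kws_model_data[];  // placeholder model blob

constexpr int kArenaSize = 30 * 1024;           // sized per model, by trial
static uint8_t tensor_arena[kArenaSize];

void run_kws_once(const int8_t *features, int num_features)
{
    const tflite::Model *model = tflite::GetModel(g_kws_model_data);

    // Register only the ops the model needs (typical DS-CNN style KWS model).
    static tflite::MicroMutableOpResolver<4> resolver;
    resolver.AddConv2D();
    resolver.AddDepthwiseConv2D();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();

    // In real code you'd construct and allocate once at startup, not per call.
    static tflite::MicroInterpreter interpreter(model, resolver,
                                                tensor_arena, kArenaSize);
    interpreter.AllocateTensors();

    // Copy one window of audio features (e.g. MFCCs) into the input tensor.
    TfLiteTensor *input = interpreter.input(0);
    for (int i = 0; i < num_features; ++i) input->data.int8[i] = features[i];

    interpreter.Invoke();

    // Output is a score per keyword class; thresholding/smoothing goes here.
    TfLiteTensor *output = interpreter.output(0);
    (void)output;
}
```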
In addition to ML skills, the project would need someone that is good at organizing volunteer outreach, in order to build a good sized dataset. The Espressif docs are a reasonable spec for something quite good on the voice side.
But then we would also need a good dataset of background noise.
Of course those with embedded/microcontroller skills are also very welcome.
Modularity and separation of concerns, in hardware.
Historically, you wanted your PIC or 8051 to be in complete control of the system. You built the core of your RTOS around some well-understood central processor, but farmed out tasks: driving a 16x2 character LCD over an SPI-like interface, bridging your UART to USB serial with an FTDI chip, decoding MP3 bitstreams into I2S or raw DAC signals, or what have you. SoCs added on-chip peripherals for some functions, but a lot of stuff was off-chip, like in this Arduino.
You wouldn't run code on an FTDI chip; that's an inversion of the architecture, and back in the day it was an ASIC that did nothing but convert RS232 into USB packets, so there was no way to run code on it.
The ESP32 is what it is because decoding 802.11 RF signals and running a TCP/IP stack is now a task not for an ASIC but for a generic microcontroller that's fast enough to do it in software. The processor is so much more powerful than the ancient Atmel ATmega328 in an Arduino Duemilanove (32 KB flash, 2 KB SRAM, an 8-bit bus, at up to a whopping 20 MHz) that it seems ridiculous to do anything at all with such a 'master' processor.
Honestly, I think it's a good choice; I can absolutely see folks running a small web server or something on the ESP32 and putting their realtime code on the main MCU, for instance.
When I did electronics for a psychology department, nearly everything had to do hard real-time sensing and still be plug-and-play with Windows and macOS. We ended up using a lot of Arduinos, but having a web server on board would have been so much nicer than serial over USB or transferring from SD cards.
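For what it's worth, a bare-bones endpoint with ESP-IDF's esp_http_server is only a couple dozen lines once Wi-Fi is up; the /status route and JSON payload here are just illustrative:

```cpp
// Minimal ESP-IDF HTTP endpoint, assuming Wi-Fi/netif is already initialized
// elsewhere. The "/status" route and its payload are invented for illustration.
#include "esp_http_server.h"

static esp_err_t status_get_handler(httpd_req_t *req)
{
    // In a real instrument you'd format live sensor state here.
    httpd_resp_set_type(req, "application/json");
    return httpd_resp_send(req, "{\"ok\":true}", HTTPD_RESP_USE_STRLEN);
}

httpd_handle_t start_webserver(void)
{
    httpd_handle_t server = NULL;
    httpd_config_t config = HTTPD_DEFAULT_CONFIG();

    if (httpd_start(&server, &config) == ESP_OK) {
        httpd_uri_t status_uri = {};
        status_uri.uri      = "/status";
        status_uri.method   = HTTP_GET;
        status_uri.handler  = status_get_handler;
        status_uri.user_ctx = NULL;
        httpd_register_uri_handler(server, &status_uri);
    }
    return server;
}
```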
I think this depends on how the ICs are connected. If it's running something like I2C between them, then your data transfer rates will be too low and you'll be quite limited in some applications.
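Rough back-of-envelope to make that concrete (ignoring addressing and other protocol overhead, which only makes it worse):

```cpp
// Back-of-envelope: fast-mode I2C vs. one channel of voice-grade audio.
// Ignores addressing and per-transaction overhead beyond the 9th clock per byte.
#include <cstdio>

int main()
{
    constexpr double i2c_clock_hz    = 400000.0;            // I2C fast mode
    constexpr double i2c_bytes_per_s = i2c_clock_hz / 9.0;  // 8 data bits + ACK
    constexpr double audio_bytes_per_s = 16000.0 * 2.0;     // 16 kHz, 16-bit mono

    std::printf("I2C fast-mode ceiling:    ~%.1f kB/s\n", i2c_bytes_per_s / 1000.0);
    std::printf("16-bit/16 kHz mono audio:  %.1f kB/s\n", audio_bytes_per_s / 1000.0);
    // ~44 kB/s vs 32 kB/s: a single audio stream nearly saturates the bus
    // before any protocol overhead, so bandwidth-hungry links want SPI/UART.
    return 0;
}
```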
For an Arduino board, voltage and pin compatibility is an absolute hard requirement. (The only reason you use Arduino is to use Arduino software or existing peripherals/shields.) The ESP32 has fewer pins and is a 3.3 V part.
Yeah, I go for the S3 when I need networking on one core while still running realtime applications on the other, but it's the priciest one.
I would probably have paired a lower-cost -C family chip (the RISC-V one, for extra nerd cred) with the Arduino, since those have only one core, which would have made for a good split.