I've been slowly working on my own simple home "Alexa", using CMUSphinx for most of the speech recognition. Honestly, my most successful methods have involved the least amount of complex NLP.
Simply treating the sentence as a bag of words, looking for "on", "off", or "change" (and their synonyms) plus the presence of known smart objects, works extremely well. I could say "Hey Marvin, turn on the lights and TV", or "Hey Marvin, turn the lights and TV on", or even "Hey Marvin, on make lights and TV."
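Not the commenter's actual code, but the bag-of-words idea can be sketched in a few lines of Python (the keyword lists here are invented for illustration):

```python
# Rough sketch of the bag-of-words parser (keyword lists are made up).
STATE_WORDS = {"on": "on", "start": "on", "enable": "on",
               "off": "off", "stop": "off", "disable": "off"}
DEVICES = {"lights", "tv", "fan"}

def parse_command(sentence):
    words = sentence.lower().replace(",", " ").split()
    state = next((STATE_WORDS[w] for w in words if w in STATE_WORDS), None)
    devices = [w for w in words if w in DEVICES]
    return state, devices

# Word order doesn't matter, so all three phrasings parse identically:
parse_command("turn on the lights and tv")  # -> ("on", ["lights", "tv"])
parse_command("turn the lights and tv on")  # -> ("on", ["lights", "tv"])
parse_command("on make lights and tv")      # -> ("on", ["lights", "tv"])
```

Because nothing depends on word order or grammar, garbled recognizer output still parses as long as the key words survive.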
(It's named Marvin after the android from The Hitchhiker's Guide; my eventual goal is to have it reply with snarky/depressed remarks.)
Adding 30 seconds of "memory" of the last requested state also made it seem a million times smarter and turned requests into a conversation rather than a string of commands. If it finds a mentioned smart object with no state mentioned, it assumes the previous one.
"Hey Marvin, turn on the lights." lights turn on "The TV too." tv turns on
The downside to this approach: when I'd show it off to friends, it could mis-trigger. "Marvin, turn off the lights." (lights turn off) "That's so cool, so it controls your TV, too?" (TV turns off) But it was mostly not an issue in real usage.
Ultimately I've put the project on hold for now because I can't find a decent, non-commercial way of converting voice to text. I'd really rather not send my audio out to Amazon/Google/MS/IBM. Not just because of privacy, but also cost and "coolness" factor (I want as much as possible processed locally and open source).
CMUSphinx's detection was mostly very bad. I couldn't do complex NLP even if I wanted to, because it picks up broken/garbled sentences. I currently build a "most likely" sentence by looping through Sphinx's 20 best interpretations and grabbing all the words that are likely to be commands or smart objects. I tried setting up Kaldi but didn't get it working after a weekend, and I haven't tried again since. I don't really know of any other options aside from CMUSphinx, Kaldi, or a cloud SaaS.
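The n-best trick could look something like this rough sketch (the keyword set is made up, and in practice the hypotheses would come from Sphinx's n-best output rather than a hard-coded list):

```python
# Hypothetical sketch of mining the recognizer's n-best list.
KEYWORDS = {"on", "off", "change", "lights", "tv"}

def merge_nbest(hypotheses):
    """Collect command/device words appearing in any hypothesis, in order."""
    found = []
    for hyp in hypotheses:
        for word in hyp.lower().split():
            if word in KEYWORDS and word not in found:
                found.append(word)
    return found

# Even when every individual hypothesis is garbled, the command survives:
nbest = ["turn on the light switch",
         "tour on the lights",
         "turn own the lights"]
merge_nbest(nbest)  # -> ["on", "lights"]
```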
I've wanted to add a text messaging UI layer to it. Maybe I'll use that as an excuse to try playing with ParseySaurus.
> I've got the project on hold for now because I can't find a decent, non-commercial way of converting voice to text. I'd really rather not send my audio out to Amazon/Google/MS/IBM
Same concern here... so my voice-to-text method is Android's Google voice recognition, forced into offline mode. The offline mode is surprisingly good.
Re mis-triggers... I also have OpenCV running on the same Android device. It only activates voice recognition when I am actually looking directly at the device (an old phone).
> text method is via android's google voice - forced to offline mode. The offline mode is surprisingly good.
I actually tried this at one point with a wall-mounted tablet, before trying Sphinx. It is surprisingly good for offline use, probably the best offline recognition I've tried outside of dedicated software like Dragon. But it doesn't meet my open-source criterion, so I'm hoping to find something better.
I'll most likely give up on the requirement that it be local and open, and use Sphinx for hotword detection while sending the audio out to AWS for processing.
> Re mis triggers... I also have opencv running on the same android. It only activates the voice recognition when I am actually looking directly at the android device (an old phone).
That's an awesome idea :) I haven't gotten around to playing with anything vision-based yet, but I've thought of 'simple' projects like that which would add a lot to the perceived intelligence. Figuring out the number of people in a room would be another useful one, I think. The AI could enter a guest mode when there is more than one person in the room, or when it detects faces that aren't mine, or something similar.
With the leaps and bounds being made in ML these days, it can't be long before magnitudes-better open-source voice recognition becomes available. I gave Sphinx a try, but it was horribly disappointing.
For me, the combination of Google voice (offline) and the Ivona voice (Amy) is pretty damn good for my Android/Python/Arduino-based home AI.
Sounds interesting, do you have a writeup or some other details somewhere? (How do you force android voice recognition to work offline? Just block the phone from the internet?)
Kaldi is not a point-and-click solution; it's a toolkit for developing your own speech recognition system. That said, it makes things incredibly easy if you know what you're doing, as it brings all the necessary tools and even provides some data to train your models (see the associated resources at http://openslr.org/). Its performance is state of the art.
This was recently mentioned on HN, but I haven't really looked into it (apparently requires training your own models, but provides prepared scripts to do that for some common datasets): https://github.com/mozilla/DeepSpeech
Must have slipped past me last time it was posted on HN. Thanks for sharing! I'm going to add this to my list of things to try next time I'm inspired to work on this project again.