The first one to support a "conversation", even with the same level of intelligence and understanding as today, will be a MASSIVE improvement. If it would listen while talking so you could interrupt it, it would be so much more useful.
I feel like I want the opposite. For me, command lines work. They are precise, they are consistent, they are easy to reason about, and they are composable, which allows me to do very powerful things on the fly. There is a reason why I do not have a natural language interpreter wrapped around my Linux command line.
What a lot of these natural language systems do is they force me to intuit their interfaces instead of just looking them up. It's more complex, harder to learn, and much less powerful. What I want is very reliable, accurate voice detection with a well-designed, composable interface.
I want to treat Alexa like a computer, not like a person. People are not the most convenient interfaces to interact with; for me, computers are better to interact with than people are.
I understand that different people are in different positions, some people want to have a conversation, but you're not going to make a voice assistant that's good for me if you follow that goal. At some point the ecosystem needs to fracture and diverge so that normal people can use whatever NLP interface Google/Amazon/Apple spits out of their AI opaque-boxes, and people like me can use a voice interface that is designed around well-tested decades-old computer UX principles that have been proven to work well for power users.
My vision of a voice-operated utopia isn't treating Siri like a person, it's on-the-fly composing a complicated new task by voice that is saved for later use. It's using timers as a trigger for other tasks with some kind of pipe command so I can tell Siri to send an email after ten minutes, or so I can have Siri look up a search and seamlessly pipe the result into some kind of digital notebook.
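To make that less abstract, here is a rough sketch of the kind of composition I mean, assuming recognized voice commands map onto small functions that can be chained like a shell pipeline. Every name here (pipe, search, save_note, saved_tasks) is made up for illustration and is not any real assistant's API, and the timer trigger is left out to keep it short.

```python
from typing import Callable, Dict, List

# Hypothetical sketch: none of these names belong to a real Siri/Alexa API.
Command = Callable[[str], str]


def pipe(*stages: Command) -> Command:
    """Compose commands left to right, like `a | b` in a shell."""
    def run(value: str = "") -> str:
        for stage in stages:
            value = stage(value)
        return value
    return run


# Stand-in for "some kind of digital notebook".
notebook: List[str] = []


def search(query: str) -> str:
    # Placeholder for a real lookup; just returns a canned string.
    return f"results for '{query}'"


def save_note(text: str) -> str:
    # Append to the notebook and pass the text along unchanged.
    notebook.append(text)
    return text


# A task composed on the fly and saved under a name for later use.
saved_tasks: Dict[str, Command] = {"research": pipe(search, save_note)}

saved_tasks["research"]("waterproof watches")
```

The point is that each piece stays small and predictable, and the saved task is just another command that could itself be piped into something else.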
My issue is I want to open a "voice shell". I want the same obvious commands but with more rapid responses and without needing to say "Hey Siri, X" every time. There should be a natural way to enter a mode where the back and forth is still not natural language but voice.
This is another area where I feel like the good answer is not necessarily the technically complicated one.
Saying "hey Siri" is fine if I'm in bed or in the shower, I don't need quick access to a shell in those places necessarily. That's fine to have as a backup. But for normal operation, if I'm wearing a smartwatch, it will pretty much always be more convenient and faster for me to tap and hold on that watchface than it will be for me to say "hey Siri".
I mean, that's a boring answer, but there's also a reason why my computers have buttons. I wouldn't want to use my phone because that's in another room or in my pocket. But a watch will always be reachable in less than a second, and the modern watches are waterproof, and I don't need to look at anything to use it -- I can just tap my watchface and start talking. And if my hands are dirty, or I'm carrying groceries, or I'm in bed, falling back to "hey Siri" isn't the end of the world in those scenarios.
In practice, when I see people interact with voice assistants today, they stop what they're doing, they give the command, they listen for a confirmation, and then they start what they're doing again. The biggest bottleneck there for their speed is precision -- they intuitively know that they need to stop what they're doing and optimize for the device. The precision, and the delays that are built into the UX to confirm what's happening -- that's the bottleneck. So if there's an operating mode that is just as fast and way more precise, we should just do that, we don't need to use voice triggers 100% of the time.
Bonus points if we're wasting processing time for a voice assistant to make a round trip and process the audio clip to try and figure out who's speaking. The person who pressed their watch is speaking, boom, we can get rid of that response delay now. How much time are we wasting trying to come up with wake words that optimize for both speed and precision -- when using wake words only as a fallback would allow us to make them more precise because they could be longer, more deliberate phrases?
Alexa will shut up immediately if you say “Alexa shut up” or “Alexa SILENCE” when she’s talking. We got an Echo Dot which came with our Alexa-enabled microwave, and now that I have it set up to stream Apple Music, we use it all the time. Our thermostat also has it, although when we change the temperature she’ll sometimes inexplicably change the house temp to 60°F, which we won’t realize until 3am when we’re freezing in bed.
Well it’s also an oven, I often say “Alexa preheat the oven to 400” or “Alexa air fry for 10 minutes”. For soups I say “Alexa microwave at power level 3 for 9 minutes”, which avoids the exploding soup problem. Of course there’s also “bake at 400 for 30 minutes”.
I also have a five year old who is pretty competent at talking to Alexa even though she can’t intuit what the button controls mean on the microwave oven.
As someone who last used microwaves on a regular basis when they just had two dials, I could see the use for Alexa to figure out how the more modern ones with like 40 buttons work.
The first time I tried the office kitchen microwave, I had to ask someone from accounting who was standing around how to heat my cup, because that stupid thing just responded with a condescending beep to whatever I pressed :/
Was this a microwave where "hit digits to type in a time, then hit start" doesn't work? If yes then yikes. If no then you only have to learn it once, and trying to use Alexa sounds much more complicated.