I've seen so many Alexa projects at hackathons that were exciting at first glance but didn't go anywhere.
My favourite is event recommendations: lots of people try building an Alexa thing to help you find events, but it turns out listening to a bot read out a list of 20 things happening this weekend is way less useful than browsing a web page with the same information all on a screen at once.
Amazon's failure is not realising that Alexa's best use is as a universal remote control.
The interface with smart devices is good but clunky. While I can say "Alexa, open my curtains", and I can go to the slow and bloated Alexa app to schedule the curtains to open at a certain hour, I cannot say "Alexa, open my curtains tomorrow at 6:45am".
I would love an app that would allow better automation, like doing an action at a certain time and playing an alarm X minutes later. However, skills are voice-only and nearly useless.
> I cannot say "Alexa open my curtains tomorrow at 6:45am".
Works for me; however, I often say the time before the action, so something like: “Alexa, at 8pm turn on the living room lights”, but “Turn the heating off in 15 minutes” also works for me.
It then creates a timer to activate said device/group at the time requested. (Edit: I also use commands like "Alexa, turn on the lighthouse lamp until 11:30pm", which turns on that lamp and sets a timer to turn it off at 11:30pm)
However, all my devices use the native smart home stuff in Alexa and are exposed to Alexa via Home Assistant (so dunno if HA is doing some magic sauce to expose the devices in a certain way that allows such voice commands), so your mileage may (and indeed appears to) vary.
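As far as I can tell, the "magic sauce" is mostly just the Smart Home Skill API: whatever bridges your devices (HA or anything else) only answers discovery and control directives, and Alexa itself layers the "at 8pm" / "until 11:30pm" timers on top. A rough sketch of what that exposure looks like (this is not Home Assistant's actual code, and the endpoint name is made up):

    # Hypothetical sketch of a Smart Home Skill Lambda: the bridge only
    # describes endpoints and their capabilities; timers like "until 11:30pm"
    # are handled by Alexa on top of these directives.
    import uuid

    def lambda_handler(event, context):
        directive = event["directive"]
        header = directive["header"]

        if header["namespace"] == "Alexa.Discovery" and header["name"] == "Discover":
            return {
                "event": {
                    "header": {
                        "namespace": "Alexa.Discovery",
                        "name": "Discover.Response",
                        "payloadVersion": "3",
                        "messageId": str(uuid.uuid4()),
                    },
                    "payload": {
                        "endpoints": [{
                            "endpointId": "lighthouse-lamp",    # made-up ID
                            "friendlyName": "lighthouse lamp",  # the name you speak
                            "description": "Lamp exposed via the bridge",
                            "manufacturerName": "example",
                            "displayCategories": ["LIGHT"],
                            "capabilities": [{
                                "type": "AlexaInterface",
                                "interface": "Alexa.PowerController",
                                "version": "3",
                                "properties": {
                                    "supported": [{"name": "powerState"}],
                                    "retrievable": True,
                                },
                            }],
                        }]
                    },
                }
            }
        # TurnOn / TurnOff and state-reporting directives omitted for brevity.
        return {"event": {"header": header, "payload": {}}}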
Granted sometimes it mistakes my intentions and I need to repeat myself.
I’ve also been able to create routines via voice; I first discovered that when Alexa suggested it could create the automation for me after it performed a request.
Another Edit: When saying "At X do Y", Alexa can only do that action if it's within the next 24 hours; guessing that's its timer limit. I also tested saying "Alexa, every day at 6pm turn on living room lights" and it created a routine for that action. I was then able to disable that routine by saying "Alexa, delete the living room lights routine", however it just disabled it rather than deleting it like I had requested (and checking the transcript in the app, it did pick up that I said the word delete and hadn't misheard me).
> I cannot say "Alexa open my curtains tomorrow at 6:45am"
The "funny" thing is Google Home could do this (Turn on/off device after 8 hours) but recently they've removed the option & the assistant refuses. I assume it's for liability purposes, but it feels stupid.
Maybe it's my imagination, but Google Home could recognize a lot of things that it just can't do, or can't do well, anymore. It gives the impression that they just downgraded the whole infra behind it to run an inferior, cheaper model, hoping people will eventually stop using it so they can pull the plug altogether.
If I asked my Echo to close my curtains, it would respond that it doesn’t know about a device with that name. If I then ask it what devices it does know about, it won’t give me an answer.
Will the Google Home tell you the device names it can see?
I don't think it can do that. Asking variations of "what devices/controls/lights can you see/access/control" just gives a Google search result for it. Quite a stupid design flaw, imo.
If 'Alexa, open my curtains' already works for you, then creating a routine containing the same command to be triggered daily at the needed time should solve your problem.
But I cannot create a routine or change its time with my voice.
I can say "Alexa, set my alarm tomorrow at 7 in the morning". I wish I could say "Alexa, open my curtains at 6:45 in the morning tomorroe" instead of having to go to my cellphone and use the clunky Alexa app.
It just seems like a wasted opportunity for a routine. I'm sure a lot more things would be available if Amazon let developers use more features.
Seems like it'd get unwieldy fast. You'd need an efficient interface for listing and editing registered tasks, or you might get stuck with a couple of hundred tasks of the type 'Alexa, say fart fart fart at three/four/five in the morning' if a kid were ever to pass through the house.
We're not asking to construct a complex schedule via voice command; the app is sufficient for that. Sometimes you just want your curtains opened at 6:45 tomorrow because something is happening and you just thought of it. That would actually be convenient.
And it's not even possible to enable this with Alexa routines, which should work as a "hobbyist/enthusiast" version of skills development, but are in reality extremely limited.
e.g. As a routine writer, you could include a command within a routine to turn on lights at 6am. However, it would be fixed at 6am whenever the routine runs, with no flexibility, since there's no concept of variables or voice input available to routines. A more sophisticated platform would offer more flexible commands, e.g. "turn on lights at {voice_input}".
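For contrast, a custom skill (not a routine) can already capture a spoken time via the built-in AMAZON.TIME slot and do what it likes with it; the catch is you have to invoke it by name and wire up the device control yourself. A rough sketch with the ASK SDK for Python, where the intent name, slot name, and scheduling helper are all hypothetical:

    from ask_sdk_core.skill_builder import SkillBuilder
    from ask_sdk_core.dispatch_components import AbstractRequestHandler
    from ask_sdk_core.handler_input import HandlerInput
    from ask_sdk_core.utils import is_intent_name
    from ask_sdk_model import Response

    class CurtainScheduleIntentHandler(AbstractRequestHandler):
        """Handles "open my curtains at {time}" for a hypothetical custom skill."""

        def can_handle(self, handler_input: HandlerInput) -> bool:
            return is_intent_name("CurtainScheduleIntent")(handler_input)

        def handle(self, handler_input: HandlerInput) -> Response:
            slots = handler_input.request_envelope.request.intent.slots or {}
            # AMAZON.TIME resolves "six forty-five in the morning" to e.g. "06:45".
            spoken_time = slots["time"].value if "time" in slots else None
            if not spoken_time:
                return (handler_input.response_builder
                        .speak("What time should I open the curtains?")
                        .ask("Say a time, for example six forty-five a.m.")
                        .response)
            schedule_curtain_open(spoken_time)  # your own backend, not Alexa's
            return (handler_input.response_builder
                    .speak(f"Okay, I'll open the curtains at {spoken_time}.")
                    .response)

    def schedule_curtain_open(time_str: str) -> None:
        # Placeholder: hand the time to whatever actually drives the curtains
        # (Home Assistant, a cron-style scheduler, etc.).
        print(f"scheduling curtain open at {time_str}")

    sb = SkillBuilder()
    sb.add_request_handler(CurtainScheduleIntentHandler())
    handler = sb.lambda_handler()

Routines can't do this precisely because there is nowhere for the {time} value to go.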
The core problem lies in latency and information bandwidth - eyes (which take in the "gestalt") are just far superior to reading or listening to streaming text (which is linear).
The one main situation where NL interfaces are superior is when you are mobile (like driving) or your hands are tied up.
I think this affects GPTs just like it did Alexa. Which means that GPTs aren’t the final UI. The real innovation will be in the right AI UX.
The core problem is that these systems are just so incorrect in fundamental ways that they're effectively useless.
Imagine a buddy of yours tells you about an event he's pretty sure you'll be interested in. Why does he tell you about this event? Well, he knows your interests, what kind of things you enjoy, when you're free, who you might want to go to the event with, how much money you're willing to spend, how far you're willing to travel, when you like to go out... So when you're on the receiving end of such a suggestion it often feels great! It's like you've struck gold.
Now imagine your average 'AI' powered recommendation engine reading you a list of events. It doesn't feel magical. It doesn't even feel like it knows what the hell you enjoy doing half the time. Forget about knowing about your free time, budgetary restrictions, family restrictions, who you'd be able to go with; none of that stuff is even sort of in the picture. And it's all delivered to you in a voice that sounds like it would be as happy to kill you as give you advice. There's no lively back and forth on the logistics of the event. No feeling of discovery as you two talk it out, honing the plan that brings it from an abstract concept to reality.
I agree with you, and I cringed a little when I read the following from the OP:
> There are some fundamental reasons why conversational 3rd party platforms are hard.
In my mind the big fundamental problem here is the "3rd party". I'd love to have an "AI assistant" or an "AI buddy" that could watch everything I do and say and write and really get to know me super well... as long as I can be confident that I own and control everything it observes and learns. I sure as hell don't want a 3rd party involved! But alas, I don't see a way we get there that doesn't involve Amazon or Meta or Google or OpenAI sitting between me and my "AI" tools, at least in the short run.
Fwiw this is what I assume Apple's long-term AI strategy is.
Let the hype-funded unicorns fight to develop (& end up commodifying) the tech, and then design/sell devices that can support it locally. In that world, the AI assistant that you buy is a discrete piece of hardware, rather than a software treadmill.
Of course, this could mean that you end up on a hardware treadmill, but I think that's probably less bad, granted we can do something about the e-waste.
You're comparing a person recommending a single event versus an AI providing a list. In other words, proving OP's point.
GUIs provide information in 2D, letting eyes skim and bypass information that's not useful.
VUIs provide information in 1D, forcing you to take information in a linear stream and providing weak at best controls for skipping around without losing context or place.
Not coincidentally, this is why I absolutely hate watching videos for programming and news. If it's an article or post or thread, I can quickly evaluate if it has what I want and bypass the fluff. Videos simply suck at being usable, unless they're for something physical like how to work on a particular motor or carve a particular pattern into wood.
The person you are replying to is arguing the opposite - a future VUI could know your interests and just read you out the one relevant event rather than reading a list.
Alternatively it could summarise into “there’s a standup comedy gig, a few bands and various classes - what sort of thing are you looking for?” and then discuss with you to find the 2-3 events that are most relevant rather than reel off one big list.
And if it were a visual display of those things, I would have honed in on what I wanted and gotten my answer in the time it took me to ask what I was looking for.
It may be fantastic as an aid for low- and no-sighted people, but so long as I can read, a VUI is strictly inferior.
I would still prefer it to send it to me on my mobile device display. Voice interaction is nice for accessibility, but the first method of control (whatever it is) is faster.
The point I'm trying to make is that this thing we're calling a 'VUI' is shit. There's no reason speech has to be this boring one dimensional thing. It's like the people that designed these things have never had a real conversation in their lives. When you're speaking with another person, or multiple other people, you're constantly exchanging cues that allow the other person to understand and re-calibrate what they're saying. These are verbal sounds, non-verbal sounds and physical movements. A crinkle of the forehead, a shake of the head, an uttered 'aaaaah' or a quiet verbal affirmation in support of what's being stated. It's not a single uni-directional stream of information, it's a multi-directional stream coming from multiple multi-modal sources at the same time.
None of these basic realities are accounted for in current technology. Instead we have these dumb robot voices reading us results from a preprocessed script that it thinks answers our question. No wonder the monkey part of our brain immediately picks up on the fact that this whole facade isn't just a lie, but an excruciating lie. It's excruciating because it's immediately obvious that there's nothing else 'there' to interact with. Even when speaking to another person over the phone, there's a huge amount of nuance you can pick up on. Are they happy? Are they sad? Are they frazzled? Are they in a rush? Are they relaxed? And you automatically calibrate your responses and what you say in the conversation based on all of these perfectly obvious things. Normal humans automatically calibrate what they say, how they sound, what they suggest based on these cues. It works really well!
There's no reason voice stuff has to suck. It has worked pretty great for humans for thousands of years. We're evolutionarily tuned to it. It's just that all the technology we've created around it totally sucks and people are delusional if they think it's anywhere near prime time.
This is all technically possible, but then there's also the privacy/security aspect. Many who would actually be into a solution like this won't be too keen to share the necessary information in the first place. And with good reason: companies with the resources to provide a decent experience don't have the best track record of protecting user data and ensuring only the user has control of it. The privacy conscious would rather self-host, and end up losing out on capabilities in the process. So it's a sort of catch-22.
Part of the big challenge here, in my mind, is that companies are reluctant to put data into the world for others to consume in a friendly way. If, say, event organisers put data out as open APIs, there would be the opportunity for a self-hosted or "convenient" third party (à la Google, Amazon) to create conversational experiences on top of it - both the private user and the privacy-uninterested user are well served (as is the "event seller", as it's exposed to more people). But as long as we're stuck with systems having to pull data by web scraping, no one can build a good solution that could work for either scenario.
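To make that concrete, here's a rough sketch of what consuming such an open API could look like; the endpoint, parameters, and fields are all made up, since no such standard exists today. The point is that the preference matching can happen on the caller's side, so a self-hosted assistant never has to hand its profile of the user to the platform:

    import requests

    def recommend_events(city: str, interests: set[str], max_results: int = 3) -> list[dict]:
        # Hypothetical open endpoint; a real platform would have to publish
        # something like it for third parties (or self-hosters) to build on.
        resp = requests.get(
            "https://events.example.org/api/v1/events",
            params={"city": city, "period": "this_weekend"},
            timeout=10,
        )
        resp.raise_for_status()
        events = resp.json()["events"]

        # Filter against a locally held interest profile; nothing about the
        # user leaves the machine making the request.
        matches = [e for e in events if interests & set(e.get("tags", []))]
        return matches[:max_results]

    if __name__ == "__main__":
        for event in recommend_events("Berlin", {"jazz", "climbing"}):
            print(event["title"], event["starts_at"])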
In the case of events, I'm not so sure. Ticketing platforms want to drive sales via easy discovery and wide distribution. Take Eventbrite as an example (disclaimer: I used to work there). They
> Now imagine your average 'AI' powered recommendation engine reading you a list of events.
I think the issue is that having AI just fetch information catered towards everyone and no one in particular is not really using AI at all. I'm sure the hackathon groups pitching it all started with an idea of building a highly trained AI system whereby the recommendations are meaningful reflections of whatever information it has about you. Unfortunately for them, the most lucrative part of their plan neglected both the difficulty of creating an AI pipeline that takes many piecemeal inputs (millions of them with missing values represented some way) and renders meaningful outputs, and the fact that success would only reap a massive backlash from privacy advocates.
In the end, their plan for a super-intelligent life assistant turns into just fetching event lists from Facebook (or elsewhere) without even using the demographic data it has.
I agree, the real deal would be an AI assistant that actively learns and remembers everything important about you, and can access and utilize all that information when having a conversation, while also having access to external services, your calendar and so on.
I wouldn't be comfortable running that as a cloud-service though. Should be open source and run at home on my own machine.
Right, because if someone else is running it, you can be sure you're getting served thinly veiled ads, something someone else wishes you'd want, rather than something you really like.
To a point, yes, but also the comprehension isn't there. I have Alexa and a goodly amount of automation in my home, and its job 99% of the time? Let me turn the lights off after I'm cozy in bed, set timers for the kitchen, and tell me the weather forecast for the day. Things where its interface is the appeal: I can do all those things without pulling out my phone, getting up for a switch, or stopping from getting dressed.
That is the value, not in having conversations with the bot -- Google got closer with its assistants, but only because it has a creepily deep profile on its users, so conversations with Google had more 'context' than Alexa ever could.
I tried to ask Alexa for the weather yesterday (I wanted to know if it had rained overnight to decide which shoes to wear to walk the dog), and it first gave me today's weather, then told me it only knew weather for the next 14 days - yesterday was too hard to predict, I guess?
But that's the point - simple tasks where the interface is circumstantially superior? Awesome. But if I want to just chat with my computer, that novelty wore out with ELIZA & Dr. Sbaitso. The conversations with ChatGPT are deeper, but no more meaningful.
So what can Alexa do to make my life simpler? Don't read me Wikipedia pages in response to -anything-; if you can't summarize it in two sentences, say it's a long answer and offer to send it to my phone if I'd prefer. Make the interactions short and sweet. Control the lights, make me coffee, walk my dog, order more coke. I don't need it to have "new skills" - I just want it to be better at the ones I actually want to use.
One of their best devices appears to be in the "Temporarily out of stock" category now: the Echo Wall Clock. It pairs with an Echo device and provides a visual representation of the countdown timer.
This reminds me of the "ambient" design philosophy.
With push (notifications), it's interrupting. Alexa isn't too bad about that since it's a ring color / icon on a screen. With pull, it's "I need to fetch this data". With ambient, it's there if you want it when you want it.
I don't need to ask how much longer on the timer with the clock - it's there at a glance.
Unfortunately, Amazon has been making the devices (especially ones with a screen) into an advertising channel to the point I'm looking at replacing the various echo show devices with just echos (and a clock if they ever come back into stock).
On reading Wikipedia... Alexa used to have a knowledge engine somewhere in its code. You could ask it what color a black cat was and get back "a black cat is black." You could ask "what color is a light red flower" and get back "a pink flower is pink." Asking "what color is a blue bird" gets back "a blue bird is blue, brown, and white." That hinted at a deeper knowledge engine. There was also an inventor <-> invention knowledge base. One time I even had it return back part of the query language by asking it if two people (who were born on the same day) were born on the same day. The knowledge base functionality appears to have been delegated to "search Alexa answers".
It has gotten decidedly worse over the years as useful functionality that had no revenue associated with it got removed while "revenue enacting" features (pushing fire tv, product advertisements and such) have been prioritized.
Many of my Echo Shows are now "face down" because the screen cycles through stuff I do not care about, distractingly fast, right in the corner of my eye. Time and weather are better served by my watch now.
It still does timers and reminders acceptably well.
>The core problem lies in latency and information bandwidth
I actually think the core problem is that Amazon just isn't competent at deriving insights from consumer behaviour data.
If I buy a vacuum cleaner on Amazon, based on the Amazon web store recommendations I fully and without a shred of sarcasm expect Alexa to think that I've developed a vacuum collecting habit and recommend a vacuum conference if I ask it for events.
The problem with that is that this is the objectively "correct" recommendation. Someone who just bought one obviously had an interest in one, and people do return products to get different/better ones.
If it looks stupid but it works.... (Amazon has all the data to check that it indeed works)
For soft goods, maybe. But for durable goods it is objectively the wrong decision. They ought to know if you have initiated a return, and act on it, but they don't.
What percentage of Amazon shoppers make repeated, back-to-back purchases of washing machines or kitchenaid mixers? I'm certain it's vanishingly small.
What percentage of people who just bought a kitchenaid mixer would be interested in baking pans, or whatever? Probably more. But if you buy a kitchenaid mixer, tHe AlGoRiThM just sends you ads for more kitchenaid mixers.
Conversational interfaces are superior... when the counterparties have context.
In the parent situation, I don't want a GPT spoken interface to give me the top 20 events: I want it to give me 1-2 events that I am most likely to enjoy.
In the same way that actual conversations take into account tone, facial expression, etc. to jump straight to important information.
I thought that's where Google was going with their "we have all your data because we run all your services", but it seems they Microsoftified before they could get services cooperating for the larger good.
> Conversational interfaces are superior... when the counterparties have context.
Why? I find conversational interfaces poor for common data retrieval. I can read faster than you can speak. I can type faster than I can speak. I'm staring at a screen 14 hours a day anyway. Just show me the list of 20 events and sort it by what I am most likely to enjoy. Provide links for more information. Show me visual promotional materials. If I need to cross reference it with my calendar it's easier if all the information is visual.
When I got to this point in your comment, I remembered the Seinfeld episode where Kramer was recreating moviefone and tried to speak a trailer, with mouth-sound-effects.
“I want it to give me 1-2 events that I am most likely to enjoy.”
That's the core property of voice interfaces, and I see surprisingly little awareness of it: it does not matter if it's a phone menu beep code tree or a GPT or a Star Trek ship's computer - the low bandwidth linearity of the readout will never go away.
This is what makes voice interfaces so hugely attractive for the "searchy advertisial complex": if you haven't bought enough ads that the almighty relevance algorithm (1) puts you in the top spot you're out. What used to be the first page on the web is the top spot in voice, second place is first loser. No amount of intelligence can ever change that, voice interface implies handing over control in an unprecedented way.
((1) technically, claims that result ranks are not sold aren't exactly lies, when result ranks don't go to the highest bidder. But that does not mean that ad spend isn't a contributing signal in any number of deeper layers so in the end results appear indistinguishable from highest bidder, only that buyers don't get any contractually guaranteed result list exposure for their money)
> I don't want a GPT spoken interface to give me the top 20 events: I want it to give me 1-2 events that I am most likely to enjoy.
This is fundamentally impossible for a computer though, because even if a computer has perfect historical information about you it can't know some random things that would change your mind in the moment. For example, if you've been to every gig a band has done for years, but at the last one your girlfriend dumped you, a recommendation engine is still going to suggest that band's gigs even though it's unlikely you want to be reminded about them. To most users that immediately looks like a bad recommendation. If the system is only suggesting 1 thing then the whole system looks broken.
The only way around that is to increase the number of recommendations. Hence every system giving out 10+ options - because the people who make it want it to be slightly better than useless.
Not fundamentally impossible. Just as a friend may know not to recommend the gig as they're aware of the breakup, so can an AI. Heck after conversing with Pi a few weeks ago, I'm fairly convinced it'd at this point be able to handle that kind of nuance if it had access to the relevant data.
"I've recognized you removed all future calendar events related to {girlfriend} and your recent text messages concerning her had a negative sentiment. Did you break up?"
Not the world I'd want to live in... but for people less concerned about their data, I can't say it wouldn't be useful!
> For example, if you've been to every gig a band has done for years, but at the last one your girlfriend dumped you, a recommendation engine is still going to suggest that band's gigs even though it's unlikely you want to be reminded about them.
I mean sure, but just think about it, wouldn't the same happen if you have a friend telling you about the event? Or if you had an attentive concierge trying to organise programs for you? How would you like them to handle it? Not by blindly listing more programs that is for sure.
"Hey you love Blasted Monkeys. Did you know they are having a gig this weekend?"
"Nah, man. We had a bad breakup with Samantha at their last one. And besides it was really her thing and I was just tagging along."
"Oh, that's rough. I didn't know. Blast those monkeys then. How about a day a the beach then? There will be a surf class at ..."
This is the kind of interface a spoken event recommender should have. Is this much harder than just listing events? Yes, it is much harder. The problem is that if you don't go all the way then it falls into a weird uncanny valley. It feels like you are talking with a human, but a very stupid one.
Conversational interfaces are superior as the input method, but text and images are always superior as the output method (except when driving without FSD, or in other rare scenarios).
Yes the latency was annoying, but the "AI" part was severely lacking. You couldn't just say _anything_, it had to be formatted the way Alexa expected.
You also couldn't really interrupt it, it wasn't a "natural" conversation. I knew what it was going to say and just wanted to reply already, but it only listens when it finishes speaking.
It also talks incredibly slowly; you could 5x the playback speed and it would probably still be understandable.
If all those things were fixed, I'd say it could make an okay product. Not great, but at least okay.
With a list on paper / screen I can compare things, go back & forth, do other stuff (like check my calendar), and the list is exactly where it was, trivial to pick up again.
Listening to it? What a waste of my time and focus on something trivial and already solved.
And use this while driving, as some others mention? Sorry, not a fan to say the least; it definitely impairs everybody who is driving to a certain extent. Humans simply don't have efficient parallel processing of things that require focus in this way.