Hacker News

I don't know why, but the approach where "agents" accomplish things by using a mouse and keyboard and looking at pixels always seemed off to me.

I understand that in theory it's more flexible, but I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf. And the user can add/revoke privileges from agents at any point. Kind of like OAuth scopes.

Imagine having "app stores" where you "install" apps like Gmail or Uber or whatever on your agent of choice, define the privileges you wish the agent to have on those apps, and bam, it now has new capabilities. No browser clicks needed. You can configure it at any time. You can audit when it took action on your behalf. You can see exactly how app devs instructed the agent to use it (hell, you can even customize it). And, it's probably much faster, cheaper, and less brittle (since it doesn't need to understand any pixels).

Seems like better UX to me. But probably more difficult to get app developers on board.
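A minimal sketch of what such a grant record could look like, in the spirit of OAuth scopes. Everything here (class names, action names, the audit format) is invented for illustration; no real agent platform works this way yet:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical capability grant: the user "installs" an app on their
# agent and limits which pre-approved actions it may invoke. Every
# invocation is audit-logged; privileges can be revoked at any time.
@dataclass
class CapabilityGrant:
    app: str                      # e.g. "gmail" or "uber"
    actions: set[str]             # pre-approved actions the user allowed
    audit_log: list[str] = field(default_factory=list)

    def invoke(self, action: str) -> bool:
        allowed = action in self.actions
        self.audit_log.append(
            f"{datetime.now(timezone.utc).isoformat()} {action} "
            f"{'OK' if allowed else 'DENIED'}"
        )
        return allowed

    def revoke(self, action: str) -> None:
        self.actions.discard(action)

grant = CapabilityGrant(app="gmail", actions={"read_inbox", "send_mail"})
print(grant.invoke("read_inbox"))   # True
grant.revoke("send_mail")
print(grant.invoke("send_mail"))    # False, and the denial is auditable
```

The point is that "audit when it took action on your behalf" falls out for free once actions are explicit, instead of being buried in a stream of simulated clicks.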


> But probably more difficult to get app developers on board.

That's it. The problem is getting Postmates to agree to give away control of their UI, giving away their ability to upsell you and push whatever makes them more money. It's never going to happen. Netflix still isn't properly integrated with Apple TV because they don't want to give away that access.

I'm not convinced this is the path forward for computers either though.


This is classic disruption vulnerability creation in real time.

AIs are (just) starting to devalue the moat benefits of human-only interfaces. New entrants that preemptively give up on human-only "security" or moats have a clear new opening at the low end, especially with development costs dropping (specifics of the product or service permitting).

As for the problem of machine attacks on machine-friendly APIs:

At some point, the only defense against attacks by machines will be some kind of micropayment system: payments too small to be relevant to anyone getting value, but that don't scale for anyone trying to externalize costs onto their target (which is what all attacks essentially are).


Internet subscription, anyone? Access over 500 websites for $39.99 a month.

I am thinking cents per some small usage unit, refundable for operating within a site's terms.

That convention, implemented well to distribute and decentralize spike impacts, would force any direct overuse attack to take on significant financial risk, while costing essentially nothing for everyone else.

It might still be damaging to availability, but as a service provider I would rather get paid handsomely for periods of being too overwhelmed to service my legitimate customers than not.

The main benefit is having machine interfaces, but those kinds of attacks being heavily disincentivized.
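The refundable-deposit idea above can be sketched in a few lines. The deposit amount, class names, and the binary "well-behaved" signal are all illustrative assumptions, not a real payment protocol:

```python
# Sketch of a refundable per-request micropayment: every machine call
# posts a tiny deposit, refunded if the caller operated within the
# site's terms. Legitimate use nets out to zero; flooding puts real
# money at risk and pays the overwhelmed provider.

DEPOSIT_CENTS = 2  # hypothetical per-request deposit

class MeteredEndpoint:
    def __init__(self):
        self.escrow = {}   # caller -> cents currently held
        self.revenue = 0   # forfeited deposits, in cents

    def request(self, caller: str, well_behaved: bool) -> None:
        self.escrow[caller] = self.escrow.get(caller, 0) + DEPOSIT_CENTS
        self.escrow[caller] -= DEPOSIT_CENTS
        if not well_behaved:
            self.revenue += DEPOSIT_CENTS  # deposit forfeited to the site

api = MeteredEndpoint()
for _ in range(1000):          # a legitimate agent: all deposits refunded
    api.request("legit-agent", well_behaved=True)
for _ in range(100_000):       # a flood attack: every deposit forfeited
    api.request("flooder", well_behaved=False)

print(api.revenue)  # 200000 cents: the flood cost the attacker $2,000
```

This is the "paid handsomely for periods of being overwhelmed" outcome: availability may still suffer, but the attacker bears the cost instead of externalizing it.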


And it's why you can't have a single messaging app that acts as a unified inbox for all the various services out there. XMPP could've been that but it died, and Microsoft tried to have it on Windows Phone but the messaging apps told them to get fucked.

Open API interoperability is the dream but it's clear it will never happen unless it's forced by law.


> I'm not convinced this is the path forward for computers either though.

With this approach they'll have to contend with the agent running into all the anti-bot measures that sites have implemented to deal with abuse. CAPTCHAs, flagging or blocking datacenter IP addresses, etc.

Maybe deals could be struck to allow agents to be whitelisted, but that assumes the agents won't also be used for abuse. If you could get ChatGPT to spam Reddit[1] then Reddit probably wouldn't cooperate.

[1] https://gizmodo.com/oh-no-this-startup-is-using-ai-agents-to...


> With this approach they'll have to contend with the agent running into all the anti-bot measures that sites have implemented to deal with abuse

I expect many more sites to adopt login requirements. This has the added benefit of more tracking/marketing data.


The solution is simple, and it's what's already done with search by proprietary LLMs: reasoning happens on the LLM vendor's servers, tool use happens client-side. Whether for search or "computer use", the websites will register activity coming from the user's machine, as it should be, because LLMs act as User Agents here.

Of course, already with LLM-powered search we see a growing number of people doing the selfish/idiotic thing and blocking or poisoning user-initiated LLM interactions[0]; hopefully LLM tools following the practice above will spread quickly enough to beat this idea out of people's heads.

--

[0] - As opposed to LLM-company crawlers that scrape the web for training data; blocking those is fine and follows the cultural best practices of the web, which have held for decades now. But guess what: LLM crawlers tend to obey robots.txt. The "bots" that don't are usually the ones performing a specific query on behalf of a user; such bots act as User Agents, and neither have nor ever had any obligation to obey robots.txt.
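The crawler/User-Agent distinction can be checked mechanically with the standard library. The crawler name and robots.txt rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt for a site opting out of training-data crawling.
rules = [
    "User-agent: ExampleTrainingCrawler",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# A training-data crawler is expected to consult robots.txt and back off:
print(rp.can_fetch("ExampleTrainingCrawler", "https://example.com/post"))  # False

# A fetch made live on a user's behalf acts as a User Agent and, like a
# browser, has no obligation to consult robots.txt at all.
```

Note the asymmetry: robots.txt is a convention for crawlers, enforced only by the crawler's own good behavior; it was never a mechanism that applies to user-initiated requests.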


APIs have an MxN problem. N tools each need to implement M different APIs.

In nearly every case (that an end user cares about), an API will also have a GUI frontend. The GUI is discoverable, able to be authenticated against, definitely exists, and generally usable by the lowest common denominator. Teaching the AI to use this generically, solves the same problem as implementing support for a bunch of APIs without the discoverability and existence problems. In many ways this is horrific compute waste, but it's also a generic MxN solution.


But if you have an AI, then all that's needed to implement an API is documentation.

> I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf

OS-specific, but Apple has the Scripting Support API [0] and Shortcuts API for their apps. Works great.

[0]: https://developer.apple.com/documentation/foundation/scripti...


AppleScript support has sadly become rarer over time though, as more and more companies dig moats around their castles in an effort to control and/or charge for interoperability. Phoned-in cross-platform ports suffer this problem too.

Yep, and on Windows this is exposed through the COM API.

> the approach where "agents" accomplish things by using the browser/desktop always seemed off to me

It's certainly a much more difficult approach, but it scales so much better. There's such a long-tail of small websites and apps that people will want to integrate with. There's no way OpenAI is going to negotiate a partnership/integration with <legacy business software X>, let alone internal software at medium to large size corporations. If OpenAI (or Anthropic) can solve the general problem, "do arbitrary work task at computer", the size of the prize is enormous.


A bit like humanoid robotics: not the most efficient, cheapest, or easiest, but highly compatible with existing environments designed for humans, and hence it can be integrated very generically.

This is true, but what would make sense to me is if "Operator" were just another app on this platform, kind of like Safari is just another app on your iPhone that lets you use services that don't have iOS apps.

When iPhones first came out I had to use Safari all the time. Now almost everything has an app. The long tail is getting shorter.

You can even have several Operator-y apps to choose from! And they can work across different LLMs!


I am more interested in Gemini's "Deep Research" feature than Operator. As a ChatGPT subscriber I wish they'd build a similar product.

Even when it comes to shopping, most of the time I spend is on researching alternatives according to my desired criteria. Operator doesn't help with that. o1 doesn't help because it's not connected to the internet. GPT-4o doesn't help because it struggles to iterate or to perform more than one search at a time.


Disco it; currently the Nordstrom catalog is LLM-searchable:

https://www.ddisco.com/


That's specifically what I'm working on at Unternet [1], based on observing the same issue while working at Adept. It seems absurd that in the future we'll have developers building full GUI apps that users never see, because they're being used by GPU-crunching vision models, which then in turn create their own interfaces for end-users.

Instead we need apps that have a human interface for users and a machine interface for models. I've been building web applets [2] as a lightweight protocol on top of the web to achieve this. It's in early stages, but I'm inviting the first projects to start building with it and accepting contributions.

[1]: https://unternet.co/

[2]: https://github.com/unternet-co/web-applets/


If there were pre-approved standardized actions, it would just be a plain old API; it would not be AGI. It's clear the AI companies are aiming for general computer use, not just coding against pre-approved APIs.

Naturally a "capability" is really just API + prompt.

If your product has a well documented OpenAPI endpoint (not to be confused with OpenAI), then you're basically done as a developer. Just add that endpoint to the "app store", choose your logo, and add your bank account for $$.
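Under the "capability = API + prompt" framing, an agent "app store" listing could be little more than a pointer to the spec plus usage instructions. A hypothetical sketch (every field name and URL here is invented):

```python
# Hypothetical agent app-store listing: a capability reduces to an API
# spec the agent can call, plus a prompt telling the model how to use it.
listing = {
    "name": "Acme Rides",
    "openapi_url": "https://api.acme-rides.example/openapi.json",
    "prompt": (
        "Use POST /rides to book a ride. Always confirm the price "
        "with the user before booking."
    ),
    "scopes": ["rides:quote", "rides:book"],  # user-grantable privileges
}

def capability(entry: dict) -> tuple[str, str]:
    """A capability is just (machine-readable API, instructions)."""
    return entry["openapi_url"], entry["prompt"]

spec_url, instructions = capability(listing)
print(spec_url)
```

This is also where the original comment's "you can see exactly how app devs instructed the agent" lands: the prompt is just another inspectable, user-overridable field.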


Actually I suspect that's where companies like Apple are going. If you look at the latest iteration of App Intents, Apple is trying to define a predefined set of actions that developers can implement in their apps. In turn, Apple Intelligence/Siri can pretty much leverage said intents when the user prompts a given task. It's still fairly early, but I could see how this would indeed converge toward that sort of paradigm.

> but I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf

I sincerely hope it's not the future we're heading to (but it might be inevitable, sadly).

If it becomes a popular trend, developers will start making "AI-first" apps that you have to use AI to interact with to get the full functionality. See also: mobile first.


Why would developers do that?

The developer's incentive is to control the experience for a mix of the users' ends and the developer's ends. Functionality being what users want and monetization being what developers want. Devs don't expose APIs for the same reason why hackers want them - it commodifies the service.

An AI-first app only makes sense if the developer controls the AI and is developing the app to sell AI subscriptions. An independent AI company has no incentive to support the dev's monetization and every incentive to subvert it in favor of their own.

(EDIT: This is also why AI agents will "use" mice and keyboards. The agent provider needs the app or service to think they're interacting with the actual human user instead of a bot, or else they'll get blocked.)


Because Apple. Apple has the power over developers, not the other way around, and it has shown quite strong interest in integrating AI into its products.

For example, by guiding your users to your app instead of your website, you immediately "lose" 30% of your potential revenue from them. On paper it sounds like something no one would ever do. But in reality most developers do that.


Maybe there's a middle ground: a site that wants to work as well as possible for agents could present a stripped-down standardized page depending on the user agent string, while the agent tries to work well even for pages that haven't implemented that interface?

(or, perhaps, agents could use web accessibility tools if they're set up, incentivizing developers to make better use of them)
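That middle ground is essentially content negotiation on the User-Agent header. A toy sketch, where the `"Agent/"` token and the payloads are invented for illustration:

```python
# Sketch of user-agent-based content negotiation: serve a stripped-down,
# structured page to declared agents and the full HTML to everyone else.
FULL_PAGE = "<html>...menus, banners, upsells...</html>"
AGENT_PAGE = '{"actions": ["search", "add_to_cart", "checkout"]}'

def render(user_agent: str) -> str:
    if "Agent/" in user_agent:     # hypothetical agent self-identification
        return AGENT_PAGE
    return FULL_PAGE

print(render("Mozilla/5.0 (X11; Linux x86_64)"))   # full human page
print(render("Agent/1.0 (on-behalf-of-a-user)"))   # structured actions
```

The agent still falls back to driving the full page when a site hasn't opted in, so neither side is blocked on the other adopting the convention.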


> But probably more difficult to get app developers on board.

You answered your own question. You have to build the ecosystem if you want to have the facilities your comment outlines.

Whereas the facilities are already in place for "Operator"-like agents.

Even better, it will be difficult for companies that object to users accessing their resources in this fashion to block "Operator"-like agents.


You could make a similar argument for self-driving cars. We would have gotten there more quickly if the roads had been built from the ground up for automation. You can try to get the world on board to change how they build roads, or make the computers adapt to any kind of road.

I think the answer here speaks to the intentions of these companies. The focus is on having the AI act like a human would in order to cut humans out of the equation.

I think it's just another way of accessing anything that doesn't have a traditional API. Most humans interact with things through a web browser, with a keyboard and a mouse, so even places that don't have any sort of API can be supported. You can still use things that define tool use explicitly, but I think this is kind of becoming a general-purpose tool use of last resort?

The mouse and keyboard are definitely dying (very slowly) for everyday computing use.

And this kind of seems like an assistant for those.

ChatGPT voice and real-time video is really a beautiful computing experience. Same with Meta Ray Bans AI (if it could level up the real-time).

I'd like just a bulleted list of chats that I can ask to do stuff and come back to, versus watching it click things. E.g.: set up my Whole Foods cart for the week again, please.


> The mouse and keyboard are definitely dying (very slowly) for everyday computing use.

Not to be that guy, but where's the evidence for this? People have been telling us that voice interaction is the future for many, many years, and we're in the future now and it's not. When I look around -- comparing today to ten years ago -- I see more people typing and tapping, not fewer, and voice interactions are still relatively rare. Is it all happening in private? Are there any public metrics for this?



