I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.
I don't want to use vision and LLMs for every page. I just want to use vision and LLMs to figure out what elements need to be clicked once. Or maybe every time the site changes the frontend.
The AI would be a compiler that generates the traditional scraper / integration test.
It would save all the time spent manually going through every page and figuring out what went wrong when an input string doesn't end up in the right input field or a button in a modal window doesn't get clicked.
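To make the "compiler" idea concrete, here is a minimal sketch. It assumes the vision/LLM pass has already produced a list of `(kind, selector, value)` steps, and it only emits the source of a plain script. `compile_script` and the step format are hypothetical names for illustration, not anything Skyvern provides; the emitted script targets Playwright's sync API (`page.goto`, `page.click`, `page.fill`).

```python
from typing import List, Optional, Tuple

# Hypothetical step format: (kind, selector, value), where kind is "click" or "fill".
Step = Tuple[str, str, Optional[str]]


def compile_script(url: str, steps: List[Step]) -> str:
    """Turn discovered steps into the source of a standalone Playwright script."""
    lines = [
        "from playwright.sync_api import sync_playwright",
        "",
        "with sync_playwright() as p:",
        "    browser = p.chromium.launch()",
        "    page = browser.new_page()",
        f"    page.goto({url!r})",
    ]
    for kind, selector, value in steps:
        if kind == "click":
            lines.append(f"    page.click({selector!r})")
        elif kind == "fill":
            lines.append(f"    page.fill({selector!r}, {value!r})")
        else:
            raise ValueError(f"unknown step kind: {kind}")
    lines.append("    browser.close()")
    return "\n".join(lines) + "\n"


# Example: the selectors below are made up for illustration.
source = compile_script(
    "https://example.com",
    [("click", "#login", None), ("fill", "#email", "me@example.com")],
)
print(source)
```

The output is ordinary code you can commit and run in CI with no model in the loop; the expensive vision/LLM step only runs again when the script stops working.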
I haven't checked the code, but there are a few good ways to specify what you want:
* browser extension that lets you record a few actions
* describing what you want to do with text
* a url with one or two lines of desired JSON to extract
No, that's something completely different than what bravura is talking about, which is why he made a comment to say explicitly that he still thinks you're missing the point.
From your roadmap:
> Prompt Caching - Introduce a caching layer to the LLM calls to dramatically reduce the cost of running Skyvern (memorize past actions and repeat them!)
Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes, at which point they use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.
We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization, instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: The generated output would just be a list of actions and their associated data source.
For example, if Skyvern was asked to log in to a website and do a search for product X, the generated action plan would include:
1. Click the log in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search
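As a rough sketch, the plan above could be stored as plain data, with each action carrying an optional reference to its data source. The `Action` class, its field names, and `resolve` are assumptions made up for this example, not Skyvern's actual format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Action:
    kind: str                     # "click" or "input"
    target: str                   # human-readable description of the element
    source: Optional[str] = None  # name of the data source; None for clicks


# The eight steps from the list above, with sources X, Y and Z kept symbolic.
PLAN: List[Action] = [
    Action("click", "log in button"),
    Action("click", "sign in with email"),
    Action("input", "email address field", source="X"),
    Action("input", "password field", source="Y"),
    Action("click", "log in"),
    Action("click", "search bar"),
    Action("input", "search bar", source="Z"),
    Action("click", "Search"),
]


def resolve(plan: List[Action], sources: Dict[str, str]) -> List[Tuple[Action, Optional[str]]]:
    """Pair each action with the concrete value it needs at run time, if any."""
    return [(a, sources[a.source] if a.source else None) for a in plan]
```

Keeping the data sources separate from the actions means the same memorized plan can be replayed with different credentials or search terms.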
Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:
1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
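Option 2 can be sketched as a runner that trusts the memorized selectors and only asks for a new one when a step breaks. Everything here is a stand-in: the page is modeled as the set of selectors that currently exist, and `relocate` is a hypothetical callback where the vision/LLM call would go.

```python
from typing import Callable, Dict, List, Set


def run_with_repair(
    steps: List[str],
    selectors: Dict[str, str],
    page: Set[str],
    relocate: Callable[[str], str],
) -> List[str]:
    """Replay steps with cached selectors, re-generating only the broken ones.

    Returns the steps whose selectors had to be re-generated.
    """
    repaired = []
    for step in steps:
        if selectors[step] not in page:
            # Cached selector no longer matches: re-generate just this step.
            selectors[step] = relocate(step)
            repaired.append(step)
            if selectors[step] not in page:
                raise LookupError(f"could not re-locate element for step: {step}")
        # A real runner would perform the click or input here.
    return repaired


# Illustrative data: the log-in button's XPath changed, the search bar did not.
page = {"//button[@id='signin']", "//input[@name='q']"}
selectors = {
    "click log in": "//button[@id='login']",
    "click search bar": "//input[@name='q']",
}
repaired = run_with_repair(
    list(selectors), selectors, page, lambda step: "//button[@id='signin']"
)
```

Option 1 is the same loop with a coarser fallback: on the first broken step, throw away `selectors` and rebuild the whole plan. The per-step version makes fewer model calls, at the risk that a later step silently depends on the part of the layout that changed.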