Thanks man, starred yours too, it's super cool to see all these projects getting spun up!
I see Cerebellum is vision only. Did you try adding HTML + screenshot? I think that improves performance a lot, and it means you aren't limited to Claude.
Just saw Skyvern today on previous Show HNs haha :)
I had an older version that used simplified HTML, and it got to decent performance with GPT-4o and Gemini, but at the cost of 10x the token usage. You're right that identifying the interactable elements and pulling their values into a prompt structure that explicitly lists the allowed next actions can boost performance, especially when paired with grammar-constrained generation such as structured outputs or guidance-llm. However, I saw that Claude reached similar levels of performance with pure vision, and I felt that vision + more training would eventually beat a specialized DOM algorithm, per "the bitter lesson".
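For anyone curious what the DOM-side approach looks like, here is a rough sketch in Python. It is not how Cerebellum or Skyvern actually do it. The tag list, the field names, the `click`/`type` action set, and the helper names (`collect_interactables`, `action_schema`, `build_prompt`) are all made up for illustration. It pulls interactable elements out of the HTML, numbers them, and builds both a compact prompt and a JSON Schema whose `element_id` enum only allows elements that exist on the page. That schema is the kind of thing you would hand to a structured-output or grammar-constrained decoding API.

```python
from html.parser import HTMLParser

# Assumption: a real agent would also handle ARIA roles, visibility, etc.
INTERACTABLE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractableCollector(HTMLParser):
    """Collects interactable elements and assigns each a small integer id."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element whose inner text is being collected

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTABLE_TAGS:
            return
        attr = dict(attrs)
        element = {
            "id": len(self.elements),
            "tag": tag,
            "value": attr.get("value"),
            "label": attr.get("aria-label")
            or attr.get("placeholder")
            or attr.get("name")
            or "",
        }
        self.elements.append(element)
        # <input> is a void element, so it has no inner text to collect.
        self._open = None if tag == "input" else element

    def handle_data(self, data):
        text = data.strip()
        if self._open is not None and text:
            self._open["label"] = (self._open["label"] + " " + text).strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None


def collect_interactables(html):
    collector = InteractableCollector()
    collector.feed(html)
    return collector.elements


def action_schema(elements):
    """JSON Schema that only permits actions on elements found on the page."""
    return {
        "type": "object",
        "properties": {
            "action": {"enum": ["click", "type"]},
            "element_id": {"enum": [e["id"] for e in elements]},
            "text": {"type": "string"},
        },
        "required": ["action", "element_id"],
        "additionalProperties": False,
    }


def build_prompt(goal, elements):
    """Compact text listing of the elements, instead of the full HTML."""
    lines = [f"Goal: {goal}", "Interactable elements:"]
    for e in elements:
        value = e["value"] or ""
        lines.append(f'[{e["id"]}] <{e["tag"]}> label="{e["label"]}" value="{value}"')
    lines.append("Reply with JSON matching the provided schema.")
    return "\n".join(lines)


if __name__ == "__main__":
    page = '<form><input name="q" value="cats"><button>Search</button></form>'
    found = collect_interactables(page)
    print(build_prompt("Search for dogs", found))
    print(action_schema(found))
```

The prompt only carries the numbered element list, which is where the token savings over raw HTML would come from, and the enum means the model cannot name an element that is not there.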
BTW I really like your handling of browser tabs, I think it's really clever.
* Cerebellum (TypeScript): https://github.com/theredsix/cerebellum
* Skyvern: https://github.com/Skyvern-AI/skyvern
Disclaimer: I am the author of Cerebellum