> Unfortunately Gemini really seems to struggle on this, and no matter how we tr...

> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.

[1] https://qwenlm.github.io/blog/qwen2.5-vl/