
Thanks for this post — I'm doing something similar for a personal/hobby project (working with very old scanned PDFs in Sanskrit, etc.), and the bounding box next to "Sub-TOI" in your screenshot (https://www.sergey.fyi/images/bboxes/annotated-filing.webp) shows exactly what I'm encountering too: the model clearly “knows” that there is a box of a certain width and height, but the box is offset from its actual location. Do you have any insights into that kind of thing, and did anything you try fix it?


For what it's worth, I worked around this by first calling Google's older OCR API (Cloud Vision), which gives the coordinates and bounding boxes for each word in the document. Then I pass that whole response to Gemini along with the image, and it's finally able to give me much more reasonable bounding boxes. Clearly overkill, and it would be ridiculously expensive to do at scale, but it works for me.
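In case it helps anyone trying the same thing, here's a minimal sketch of that two-step flow, assuming the google-cloud-vision and google-generativeai Python clients. The model name, prompt wording, and helper function names are my own placeholders, not anything from the original comment.

```python
# Step 1: get per-word pixel bounding boxes from Cloud Vision OCR.
# Step 2: hand those boxes plus the page image to Gemini as grounding.
import json

import google.generativeai as genai
from google.cloud import vision


def word_boxes_from_vision(page_png: bytes) -> list[dict]:
    """Run Cloud Vision document OCR and return each word with its pixel vertices."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=page_png))
    words = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for para in block.paragraphs:
                for word in para.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    verts = [(v.x, v.y) for v in word.bounding_box.vertices]
                    words.append({"text": text, "vertices": verts})
    return words


def ask_gemini_with_boxes(page_png: bytes, words: list[dict], question: str) -> str:
    """Pass the image and the Vision word boxes to Gemini in one prompt."""
    genai.configure(api_key="YOUR_API_KEY")          # placeholder credential
    model = genai.GenerativeModel("gemini-1.5-pro")  # any vision-capable Gemini model
    prompt = (
        "Here are OCR word bounding boxes (pixel coordinates) for the attached page:\n"
        + json.dumps(words)
        + "\n\n"
        + question
    )
    response = model.generate_content(
        [{"mime_type": "image/png", "data": page_png}, prompt]
    )
    return response.text
```

The point of the extra step is that Cloud Vision's coordinates are already pixel-accurate, so Gemini only has to reason over the provided boxes rather than regress its own coordinates from the image tokens.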


I suspect this is a remnant of how images get tokenized; the simplest solution is probably to increase the buffer.


What buffer are you referring to / how do you increase it? And did that solution work for you (if you happened to try)?



