For what it's worth, I worked around this by first calling Google's older OCR API (Cloud Vision), which gives the coordinates and bounding boxes for each word in the document. Then I pass this whole response to Gemini, and it's finally able to give me much more reasonable bounding boxes. It's clearly overkill, and would be ridiculously expensive to do at scale, but it works for me.
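
A minimal sketch of the glue step in that workaround, for anyone who wants to try it. It assumes the Cloud Vision result is already in hand as REST-style JSON (`textAnnotations` entries with `description` and `boundingPoly.vertices`), and it only builds the text you would send to Gemini alongside the image. The helper names `words_with_boxes` and `build_prompt`, the prompt wording, and the sample data are all made up for illustration; the actual Vision and Gemini calls are left out.

```python
import json


def words_with_boxes(vision_response: dict) -> list[dict]:
    """Flatten a Vision-style response into [{"text": ..., "box": [x0, y0, x1, y1]}]."""
    annotations = vision_response.get("textAnnotations", [])
    words = []
    # Assumption: the first annotation spans the full page text, so skip it.
    for ann in annotations[1:]:
        vertices = ann.get("boundingPoly", {}).get("vertices", [])
        if not vertices:
            continue
        # Assumption: a vertex may lack "x" or "y"; treat a missing value as 0.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        words.append(
            {"text": ann["description"], "box": [min(xs), min(ys), max(xs), max(ys)]}
        )
    return words


def build_prompt(words: list[dict], task: str) -> str:
    """Combine the task with the OCR words so the model can ground its boxes."""
    return (
        f"{task}\n\n"
        "OCR words with pixel bounding boxes [x_min, y_min, x_max, y_max]:\n"
        + json.dumps(words, ensure_ascii=False)
    )


# Hypothetical sample standing in for a real Vision response.
sample = {
    "textAnnotations": [
        {"description": "Total 42", "boundingPoly": {"vertices": [
            {"y": 20}, {"x": 90, "y": 20}, {"x": 90, "y": 40}, {"y": 40}]}},
        {"description": "Total", "boundingPoly": {"vertices": [
            {"y": 20}, {"x": 55, "y": 20}, {"x": 55, "y": 40}, {"y": 40}]}},
        {"description": "42", "boundingPoly": {"vertices": [
            {"x": 60, "y": 20}, {"x": 90, "y": 20}, {"x": 90, "y": 40}, {"x": 60, "y": 40}]}},
    ]
}

prompt = build_prompt(words_with_boxes(sample), "Return the bounding box of the total amount.")
```

Collapsing each polygon to a four-number box keeps the prompt short, which matters given the cost concern above; passing the whole raw response, as described, would work too but uses more tokens.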