
Thanks for this post — I'm doing something similar for a personal/hobby project (working with very old scanned PDFs in Sanskrit, etc.), and the bounding box next to "Sub-TOI" in your screenshot (https://www.sergey.fyi/images/bboxes/annotated-filing.webp) shows exactly what I'm encountering too: the model clearly “knows” that there is a box of a certain width and height, but the box is offset from its actual location. Do you have any insights into that kind of thing, and did anything you try fix it?


For what it's worth, I worked around this by first calling Google's older OCR API (Cloud Vision), which gives the coordinates and bounding boxes for each word in the document. Then I pass that whole response to Gemini along with the image, and it's finally able to give me much more reasonable bounding boxes. Clearly overkill, and it would be ridiculously expensive to do at scale, but it works for me.
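In case it helps anyone trying the same thing, here's a minimal sketch of that two-step flow, assuming the google-cloud-vision and google-generativeai Python clients. The model name, prompt wording, and helper function names are my own placeholders, not anything from the original comment.

```python
# Step 1: get per-word pixel bounding boxes from Cloud Vision OCR.
# Step 2: hand those boxes plus the page image to Gemini as grounding.
import json

import google.generativeai as genai
from google.cloud import vision


def word_boxes_from_vision(page_png: bytes) -> list[dict]:
    """Run Cloud Vision document OCR and return each word with its pixel vertices."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=page_png))
    words = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for para in block.paragraphs:
                for word in para.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    verts = [(v.x, v.y) for v in word.bounding_box.vertices]
                    words.append({"text": text, "vertices": verts})
    return words


def ask_gemini_with_boxes(page_png: bytes, words: list[dict], question: str) -> str:
    """Pass the image and the Vision word boxes to Gemini in one prompt."""
    genai.configure(api_key="YOUR_API_KEY")          # placeholder credential
    model = genai.GenerativeModel("gemini-1.5-pro")  # any vision-capable Gemini model
    prompt = (
        "Here are OCR word bounding boxes (pixel coordinates) for the attached page:\n"
        + json.dumps(words)
        + "\n\n"
        + question
    )
    response = model.generate_content(
        [{"mime_type": "image/png", "data": page_png}, prompt]
    )
    return response.text
```

The point of the extra step is that Cloud Vision's coordinates are already pixel-accurate, so Gemini only has to reason over the provided boxes rather than regress its own coordinates from the image tokens.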


I suspect this is a remnant of how images get tokenized; the simplest solution is probably to increase the buffer.


What buffer are you referring to / how do you increase it? And did that solution work for you (if you happened to try)?



