
What's the simple explanation for why these VLM OCRs hallucinate but previous generations of OCR don't?



Traditional OCR systems usually have a detection + recognition pipeline: they detect every word and then predict the text for each detected word. Errors can happen in both stages, e.g. a word that isn't detected gets dropped from the output, or a word is recognized incorrectly, which is more comparable to a hallucination. However, since the recognizer is trained to work only on a small patch, accuracy is often higher. VLMs, by contrast, look at the entire image/context and auto-regressively generate tokens, which also carries a lot of language-model bias, hence hallucinations.
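To make the contrast concrete, here is a minimal sketch of the two pipelines. detect_words, recognize_crop, and vlm_next_token are hypothetical placeholders standing in for the real components, not any specific library's API:

    def detect_words(image):
        # placeholder: would return word bounding boxes, e.g. [(x, y, w, h), ...]
        return []

    def recognize_crop(image, box):
        # placeholder: would read the characters inside one small crop
        return ""

    def vlm_next_token(image, tokens):
        # placeholder: would sample the next token given the whole page and
        # the text generated so far, i.e. p(token | image, previous tokens)
        return "<eos>"

    def traditional_ocr(image):
        # Stage 1: detect every word. Stage 2: recognize each crop in isolation.
        # A detection miss drops text; a recognition error misreads one word.
        # Neither stage can emit a word that has no corresponding image region.
        boxes = detect_words(image)
        return [recognize_crop(image, box) for box in boxes]

    def vlm_ocr(image, max_tokens=1024):
        # The whole page is one context and the transcript is decoded
        # auto-regressively, so the language-model prior can continue a line
        # with fluent text that is not actually on the page (hallucination).
        tokens = []
        for _ in range(max_tokens):
            token = vlm_next_token(image, tokens)
            if token == "<eos>":
                break
            tokens.append(token)
        return "".join(tokens)

The key structural difference: the traditional pipeline's output is bounded by what the detector found, while the VLM's output is only bounded by its decoding loop.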





