
I'm curious whether a multimodal model would be better at the OCR step than Tesseract. It would probably increase the cost, but I wonder if that would be offset by needing less post-processing.


I have seen excellent performance with Florence-2 for OCR. I wrote https://blog.roboflow.com/florence-2-ocr/, which shows a few examples.

Florence-2 is < 2GB so it fits into RAM well, and it is MIT licensed!

On a T4 in Colab, you can run inference in < 1s per image.
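
If anyone wants to try it, the pattern is roughly the one from the model card via transformers. A minimal sketch (the checkpoint, dtype, and file name here are just examples, and it assumes a CUDA GPU like that T4):

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships custom modelling code, hence trust_remote_code=True
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True
    ).to("cuda")
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

    image = Image.open("scan.png").convert("RGB")  # example file
    inputs = processor(text="<OCR>", images=image, return_tensors="pt").to("cuda", torch.float16)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation strips the task tokens and returns a dict keyed by the task
    result = processor.post_process_generation(raw, task="<OCR>", image_size=(image.width, image.height))
    print(result)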


This looks good, I will investigate integrating it into my project. Thanks!


I couldn't find any comparisons with Microsoft's TrOCR model. I guess they are for different purposes. But since you used Florence-2 for OCR, did you compare the two?
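
For context, the TrOCR side of such a comparison would look roughly like this (a sketch; the checkpoint and file name are just examples, and note that TrOCR expects cropped single text lines rather than full pages, so it usually needs a separate text detector):

    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    line = Image.open("text_line.png").convert("RGB")  # one cropped line of text
    pixel_values = processor(images=line, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])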


This is pretty cool. When I was checking how Microsoft's models (at the time) stacked up against Donut, I chose Donut; I didn't know they had published more models!


I don't want to jump to conclusions, but I don't feel confident using GPT-4o/Claude for OCR, as I often run into the issues mentioned on this page: https://github.com/Yuliang-Liu/MultimodalOCR

[edit] But this doesn't apply to OCR-specialised models like Florence-2.


IME GPT-4V is a lot better than Tesseract, including on scanned document PDFs. The thing about frontier models is that they aren't free, but they keep getting better too. I'm not using Tesseract for anything anymore; for my tasks it's obsolete.
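
For what it's worth, the call is roughly this with the current OpenAI Python SDK (a sketch; the model name, prompt, and file name are just examples):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("scan.png", "rb") as f:  # example file
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image verbatim, preserving line breaks."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)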


Well, unless you care about the privacy of your documents.


My experience is that at least the models which are price-competitive (~= open weight and small enough to run on a 3090/4090: MiniCPM-V, Phi-3-V, Kosmos-2.5) are not as good as Tesseract or EasyOCR. They're often more accurate on plain text, where their language knowledge is useful, but on symbols, numbers, and weird formatting they're at best even. Sometimes they go completely off the rails when they see a dashed line, handwriting, or an image, things which the conventional OCR tools can ignore or at least recover from.
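
For reference, the conventional baselines I'm comparing against are just the stock APIs. A sketch (the file name is an example, and it assumes both Python packages plus the tesseract binary are installed):

    from PIL import Image
    import pytesseract  # wrapper around the tesseract CLI
    import easyocr

    image_path = "invoice.png"  # example file

    # Tesseract returns the page text as a single string
    tesseract_text = pytesseract.image_to_string(Image.open(image_path))

    # EasyOCR returns (bounding box, text, confidence) per detected region
    reader = easyocr.Reader(["en"], gpu=True)
    easyocr_text = "\n".join(text for _box, text, _conf in reader.readtext(image_path))

    print("tesseract:\n", tesseract_text)
    print("easyocr:\n", easyocr_text)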


Did you test the MiniCPM-V (2.6) released last week? It was able to extract (and label) most of the complex examples I gave it on their Hugging Face space:

https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6


I found Claude 3 great at reading documents. Plus, it can describe figures. The only issue I ran into was giving it a 2-column article: if reading the first line of each column together "kinda made sense", it would treat the entire thing as one column.
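
For the 2-column case, one thing worth trying is spelling the reading order out in the prompt. The call looks roughly like this with the Anthropic Python SDK (a sketch; the model ID, prompt, and file name are just examples):

    import base64
    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    with open("article_page.png", "rb") as f:  # example file
        b64 = base64.b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                {"type": "text", "text": "This article has two columns. Transcribe the left column top to bottom, then the right column."},
            ],
        }],
    )
    print(message.content[0].text)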



