
I'm curious whether a multimodal model would be better at the OCR step than Tesseract. It would probably increase the cost, but I wonder if that would be offset by needing less post-processing.


I have seen excellent performance with Florence-2 for OCR. I wrote https://blog.roboflow.com/florence-2-ocr/, which shows a few examples.

Florence-2 is < 2GB so it fits into RAM well, and it is MIT licensed!

On a T4 in Colab, you can run inference in < 1s per image.
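
If anyone wants to try it, the pattern is roughly the one from the model card via transformers. A minimal sketch (the checkpoint, dtype, and file name here are just examples, and it assumes a CUDA GPU like that T4):

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships custom modelling code, hence trust_remote_code=True
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True
    ).to("cuda")
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

    image = Image.open("scan.png").convert("RGB")  # example file
    inputs = processor(text="<OCR>", images=image, return_tensors="pt").to("cuda", torch.float16)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation strips the task tokens and returns a dict keyed by the task
    result = processor.post_process_generation(raw, task="<OCR>", image_size=(image.width, image.height))
    print(result)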


This looks good, I will investigate integrating it into my project. Thanks!


I couldn't find any comparisons with Microsoft's TrOCR model. I guess they are for different purposes. But since you used Florence-2 for OCR, did you compare the two?
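
For context, the TrOCR side of such a comparison would look roughly like this (a sketch; the checkpoint and file name are just examples, and note that TrOCR expects cropped single text lines rather than full pages, so it usually needs a separate text detector):

    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    line = Image.open("text_line.png").convert("RGB")  # one cropped line of text
    pixel_values = processor(images=line, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])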


This is pretty cool. When I was checking how Microsoft's models (at the time) stacked up against Donut, I chose Donut; I didn't know they had published more models!


I don't want to jump to conclusions, but I don't feel confident using GPT-4o/Claude for OCR, as I often run into the issues mentioned on this page: https://github.com/Yuliang-Liu/MultimodalOCR

[edit] But this doesn't apply to OCR-specialised models like Florence-2.


IME GPT-4V is a lot better than Tesseract, including on scanned document PDFs. The thing about frontier models is that they aren't free, but they keep getting better too. I'm not using Tesseract for anything anymore; for my tasks it's obsolete.
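
For what it's worth, the call is roughly this with the current OpenAI Python SDK (a sketch; the model name, prompt, and file name are just examples):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("scan.png", "rb") as f:  # example file
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image verbatim, preserving line breaks."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)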


Well, unless you care about the privacy of your documents.


My experience is that at least the models which are price-competitive (~= open weight and small enough to run on a 3090/4090: MiniCPM-V, Phi-3-V, Kosmos-2.5) are not as good as Tesseract or EasyOCR. They're often more accurate on plain text, where their language knowledge is useful, but on symbols, numbers, and weird formatting they're at best even. Sometimes they go completely off the rails when they see a dashed line, handwriting, or an image, things which the conventional OCR tools can ignore or at least recover from.
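
For reference, the conventional baselines I'm comparing against are just the stock APIs. A sketch (the file name is an example, and it assumes both Python packages plus the tesseract binary are installed):

    from PIL import Image
    import pytesseract  # wrapper around the tesseract CLI
    import easyocr

    image_path = "invoice.png"  # example file

    # Tesseract returns the page text as a single string
    tesseract_text = pytesseract.image_to_string(Image.open(image_path))

    # EasyOCR returns (bounding box, text, confidence) per detected region
    reader = easyocr.Reader(["en"], gpu=True)
    easyocr_text = "\n".join(text for _box, text, _conf in reader.readtext(image_path))

    print("tesseract:\n", tesseract_text)
    print("easyocr:\n", easyocr_text)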


Did you test the MiniCPM-V (2.6) released last week? It was able to extract (and label) most of the complex examples I gave it on their Hugging Face space:

https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6


I found Claude 3 great at reading documents. Plus, it can describe figures. The only issue I ran into was giving it a 2-column article: if reading the first line of each column together "kinda made sense", it would treat the entire thing as one column.
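
For the 2-column case, one thing worth trying is spelling the reading order out in the prompt. The call looks roughly like this with the Anthropic Python SDK (a sketch; the model ID, prompt, and file name are just examples):

    import base64
    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    with open("article_page.png", "rb") as f:  # example file
        b64 = base64.b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                {"type": "text", "text": "This article has two columns. Transcribe the left column top to bottom, then the right column."},
            ],
        }],
    )
    print(message.content[0].text)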



