Is there a OCR toolkit or a ML Model which is able to reliable extract tables fr...

CharlieDigital · 2025-01-10T14:58:40 1736521120

By far the best one I've come across is Microsoft Azure Document Intelligence with the Layout Model[0].

It's really, really good at tables.

You have to use the Layout Model and not just the base Document Intelligence.

A bit pricey, but if you're processing content one time and it's high value (my use case as clinical trial protocol documents and the trial will run anywhere from 6-24 months), then it's worth it, IMO.

[0] https://learn.microsoft.com/en-us/azure/ai-services/document...

benpacker · 2025-01-10T13:04:58 1736514298

All frontier multi modal LLMs can do this - there’s likely something lighter weight as well.

In my experience, the latest Gemini is best at vision and OCR

michaelt · 2025-01-10T14:48:05 1736520485

> All frontier multi modal LLMs can do this

There's reliable, and there's reliable. For example [1] is a conversation where I ask ChatGPT 4o questions about a seven-page tabular PDF from [2] which contains a list of election polling stations.

The results are simultaneously impressive and unimpressive. The document contains some repeated addresses, and the LLM correctly identifies all 11 of them... then says it found ten.

It gracefully deals with the PDF table, and converts the all-caps input data into Title Case.

The table is split across multiple pages, and the title row repeats each time. It deals with that easily.

It correctly finds all five schools mentioned.

When asked to extract an address that isn't in the document it correctly refuses, instead of hallucinating an answer.

When asked to count churches, "Bunyan Baptist Church" gets missed out. Of two church halls, only one gets counted.

The "Friends Meeting House" also doesn't get counted, but arguably that's not a church even if it is a place of worship.

Longmeadow Evangelical Church has one address, three rows and two polling station numbers. When asked how many polling stations are in the table, the LLM counts that as two. A reasonable person might have expected one, two, three, or a warning. If I was writing an invoice parser, I would want this to be very predictable.

So, it's a mixed bag. I've certainly seen worse attempts at parsing a PDF.

[1] https://chatgpt.com/share/67812ad9-f2bc-8011-96be-faea40e48d... [2] https://www.stevenage.gov.uk/documents/elections/2024-pcc-el...

numba888 · 2025-01-10T22:34:51 1736548491

You can try to ask it to list all churches and assign them incremental number starting with 1. then print the last number. It's a variation of counting 'r' in 'raspberry' which works better than simple direct question.

dragonwriter · 2025-01-10T19:32:36 1736537556

> There's reliable, and there's reliable. For example [1] is a conversation where I ask ChatGPT 4o questions about a seven-page tabular PDF from [2] which contains a list of election polling stations.

From your description, it does perfectly at the task asked about upthread (extraction) and has mixed results on other, question-answering, tasks, that weren't the subject.

michaelt · 2025-01-10T21:00:32 1736542832

> From your description, it does perfectly at the task asked about upthread (extraction) and has mixed results on other, question-answering, tasks, that weren't the subject.

¯\_(ツ)_/¯

Which do you think was which?

NeedMoreTime4Me · 2025-01-10T18:24:16 1736533456

Do I understand correctly that nearly all issues were related to counting (i.e. numerical operations)? that makes it still impressive because you can do that client-side with the structured data

michaelt · 2025-01-10T21:29:44 1736544584

Some would say the numerical information is among the most important parts of an invoice.

philomath_mn · 2025-01-10T15:20:50 1736522450

I wonder if performance would improve if you asked it to create csvs from the tables first, then fed the CSVs in to a new chat?

ttt3ts · 2025-01-10T15:52:53 1736524373

https://github.com/microsoft/table-transformer

This is much lighter weight and more reliable than vllm

serjester · 2025-01-10T16:35:50 1736526950

As someone that spent quite a bit of time with table-transformers, I would definitely not recommend it. It was one of the first libraries we added for parsing tables into our chunking library [1] and the results were very underwhelming. This was a while back and at this point, it's just so much easier to use an LLM end to end for parsing docs (Gemini Flash can parse 20k pages per dollar) and I'm wary of any approach that stitches together different models.

[1] https://github.com/Filimoa/open-parse/

ttt3ts · 2025-01-11T18:05:54 1736618754

Do you have some benchmark results I can look at that compares results?

jonathan-adly · 2025-01-10T20:52:09 1736542329

I would like to through our project in the ring. We use ColQwen2 over a ColPali implementation. Basically, search & extract pipeline: https://docs.colivara.com/guide/markdown

m_ke · 2025-01-11T13:59:06 1736603946

Surya is a great open source toolkit for table parsing, layout analysis and OCR: https://github.com/VikParuchuri/surya