> like convert PDF bank statements into CSV transaction files
I've tried this recently and it's surprisingly difficult. Any pro-tips?
Extracting pdf tables, while respecting the cell position, seems almost impossible in a way that works in all cases (think borderless tables, whitespace cells, etc)
It is remarkably difficult and continues to provide a good example of the limitations of LLM based systems.
In my case, I used perl, and exploited the fact for for a given bank, the statements are consistently formatted. Further, PDF OCR conversion responds consistently to the documents with the same formatting. With this combination, it is possible to extract the characters and numbers that are associated with transactions from the document, and then to take those extracted bundles of text and transform them into lines for a CSV file.
The caveat is that it works for only that bank, that "kind" of account (usually checking, credit card, or savings), and when using that specific document OCR tool. Within those constraints it is eminently reliable but utterly non-transferable to a general case.
I've tried this recently and it's surprisingly difficult. Any pro-tips?
Extracting pdf tables, while respecting the cell position, seems almost impossible in a way that works in all cases (think borderless tables, whitespace cells, etc)