
So an invoice attached to an email as a PDF is sent digitally... Those unfamiliar with PDF will assume text and data extraction is therefore trivial, but it isn't. You can have a fully digital, non-image, vector-based PDF that shows what looks like text but doesn't contain a single piece of extractable text. It all depends on how the PDF was generated. Tables can be formatted in a million ways, etc.

Your best bet is to always convert it to an image and OCR it to extract structured data.




This is simply not true. Maybe it's easier and you don't need 100% precision. But it is actually possible to extract the text and layout of digital PDFs; otherwise it would be impossible to display them. Of course, some people still add image fragments to a PDF, but that practice is basically dying. In the last year I did not see a single PDF where it was impossible to extract the layout.
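The two positions in this thread can be combined: try the text layer first and fall back to OCR only when nothing usable comes out. Below is a minimal sketch of that idea, not anyone's actual pipeline. The names `usable_text` and `extract_page`, the `min_chars` threshold, and the two callables are all hypothetical; `extract_text` stands in for whatever PDF library's text extractor you use, and `ocr` for whatever OCR engine you render the page into.

```python
from typing import Any, Callable, Optional, Tuple


def usable_text(text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic (an assumption, tune per corpus): the text layer counts as
    usable if it holds at least `min_chars` alphanumeric characters."""
    if not text:
        return False
    return sum(1 for ch in text if ch.isalnum()) >= min_chars


def extract_page(
    page: Any,
    extract_text: Callable[[Any], Optional[str]],
    ocr: Callable[[Any], str],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Return (text, source), where source is 'text-layer' or 'ocr'."""
    text = extract_text(page)
    if usable_text(text, min_chars):
        return text, "text-layer"
    # Text layer is empty or near-empty (e.g. glyphs stored as vector
    # outlines or as an image), so fall back to OCR for this page.
    return ocr(page), "ocr"
```

In use, you would pass your own extractor and OCR functions and call `extract_page` once per page. The `source` value makes it easy to measure how often the fallback actually fires on your own documents, which is the empirical question the two comments disagree on.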



