
> Is it due to something inherent in PDF technology?

Exactly. PDF doesn't have instructions to say "render this paragraph of text in this box"; it has instructions to say "render each of these glyphs at each of these x,y coordinates".

It was never designed to have text extracted from it. So trying to turn it back into text involves a lot of heuristics and guesswork, like deciding how much separation between characters should count as a space.
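To make the guesswork concrete, here is a rough sketch of the kind of gap heuristic an extractor might use. Everything in it is an assumption for illustration: the `Glyph` record, the `glyphs_to_text` name, and the `space_ratio` and `line_tolerance` thresholds are made up, not taken from any real PDF library, and real tools use more elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned glyph."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; assumed to grow upward on the page
    width: float  # horizontal advance of the glyph


def glyphs_to_text(glyphs, space_ratio=0.3, line_tolerance=2.0):
    """Rebuild text from positioned glyphs using simple gap heuristics."""
    # Group glyphs into lines: walk from the top of the page down and
    # start a new line when the baseline moves more than line_tolerance.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within the line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # Horizontal gap between the end of one glyph and the next.
            gap = cur.x - (prev.x + prev.width)
            # Guess: a gap wider than a fraction of the previous glyph
            # is a space. The threshold is arbitrary and font-dependent.
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

Tight kerning, justified text, or multiple columns can all fool a threshold like this, which is why results vary so much between documents.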

A lot also depends on what software produced the PDF, which can make it easier or harder to extract the text.



My favorite is when they do bold by duplicating and slightly shifting the letters. Bboolldd. PDFs are hell.
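One way to cope with that, as a minimal sketch: drop any glyph that repeats the same character almost exactly on top of one already kept. The `Glyph` tuple, the `drop_fake_bold` name, and the `max_shift` threshold are hypothetical, chosen for illustration rather than taken from a real extractor.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical record for one positioned glyph."""
    char: str
    x: float
    y: float


def drop_fake_bold(glyphs, max_shift=1.0):
    """Remove glyphs that duplicate a kept glyph at nearly the same spot."""
    kept = []
    for g in glyphs:
        # A duplicate is the same character within max_shift units
        # of an already kept glyph, both horizontally and vertically.
        is_dup = any(
            k.char == g.char
            and abs(k.x - g.x) <= max_shift
            and abs(k.y - g.y) <= max_shift
            for k in kept
        )
        if not is_dup:
            kept.append(g)
    return kept
```

The threshold has to stay well below a normal glyph advance, otherwise genuine double letters such as the "ll" in "hello" would be collapsed too.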


That's inherited from the original Portable Document Format for machines - the typewriter instructions.


I've never looked into the PDF format, but does it not allow for annotations that say, "the glyphs in the rectangle ((x0, y0), (x1, y1)) represent the text 'foobar'"? That's been my mental model for how they are text-searchable.


They do, but such annotations are optional.



