> Is it due to something inherent in PDF technology?
Exactly. PDF doesn't have instructions to say "render this paragraph of text in this box", it has instructions to say "render each of these glyphs at each of these x,y coordinates".
It was never designed to have text extracted from it. So turning it back into text involves a lot of heuristics and guesswork, such as deciding how much separation between characters should count as a space.
A lot also depends on what software produced the PDF, which can make it easier or harder to extract the text.
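To make the guesswork concrete, here's a rough sketch of the kind of spacing heuristic described above. It is not how any particular extractor works: the `Glyph` record, the `glyphs_to_text` function, and the `space_ratio` and `line_tolerance` thresholds are all made up for illustration, and it assumes the glyph positions and widths have already been pulled out of the page.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph (hypothetical units)
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # horizontal advance of the glyph


def glyphs_to_text(glyphs, space_ratio=0.3, line_tolerance=2.0):
    """Guess lines and spaces from glyph positions alone."""
    # Group glyphs into lines: walk from top to bottom and start a new
    # line whenever the baseline moves by more than line_tolerance.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(g.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within a line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # The guess: a gap wider than some fraction of the previous
            # glyph's width is treated as a word space.
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs arrive in arbitrary order, as they can in a real content stream.
page = [
    Glyph("k", 6, 88, 6), Glyph("H", 0, 100, 6), Glyph("o", 20, 100, 6),
    Glyph("i", 6, 100, 3), Glyph("y", 14, 100, 6), Glyph("o", 0, 88, 6),
]
print(glyphs_to_text(page))
```

Even in this toy version the thresholds are arbitrary. Tight kerning, justified text, or multiple columns would each break it in a different way, which is why the producing software matters so much.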
I've never looked into the PDF format, but does it not allow for annotations that say, "the glyphs in the rectangle ((x0, y0), (x1, y1)) represent the text 'foobar'"? That's been my mental model for how they are text-searchable.