
Well, sure - what's he expecting? PDF is a low-level, mostly unstructured display format. Converting that to fully semantic markup that recognizes all aspects of high-level document structure is probably an AI-complete problem.

For those of you not familiar with the gory details of PDF, it basically uses absolute positioning for each character. If we converted that directly into HTML it would be a disaster. So, we actually extract quite a bit of structure on top of that, recognizing spaces, lines, columns, and paragraphs, which enables us to write much cleaner HTML. Scribd (and most PDF readers) does this with heuristic algorithms that make reasonable guesses.
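To make the "heuristic guesses" concrete, here is a minimal sketch of the kind of algorithm such a converter might use: group absolutely positioned glyphs into lines by baseline, then insert spaces where the horizontal gap between glyphs is large relative to the glyph width. The `Glyph` type and the threshold values are illustrative assumptions, not Scribd's actual code.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position
    width: float

def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.3):
    """Group glyphs into lines by baseline proximity, then insert a
    space wherever the horizontal gap exceeds a fraction of the
    line's average glyph width. Thresholds are illustrative."""
    # Sort top-to-bottom, then left-to-right.
    glyphs = sorted(glyphs, key=lambda g: (g.y, g.x))
    lines, current, last_y = [], [], None
    for g in glyphs:
        # A jump in baseline larger than line_tol starts a new line.
        if last_y is not None and abs(g.y - last_y) > line_tol:
            lines.append(current)
            current = []
        current.append(g)
        last_y = g.y
    if current:
        lines.append(current)

    out = []
    for line in lines:
        avg_w = sum(g.width for g in line) / len(line)
        text, prev = "", None
        for g in line:
            # A wide gap between consecutive glyphs becomes a space.
            if prev is not None and g.x - (prev.x + prev.width) > space_ratio * avg_w:
                text += " "
            text += g.char
            prev = g
        out.append(text)
    return "\n".join(out)
```

Real converters layer column and paragraph detection on top of this, but the core idea is the same: the PDF gives you positioned characters, and everything else is inference.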

In the future, we'd like to push those algorithms further, and extract ever more semantic markup. But this is a "nice to have" for us - mostly, people just want the documents to display correctly and load quickly. And, anyway, expecting the output of an automated converter to match what a human would write shows a basic ignorance of the state of computers and AI.




In my opinion, web-based documents should have a certain level of semantic purity. Of course there will be technical difficulties in achieving it, but that is what hackers are for. Right?

While you have the right to justify the lack of semantic markup in Scribd HTML docs, he has the right to expect non-messy markup underneath.

Calling someone's expectation "ignorance" isn't very fair.


The author of the article is being ignorant by not understanding the basic differences between PDF and HTML before going off on a rant about it.




