
There was a project out of MIT CSAIL back in 2006 that did automated extraction of tabular data from web pages, e.g. product lists on a store site. It recognized pagination and looked for a sequence of repeated DOM structures (and what varied among them) to identify the items. You might find it interesting:

https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90....
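To make the "repeated DOM structures" idea concrete, here is a rough sketch of how that kind of detection could work. It is not the paper's algorithm, just one simple way to do it, and it assumes well-formed HTML where every tag is explicitly closed. The names (`Node`, `TreeBuilder`, `extract_rows`, `min_repeats`) are made up for illustration, and pagination is not handled at all.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: tag name, children, and directly contained text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree. Assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of its children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def node_texts(node):
    """Text of every node in pre-order, so equal structures give aligned lists."""
    out = [" ".join(node.text)]
    for child in node.children:
        out.extend(node_texts(child))
    return out


def find_repeated_items(root, min_repeats=3):
    """Return the largest group of sibling nodes sharing one signature."""
    best = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best):
            best = [c for c in node.children if signature(c) == sig]
    return best


def extract_rows(html, min_repeats=3):
    """Keep only the text positions that vary between the repeated items."""
    builder = TreeBuilder()
    builder.feed(html)
    items = find_repeated_items(builder.root, min_repeats)
    columns = list(zip(*(node_texts(item) for item in items)))
    varying = [col for col in columns if len(set(col)) > 1]
    return [list(row) for row in zip(*varying)]


SAMPLE = """
<ul>
  <li><span>Widget</span><b>$5</b><a>Buy</a></li>
  <li><span>Gadget</span><b>$7</b><a>Buy</a></li>
  <li><span>Doohickey</span><b>$9</b><a>Buy</a></li>
</ul>
"""

print(extract_rows(SAMPLE))
```

On the sample list, the constant "Buy" link is dropped and only the name and price that differ between items are kept as columns. Real pages would need tolerance for near-matches (optional badges, void tags like `<br>`), which this exact-signature comparison does not attempt.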



"We propose that web sites can be similarly augmented with other sophisticated data-centric functionality, giving users new benefits over the existing Web." - gonna check this paper out!

Also reminds me of this amazing project that deals in structured data and tables: https://www.geoffreylitt.com/wildcard/



