
What are you using for server-side DOM manipulation? jsdom? apricot? node-xml? libxmljs? I spent much of the weekend working on a web crawler, but couldn't find an XML parser that didn't choke on the internet at large.

Any chance you'd consider open-sourcing?




I haven't seen their code, but I'm not sure why you'd need server-side DOM manipulation for this. I'd implement all of that in the browser and just let the server handle passing events back and forth.


Since they're proxying the page, they replace all <a>s with proxy <a>s, as well as adding a <script> and a <div> with some content to the bottom. Try viewing source on http://starcraft2destroyedmymarrage.no.de:3000/app/4
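To make the rewriting step concrete, here is a minimal sketch of that kind of proxy pass, written in Python with the standard library's lenient html.parser rather than any of the Node libraries mentioned in this thread. It is not their code: the /proxy?url= route, the injected markup, and the names PROXY_PREFIX, INJECTED, ProxyRewriter and rewrite are all hypothetical, chosen only for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote, urljoin

PROXY_PREFIX = "/proxy?url="  # hypothetical proxy route
# Hypothetical markup appended to every proxied page.
INJECTED = '<div id="proxy-banner">proxied</div><script src="/proxy.js"></script>'


class ProxyRewriter(HTMLParser):
    """Re-serialize HTML, pointing every <a href> at the proxy."""

    def __init__(self, base_url):
        # Keep character references as written so text passes through unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []
        self.injected = False

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                absolute = urljoin(self.base_url, value)
                value = PROXY_PREFIX + quote(absolute, safe="")
            # Attribute values arrive unescaped, so escape them again on output.
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        if tag == "body" and not self.injected:
            self.out.append(INJECTED)  # add the extra content at the bottom
            self.injected = True
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite(html, base_url):
    parser = ProxyRewriter(base_url)
    parser.feed(html)
    parser.close()
    if not parser.injected:
        # No </body> in the input (common in sloppy markup): append at the end.
        parser.out.append(INJECTED)
    return "".join(parser.out)
```

Calling rewrite(page_source, page_url) returns the page with its links routed through the proxy. Because html.parser does not require well-formed input, unclosed tags pass through as they are instead of aborting the parse.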


Do you have any examples of the kinds of things that libxmljs choked on?


I don't recall the URLs. I was recursively crawling from a user-provided seed URL and had trouble with apricot, node-xml, and libxmljs. I had the least time to play with libxmljs because of hassle with Joyent: I had to compile v8 and node to link against, and scons wasn't playing nice with the build environment.

--

Edit: I should add that I'm using libxmljs in production (http://newsbasis.com/news) as an RSS parser (streaming SAX push parser) and it works quite well for that sort of XML.
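For anyone unfamiliar with the push style: the caller feeds the parser chunks as they arrive, and the parser fires callbacks as elements open and close, so the whole feed never has to sit in memory. Below is a rough sketch of the idea using Python's standard xml.sax module, not libxmljs. TitleCollector and parse_chunks are made-up names, and collecting item titles is only an assumed example of what an RSS reader might pull out.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> that sits inside an <item>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_item = False
        self._buf = None  # list of text pieces while inside an item title

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._buf = []

    def characters(self, content):
        # Text may be delivered in several pieces, so accumulate it.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None
        elif name == "item":
            self._in_item = False


def parse_chunks(chunks):
    """Push chunks into an incremental SAX parser as they arrive."""
    parser = xml.sax.make_parser()
    handler = TitleCollector()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # callbacks fire as soon as enough input is seen
    parser.close()
    return handler.titles
```

This works well for feeds because they are usually well-formed XML. A strict parser like this raises an error on the first malformed construct, which is why it is a poor fit for arbitrary crawled pages.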



