Show HN: Scraperjs – A versatile web scraper (github.com/ruipgil)
192 points by ruipgil on Aug 18, 2014 | hide | past | favorite | 36 comments



It's unclear to me how to actually run this. Only executing the two commands listed under the Installing section does not run it - I had to `cd` into the scraperjs dir, then `npm install`, then continue with the second Install command (`grunt test`) to actually test.

Also, do you install scraperjs into each project directory you want to use it for? Or just install it once?


Scraperjs is supposed to be used as an npm package. So, if you do "npm install <package-name>", you download the latest version of the package into the same folder as the closest package.json file (if there's none, it will go to your ~/ folder). At that point you can just use it with "require('scraperjs')". The test part is a bit foggier, and I'll add more information to the README in due time. To run the tests you've got to npm-install with the save-dev flag ("npm install --save-dev scraperjs"), which will also add the package to your development dependencies; this is so that people who just want to use the package won't need to download all of scraperjs' development dependencies.

For more information about npm install: https://www.npmjs.org/doc/cli/npm-install.html
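Once installed, basic usage is just a require plus one of the scrapers. Here's a sketch based on the project's README of the time; treat the exact method names and signatures as assumptions, and note it won't run without the package installed:

```js
// Sketch only: assumes `npm install scraperjs` has been run and that the
// StaticScraper API matches the README; details may differ by version.
var scraperjs = require('scraperjs');

scraperjs.StaticScraper
    .create('https://news.ycombinator.com/')
    .scrape(function($) {
        // Cheerio-style selection over the downloaded HTML
        return $('.title a').map(function() {
            return $(this).text();
        }).get();
    }, function(titles) {
        console.log(titles);
    });
```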


It would be helpful if the documentation compared how Scraperjs is different from, or better than, CasperJS for scraping. CasperJS is the older and more well-known wrapper around PhantomJS so comparisons would help people decide what the appropriate tool would be.

http://casperjs.org/


I guess the biggest difference is that Casper isn't a Node.js module, so interaction with Node (and with npm's package archive) becomes hard. I am releasing something similar in a couple of weeks, but aimed at JS sandboxing instead of scraping. :)


For instances where you don't need a full-featured browser to get the data you need, Scraperjs with the Cheerio backend should be WAY faster than casper/phantom.

I've not used Scraperjs yet, but cheerio is pretty great.


I think you'd need CasperJS when you need to perform browser actions (log in, click a particular button, fill a form, etc.). But if you just want to scrape content (e.g. episode URLs of The Daily Show from Hulu), then ScraperJS should be enough (and faster?)


As far as I can see it doesn't have much in common with CasperJS, apart from the fact that it can use PhantomJS.


If you mean that the syntax is different, yes, I get that.

CasperJS can also scrape dynamic websites. What criteria would someone want to use ScraperJS instead of CasperJS for that task? Are there features in ScraperJS that don't exist in CasperJS? Does it take 10x less lines-of-code to accomplish the same task? Etc.


For web scraping purposes, ScraperJS and CasperJS would probably take about the same number of lines of code; however, ScraperJS has most of the tools you need for web scraping, something that CasperJS lacks (that's not its main goal). ScraperJS is also more flexible than CasperJS alone. If you want static content, just use the static scraper and get lightning-fast results. TL;DR: CasperJS is great, but it's not made for web scraping.


If you're interested in scraping in python, then I recommend giving this a read: http://jakeaustwick.me/python-web-scraping-resource/


I think Scrapy is a better Python scraping tool.


And I prefer to use R for scraping. There are so many ways to scrape, and it really comes down to personal preference. It is a good time to be alive :) So many good choices.


This is awesome. I am very new to scraping, so bear with me if this is very obvious.

Would it be possible to follow a list of URLs from a home page (Ex: List of Marathon Runners), and then follow the link in their name that goes to their stats page, and download / save the scraped data as JSON to a text file on the local machine's C:\Runners\Data\ folder for example?

Also, does anyone know of a reliable and tested C# / .NET / ASP.NET web page scraper?


On the second question: typically a web scraper just interacts with the output of a web server, so it shouldn't matter whether it's ASP.NET or any other system.


Mostly this. In ASP.NET/C# you're probably looking at using the built-in HttpClient lib [0] and an HTML parser lib like HtmlAgilityPack [1]. I've used this combo in the past and am happy with it.

[0] http://msdn.microsoft.com/en-us/library/system.net.http.http... [1] http://htmlagilitypack.codeplex.com/


Thanks!!


Since this runs on Node.js, you could use edge.js to use scraperjs in your .NET project.


On the first question, yes, it's possible and easily done with the Router.


Thank you!


Let us know if you'd like to integrate this with http://www.80legs.com!


If anyone is interested in just scraping links between webpages with JavaScript, I made Slinky (https://github.com/andrejewski/slinky). The API is simple and easily overridable.


Could someone recommend a similar but Ruby-based framework? Just asking because I'm more skilled in Ruby than in Node (not for trolling purposes).

I've been exploring GitHub but could not find a well-maintained framework (or at least one updated within the last month).


Check out Mechanize: https://github.com/sparklemotion/mechanize

I haven't used the Ruby version, but I am pretty happy with the Python port of it. It's lighter and faster than Phantom, but it won't do JavaScript interpretation.


I found a tutorial making it look pretty easy to do with Mechanize: http://readysteadycode.com/howto-scrape-websites-with-ruby-a...

Note, not tried this myself!


While I haven't tried any, I think if you want to handle dynamic Javascript content, you'd have to go with a JS library. Feel free to correct me if I'm wrong.


You can do it in pure Ruby with one of the WebKit wrappers (e.g. Poltergeist [0]).

[0] https://github.com/teampoltergeist/poltergeist


I've always used Mechanize + BeautifulSoup in Python.. I think Mechanize also has a Ruby lib...


Nice! Could've used that this weekend when I got caught in callback hell trying to build a simple NodeJS scraper. Ended up doing it in PHP just because I know it well.

I'll give it another go with this library next week!


I really like the router aspect of this. That's a nice idea and not (to the best of my limited memory) one I can recall seeing in any other scraper.


Artoo is soooo much better :) https://medialab.github.io/artoo/


I don't quite get the point of the DynamicScraper... Any real use cases for that?


For example, go to http://www.imdb.com

On the right, you'll notice that under the sidebar "Opening This Week" is a movie titled "Love Is Strange".

With that in mind, press Ctrl+U (view html source).

Try to search for the word "Strange" anywhere in the source. (It's not there.) If it's not there, how did it get shown on the screen?!

The answer is that it is "dynamically" loaded. A simple scraper that only works on a static download of html source won't be able to retrieve that string. You need web scrapers that can process dynamic pages (execute Javascript).

Btw, you'll notice that you can find the string "Strange" via F12 (Developer Tools). That's because the F12 inspector shows the html after the DOM has been dynamically modified by javascript whereas Ctrl+U does not.
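A toy illustration of the same point in plain JavaScript (the strings are made-up stand-ins for IMDb's markup, not its real source):

```javascript
// What a static scraper sees: the raw HTML source, where the sidebar is an
// empty container filled in later by a script the scraper never executes.
const rawHtml = '<ul id="openings"></ul><script src="sidebar.js"></script>';

// What a dynamic scraper (PhantomJS etc.) sees: the DOM after scripts ran.
const domAfterScripts = '<ul id="openings"><li>Love Is Strange</li></ul>';

console.log(rawHtml.includes('Love Is Strange'));         // false
console.log(domAfterScripts.includes('Love Is Strange')); // true
```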


The latter probably runs the script as though you are within the context of a web page (so full Ajax/JS support).

I assume the Simple version might be completely written in Node.js - so parses the HTML content, but no dynamic scripting.

The important thing to note is that with the Dynamic scraper, you can't use closures in your internal functions, as they won't get executed within your Node.js context, but in PhantomJS.

As for a use case, I do it for https://myshopdata.com to allow retailers to extract their product information with rich content and variation support (even if it's loaded by the user interacting with a dropdown on variations). It then allows you to publish this to marketplaces, while keeping the information in sync by monitoring.


I _think_ the latter interprets JavaScript while the former only allows you to read the rendered HTML?


If you want a scraper as a service, you can try: https://PhantomJsCloud.com

Disclaimer: I wrote it.


Looks pretty good, shame about the promises.



