Show HN: Scraperjs – A versatile web scraper (github.com/ruipgil)
192 points by ruipgil on Aug 18, 2014 | hide | past | favorite | 36 comments



It's unclear to me how to actually run this. Only executing the two commands listed under the Installing section does not run it - I had to `cd` into the scraperjs dir, then `npm install`, then continue with the second Install command (`grunt test`) to actually test.

Also, do you install scraperjs into each project directory you want to use it for? Or just install it once?


Scraperjs is supposed to be used as an npm package. So, if you do "npm install <package-name>", you download the latest version of the package into the same folder as the closest package.json file (if there's none, it will go to your ~/ folder). At that point you can just use it with "require('scraperjs')". The test part is a bit foggier, and I'll add more information to the README in due time. To run the tests you've got to npm-install with the save-dev flag ("npm install --save-dev scraperjs"), which will also add the package to your development dependencies; this is so that people who just want to use the package won't need to download all of scraperjs' development dependencies.

For more information about npm install: https://www.npmjs.org/doc/cli/npm-install.html
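Once installed, basic usage is just a require plus one of the scrapers. Here's a sketch based on the project's README of the time; treat the exact method names and signatures as assumptions, and note it won't run without the package installed:

```js
// Sketch only: assumes `npm install scraperjs` has been run and that the
// StaticScraper API matches the README; details may differ by version.
var scraperjs = require('scraperjs');

scraperjs.StaticScraper
    .create('https://news.ycombinator.com/')
    .scrape(function($) {
        // Cheerio-style selection over the downloaded HTML
        return $('.title a').map(function() {
            return $(this).text();
        }).get();
    }, function(titles) {
        console.log(titles);
    });
```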


It would be helpful if the documentation compared how Scraperjs is different from, or better than, CasperJS for scraping. CasperJS is the older and more well-known wrapper around PhantomJS so comparisons would help people decide what the appropriate tool would be.

http://casperjs.org/


I guess the biggest difference is that Casper isn't a Node.js module, so interaction with Node (and with npm's package archive) becomes hard. I am releasing something similar in a couple of weeks, but aimed at JS sandboxing instead of scraping. :)


For instances where you don't need a full-featured browser to get the data you need, Scraperjs with the Cheerio backend should be WAY faster than casper/phantom.

I've not used Scraperjs yet, but cheerio is pretty great.


I think you'd need CasperJS when you need to perform browser actions (log in, click a particular button, fill a form, etc.). But if you just want to scrape content (e.g. episode URLs of The Daily Show from Hulu), then ScraperJS should be enough (and faster?)


As far as I can see it doesn't have much in common with CasperJS, apart from the fact that it can use PhantomJS.


If you mean that the syntax is different, yes, I get that.

CasperJS can also scrape dynamic websites. What criteria would someone want to use ScraperJS instead of CasperJS for that task? Are there features in ScraperJS that don't exist in CasperJS? Does it take 10x less lines-of-code to accomplish the same task? Etc.


For web scraping purposes, ScraperJS and CasperJS would probably take about the same number of lines of code; however, ScraperJS has most of the tools you need for web scraping, something that CasperJS lacks (that's not its main goal). ScraperJS is also more flexible than CasperJS alone. If you want static content, just use the static scraper and get lightning-fast results. TL;DR: CasperJS is great, but it's not made for web scraping.


If you're interested in scraping in python, then I recommend giving this a read: http://jakeaustwick.me/python-web-scraping-resource/


I think Scrapy is a better Python scraping tool.


And I prefer to use R for scraping. There are so many ways to scrape, and it really comes down to personal preference. It is a good time to be alive :) So many good choices.


This is awesome. I am very new to scraping, so bear with me if this is very obvious.

Would it be possible to follow a list of URLs from a home page (Ex: List of Marathon Runners), and then follow the link in their name that goes to their stats page, and download / save the scraped data as JSON to a text file on the local machine's C:\Runners\Data\ folder for example?

Also, does anyone know of a reliable and tested C# / .NET / ASP.NET web page scraper?


On the second question: typically a web scraper just interacts with the output of a web server, so it shouldn't matter whether it's ASP.NET or any other system.


Mostly this. In ASP.NET/C# you're probably looking at using the built-in HttpClient lib [0] and an HTML parser lib like HtmlAgilityPack [1]. I've used this combo in the past and am happy with it.

[0] http://msdn.microsoft.com/en-us/library/system.net.http.http... [1] http://htmlagilitypack.codeplex.com/


Thanks!!


Since this runs on Node.js, you could use edge.js to use scraperjs in your .NET project.


On the first question, yes, it's possible and easily done with the Router.


Thank you!


Let us know if you'd like to integrate this with http://www.80legs.com!


If anyone is interested in just scraping links between webpages with JavaScript, I made Slinky (https://github.com/andrejewski/slinky). The API is simple and easily overridable.


Could someone recommend a similar but Ruby-based framework? Just asking because I'm more skilled in Ruby than in Node (not for trolling purposes).

I've been exploring GitHub but could not find a well-maintained framework (or at least one updated within the last month).


Check out Mechanize: https://github.com/sparklemotion/mechanize

I haven't used the Ruby version, but I am pretty happy with the Python port of it. It's lighter and faster than Phantom, but it won't do JavaScript interpretation.


I found a tutorial making it look pretty easy to do with Mechanize: http://readysteadycode.com/howto-scrape-websites-with-ruby-a...

Note, not tried this myself!


While I haven't tried any, I think if you want to handle dynamic Javascript content, you'd have to go with a JS library. Feel free to correct me if I'm wrong.


You can do it in pure Ruby with one of the WebKit wrappers (e.g. Poltergeist [0]).

[0] https://github.com/teampoltergeist/poltergeist


I've always used Mechanize + BeautifulSoup in Python.. I think Mechanize also has a Ruby lib...


Nice! Could've used that this weekend when I got caught in callback hell trying to build a simple NodeJS scraper. Ended up doing it in PHP just because I know it well.

I'll give it another go with this library next week!


I really like the router aspect of this. That's a nice idea and not (to the best of my limited memory) one I can recall seeing in any other scraper.


Artoo is soooo much better :) https://medialab.github.io/artoo/


I don't quite get the point of the DynamicScraper... Any real use cases for that?


For example, go to http://www.imdb.com

On the right, you'll notice that under the sidebar "Opening This Week" is a movie titled "Love Is Strange".

With that in mind, press Ctrl+U (view html source).

Try to search for the word "Strange" anywhere in the source. (It's not there.) If it's not there, how did it get shown on the screen?!

The answer is that it is "dynamically" loaded. A simple scraper that only works on a static download of html source won't be able to retrieve that string. You need web scrapers that can process dynamic pages (execute Javascript).

Btw, you'll notice that you can find the string "Strange" via F12 (Developer Tools). That's because the F12 inspector shows the html after the DOM has been dynamically modified by javascript whereas Ctrl+U does not.
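A toy illustration of the same point in plain JavaScript (the strings are made-up stand-ins for IMDb's markup, not its real source):

```javascript
// What a static scraper sees: the raw HTML source, where the sidebar is an
// empty container filled in later by a script the scraper never executes.
const rawHtml = '<ul id="openings"></ul><script src="sidebar.js"></script>';

// What a dynamic scraper (PhantomJS etc.) sees: the DOM after scripts ran.
const domAfterScripts = '<ul id="openings"><li>Love Is Strange</li></ul>';

console.log(rawHtml.includes('Love Is Strange'));         // false
console.log(domAfterScripts.includes('Love Is Strange')); // true
```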


The latter probably runs the script as though you are within the context of a web page (so full Ajax/JS support).

I assume the Simple version might be completely written in Node.js - so parses the HTML content, but no dynamic scripting.

The important thing to note is that with the Dynamic scraper, you can't use closures in your internal functions, as they won't get executed within your Node.js context, but in PhantomJS.

As for a use case, I do it for https://myshopdata.com to allow retailers to extract their product information with rich content and variation support (even if it's loaded by the user interacting with a dropdown on variations). It then allows you to publish this to marketplaces, while keeping the information in sync by monitoring.


I _think_ the latter interprets JavaScript while the former only allows you to read the rendered HTML?


If you want a scraper as a service, you can try: https://PhantomJsCloud.com

Disclaimer: I wrote it.


Looks pretty good, shame about the promises.



