It's unclear to me how to actually run this. Executing only the two commands listed under the Installing section doesn't run it - I had to `cd` into the scraperjs dir, run `npm install`, and then run the second install command (`grunt test`) to actually run the tests.
Also, do you install scraperjs into each project directory you want to use it for? Or just install it once?
Scraperjs is supposed to be used as an npm package.
So, if you do "npm install <package-name>", you download the latest version of the package into a node_modules folder next to the closest package.json file (if there's none, it goes to your ~/ folder). At that point you can just use it with "require('scraperjs')" (sketch below).
The test part is a bit foggier, and I'll add more information to the README in due time. To run the tests you've got to npm-install with the save-dev flag (`npm install --save-dev scraperjs`), which also adds the package to your development dependencies. This split exists so that people who just want to use the package won't need to download all of scraperjs' development dependencies.
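To make the usage part concrete, here's a minimal sketch adapted from the README's front-page example (the exact API may differ between versions, so check the current docs):

```js
var scraperjs = require('scraperjs');

// StaticScraper downloads the raw HTML and hands you a cheerio-style $ to query.
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
  .scrape(function($) {
    return $('.title a').map(function() {
      return $(this).text();
    }).get();
  })
  .then(function(titles) {
    console.log(titles);
  });
```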
It would be helpful if the documentation compared how Scraperjs is different from, or better than, CasperJS for scraping. CasperJS is the older and better-known wrapper around PhantomJS, so comparisons would help people decide which tool is appropriate.
I guess the biggest difference is that Casper isn't a Node.js module, so interaction with Node (and the npm package archive) becomes hard. I am releasing something similar in a couple of weeks, but aimed at JS sandboxing instead of scraping. :)
For cases where you don't need a full-featured browser to get the data you need, Scraperjs using the Cheerio backend should be WAY faster than casper/phantom.
I've not used Scraperjs yet, but cheerio is pretty great.
I think you'd need CasperJS when you need to perform browser actions (log in, click on a particular button, fill a form, etc.). But if you just want to scrape content (e.g. episode URLs of The Daily Show from Hulu), then ScraperJS should be enough (and faster?).
If you mean that the syntax is different, yes, I get that.
CasperJS can also scrape dynamic websites.
By what criteria would someone choose ScraperJS over CasperJS for that task?
Are there features in ScraperJS that don't exist in CasperJS?
Does it take 10x fewer lines of code to accomplish the same task? Etc.
For web scraping purposes, ScraperJS and CasperJS would probably take about the same number of lines of code; however, ScraperJS has most of the tools you need for web scraping, something CasperJS lacks (that's not its main goal).
ScraperJS is also more flexible than CasperJS. If you want static content, just use the static scraper and get lightning-fast results.
TL;DR: CasperJS is great but it's not made for web scraping.
And I prefer to use R for scraping. There are so many ways to scrape, and it really comes down to personal preference. It is a good time to be alive :) So many good choices.
This is awesome. I am very new to scraping, so bear with me if this is very obvious.
Would it be possible to follow a list of URLs from a home page (Ex: List of Marathon Runners), and then follow the link in their name that goes to their stats page, and download / save the scraped data as JSON to a text file on the local machine's C:\Runners\Data\ folder for example?
Also, does anyone know of a reliable and tested C# / .Net / ASP.Net web page scraper?
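On the first question: yes, that's a classic two-pass scrape - pull the links off the index page, then visit each one and write the result to disk. A rough sketch with scraperjs, where the URL, selectors, and output folder are all hypothetical and assume the links are absolute:

```js
var scraperjs = require('scraperjs');
var fs = require('fs');

// Hypothetical index page listing runners, each name linking to a stats page.
scraperjs.StaticScraper.create('http://example.com/runners')
  .scrape(function($) {
    return $('a.runner-name').map(function() {
      return $(this).attr('href');
    }).get();
  })
  .then(function(links) {
    links.forEach(function(link) {
      scraperjs.StaticScraper.create(link)
        .scrape(function($) {
          // Hypothetical selectors for the stats page.
          return {
            name: $('h1.name').text(),
            time: $('.finish-time').text()
          };
        })
        .then(function(runner) {
          // Save each runner as JSON, e.g. under C:\Runners\Data\ on Windows.
          fs.writeFileSync('C:/Runners/Data/' + runner.name + '.json',
            JSON.stringify(runner, null, 2));
        });
    });
  });
```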
On the second question: typically a web scraper just interacts with the output of a web server, so it shouldn't matter whether it's ASP.NET or any other system.
Mostly this. In ASP.NET/C# you're probably looking at using the built-in HttpClient lib [0] and an HTML parser lib like HTMLAgilityPack [1]. I've used this combo in the past and am happy with it.
If anyone is interested in just scraping links between webpages with JavaScript, I made Slinky (https://github.com/andrejewski/slinky). The API is simple and easily overridable.
I haven't used the ruby version but I am pretty happy with the python port of it. It's lighter and faster than phantom, but it won't do javascript interpretation.
While I haven't tried any, I think if you want to handle dynamic Javascript content, you'd have to go with a JS library. Feel free to correct me if I'm wrong.
Nice! Could've used that this weekend when I got caught in callback hell trying to build a simple NodeJS scraper. Ended up doing it in PHP just because I know it well.
I'll give it another go with this library next week!
On the right, you'll notice that under the sidebar "Opening This Week" is a movie titled "Love Is Strange".
With that in mind, press Ctrl+U (view HTML source).
Try searching for the word "Strange" anywhere in the source - it's not there. So how did it get shown on the screen?!
The answer is that it is "dynamically" loaded. A simple scraper that only works on a static download of the HTML source won't be able to retrieve that string. You need a web scraper that can process dynamic pages (i.e. execute JavaScript).
Btw, you'll notice that you can find the string "Strange" via F12 (Developer Tools). That's because the F12 inspector shows the HTML after the DOM has been dynamically modified by JavaScript, whereas Ctrl+U does not.
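This is exactly what scraperjs' DynamicScraper (or Casper/Phantom) is for: the page is loaded in a headless browser, its scripts run, and you query the resulting DOM. A sketch - the URL and selector are hypothetical placeholders, since the site isn't named here:

```js
var scraperjs = require('scraperjs');

// DynamicScraper loads the page in PhantomJS and runs its scripts first, so
// JS-inserted content (like that "Opening This Week" sidebar) is in the DOM.
scraperjs.DynamicScraper.create('http://example.com/movies')
  .scrape(function($) {
    return $('.opening-this-week a').map(function() {
      return $(this).text();
    }).get();
  })
  .then(function(titles) {
    console.log(titles); // should now include "Love Is Strange"
  });
```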
The latter probably runs the script as though you are within the context of a web page (so full Ajax/JS support).
I assume the Simple version is written purely in Node.js - so it parses the HTML content, but there's no dynamic scripting.
The important thing to note is that with the Dynamic scraper you can't use closures in your internal functions, as they won't get executed within your Node.js context, but in PhantomJS (see the sketch after this comment).
As for a use case, I do it for https://myshopdata.com to allow retailers to extract their product information with rich content and variation support (even if it's loaded by the user interacting with a dropdown on variations). It then allows you to publish this to marketplaces, while keeping the information in sync by monitoring.
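To make the closure caveat concrete: the function you hand to the dynamic scraper is serialized and evaluated inside PhantomJS, so it can't see variables from your Node.js scope. A sketch of the pitfall (names and URL hypothetical):

```js
var scraperjs = require('scraperjs');

var selector = '.title a'; // lives in the Node.js process

scraperjs.DynamicScraper.create('http://example.com/')
  .scrape(function($) {
    // BROKEN: this function is serialized and executed inside PhantomJS,
    // not Node.js, so `selector` is undefined here - closures over the
    // outer script don't survive the trip.
    return $(selector).text();
  })
  .then(function(result) {
    console.log(result);
  });
```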