Unrelated to the article - is it just me, or is this ScrapingBee product borderline nefarious? From the homepage:
> Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots!
> Scrapingbee helps us to retrieve information from sites that use very sophisticated mechanism to block unwanted traffic, we were struggling with those sites for some time now and I'm very glad that we found ScrapingBee.
It really depends. There are plenty of legitimate uses for scraping (for example, I've been involved with academic research that involved scraping Twitter search results), and collecting the amount of data you need is only really feasible with scraping plus paid proxies. That being said, there are also a number of nefarious paid proxy services that offer residential IPs (read: are usually botnets).
The Twitter API has very low rate limits (from a data collection perspective). While there may be good reasons for that, these limits also preclude doing public interest research of the type we were doing (how Twitter's various search filters influence the political leanings of search results). When companies have Twitter's level of societal influence, I think it's also possible to define "legitimate use" in terms of public interest, rather than simply "users" or "site owners."
It is definitely super annoying that companies are allowed to spy on us and do all kinds of crazy things with our data, all using computers and automation and "bots" and such, while individuals are increasingly not allowed to use automation to help ourselves out online. Seems rather one-sided. On the other hand, I get that abuse is a huge problem. I do wish that at least bots operating at roughly human request rates and daily totals were considered OK and universally allowed, without the risk of blocks or other difficulties that drive up maintenance costs (and so make the bots less valuable).
Sometimes the scraping situation gets kind of ironic. I worked at a large eRetailer/marketplace, and obviously we scraped our major competitors just as they scraped us (there are four major marketplaces here). So each company had a team to implement anti-scraping measures and defeat the competitors' defences. Instead of providing an API, everyone decided to spend time and money on this useless arms race.
Absent someone breaking really far away from the pack, that's a classic example of one type of "bullshit job" called out in Graeber's book... Bullshit Jobs: zero-sum, ever-escalating competition. Militaries are another obvious example (we'd all be better off if every country's military spending were far closer to zero, but no one country can risk lowering it unilaterally, and may even be inclined to increase its own in response to neighbors, which sometimes gets so insanely wasteful that you see something like the London Naval Treaty or SALT come about in response), but so is a great deal of advertising and marketing activity (you have to spend more only because your competitor started spending more; end result, status quo maintained, but more money spent all around).
I wonder how anyone in IT could take Graeber seriously. One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time, which is demonstrably false.
The presentation of that in the book, based on a message from someone in the industry, doesn't seem out of line with the overall tone and reliability level that Graeber explicitly sets out at the beginning: namely, that the book is not rigorous science, and that it's mainly concerned with considering why people would perceive their own jobs to be bullshit.
[EDIT]
> One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.
Further, I'm not even sure that's incorrect. It can both be true that most open source (that's actually used by anyone) is done by people who are paid to do it, and that most programmers have very little interesting or challenging to do at work unless they work on hobby projects—maybe open source—in their free time.
The overall letter as quoted in the book, and Graeber's commentary on it, actually makes some good points aside from all this. Things don't have to be perfect to be useful.
A lot of the data I provide to services is exposed to other individuals so that the service can function. That doesn't mean the data belongs to those people or that they can freely use it elsewhere.
Allowing unfettered scraping and repurposing of data would have a chilling effect on all types of services. For example I wouldn't necessarily want a bot to scrape my comment history on HN, doxx me, and share my identity and comments with others.
I believe that whenever the “no automation/scraping/bots” clauses in Ts&Cs have been tested in court, they have never held up. However, that’s not to say a service can’t just cancel your account if you are found to be using one.
Having run a site that had a bot get stuck in a loop and suddenly send 10,000x the normal request rate, I can say that when they go wrong it’s super annoying for the website owner. We ultimately just banned all of the AWS IP ranges.
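For anyone in the same spot: AWS publishes its current IP prefixes at https://ip-ranges.amazonaws.com/ip-ranges.json, so the ban list can be generated rather than maintained by hand. A rough sketch, assuming nginx is in front and you only want to block EC2 (adjust the filter or output format for your own setup):

```python
import json
import urllib.request

# AWS publishes its current IP ranges at this well-known URL.
AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    ranges = json.load(resp)

# Emit one nginx "deny" directive per EC2 prefix; drop the filter to block all of AWS.
for prefix in ranges["prefixes"]:
    if prefix["service"] == "EC2":
        print(f"deny {prefix['ip_prefix']};")
```

Pipe the output into an include file and reload nginx; the list changes over time, so regenerating it on a schedule is the easy way to keep it current.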
"Nefarious" is a strong word. Courts have repeatedly ruled that scraping data that is otherwise available publicly is legal. You may not personally agree with the ethics, but there are a lot of people who do.
It is very far from a DDoS tool. Scraping can be done from a single source, one request at a time, with self-imposed rate limits. Sure, it can overwhelm a server, but then so can a single user opening 10 tabs.
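To be concrete, the polite end of the spectrum is a few lines of stdlib code; the URLs, delay, and user-agent string below are placeholders, not anything a particular tool actually does:

```python
import time
import urllib.request

# Placeholder URLs; a real job would read these from a list or a sitemap.
URLS = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

DELAY_SECONDS = 5  # self-imposed pause between requests

for url in URLS:
    # Identify the bot honestly instead of pretending to be a browser.
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-research-bot/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    print(url, len(body), "bytes")
    time.sleep(DELAY_SECONDS)  # keep the request rate well below a human browsing session
```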
That's not what this tool does though. It allows you to distribute your scraping to a layer of proxies. So, the only difference is whether there is an intent to do harm to the target or merely collect data... which could be a form of doing harm as well?
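To illustrate the mechanical difference, routing each request through a different proxy from a pool is also only a few lines; the proxy addresses here are hypothetical stand-ins for whatever pool such a service supplies:

```python
import random
import urllib.request

# Hypothetical proxy pool; a proxy service would supply many more addresses.
PROXIES = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
]

def fetch_via_random_proxy(url: str) -> bytes:
    """Fetch a URL through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXIES)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url) as resp:
        return resp.read()

print(len(fetch_via_random_proxy("https://example.com/")), "bytes")
```

The code is the same either way; as the parent says, what changes is whether you use it to stay polite or to dodge someone's rate limits.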
There are plenty of tools like this where going up to the line is much different from crossing it. There's a vast difference between driving your car to an event and driving the few extra meters into the crowd at the event. You can cut down a tree with a chainsaw, or cut down a tree onto your neighbour's house.
There's definitely an argument that dangerous tools should be regulated to varying degrees. If we're arguing regulations in this specific area, you'd probably also be balancing them with regulations saying that sites can't close an account for reasonable-rate automated access, and that public research can use higher rates so long as they're not crippling the site.
The tree example is true and why I agree these things are very similar. The only significant difference is when you put it on your neighbor’s house on purpose.
I wouldn’t regulate this, but if you’re introducing regulations, why not just require the source to deliver the data in a neatly packaged format? The necessity for scraping, and with it the potential for DDoS and other nefarious behavior, basically goes away.
Based on another comment, and the wikipedia article they linked to, it looks like the Supreme Court vacated the decision and remanded the case for further review in June 2021 (probably after this article).[1] Unfortunately there is no citation for that sentence so I'm not entirely sure.
I think that means the jury is still out, as you mentioned, but it's leaning towards scraping being legal as long as the data is publicly available. IANAL
Surely you must be joking. Alphabet is the largest web scraper in the world. They would soon go out of business if robots.txt was the only data they scraped.
It’s not a web crawler. They are all web scrapers. And Alphabet/Google sells this data and makes profits from it.
It is not like it is trying to hide the fact that it is king web scraper.
Google has gotten in trouble from various publishers for this before. It is no secret there is a double standard in big tech.
Again if you are going to arrest a web scraper, then arrest the king of all web scrapers first to make it fair.
Data wants to be free. If it is publicly accessible then it is fair game.
So, no source? Your response is unrelated to the statement at hand.
Think about it: Google has every advantage in respecting robots.txt and nothing to gain by ignoring it.
E.g.:
1) If a media company doesn't want to get crawled: add it to robots.txt.
Then they realize their visitor numbers drop and they'll remove it again.
Ergo: publishers sue, because they want the advantages but without the scraping. That doesn't seem logical to me, since they currently give Google explicit permission to scrape their content.
2) If Google sometimes leaked personal documents protected by robots.txt, it could have a lot of lawsuits on its hands.
Robots.txt is a simple method to not get blamed (and it's trivial for a crawler to honor; see the sketch below).
Ignoring robots.txt could literally be a core business liability from my POV.
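For what it's worth, honoring robots.txt takes almost no effort on the crawler side, which makes ignoring it an even stranger trade-off. A minimal sketch with made-up rules and URLs, using Python's stdlib parser:

```python
from urllib import robotparser

# Hypothetical robots.txt content a publisher might serve.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before fetching each URL.
print(rp.can_fetch("Googlebot", "https://example.com/article/1"))    # True
print(rp.can_fetch("Googlebot", "https://example.com/private/doc"))  # False
```

In a real crawler you'd call rp.set_url(...) and rp.read() against the site's live robots.txt instead of parsing an inline string.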
---
So please: a source beyond gut feeling, as requested before, would be greatly appreciated.
My point is that Google scrapes web data. It is the king web scraper.
Robots.txt does not fit into this argument. I'm not sure why it was brought up. Google doesn't scrape URLs listed there? OK. And so? Am I to believe that just because Google says so?
Google scrapes what it wants. It does so for its shareholders. It couldn't care less about web standards.
If you know you have the flu (or COVID-19), surely you know you are likely contagious and can spread it to more vulnerable people. Yes, it is your fault and you deserve blame if you knowingly spread it to others through negligence. Stay home if you're sick.
I agree those who are symptomatic should stay home. But for COVID we go further and tell asymptomatic people to perform (flawed) transmission-control measures. Where do we do that with the flu?
The point is if I don’t know I have COVID and spread it to Grandma at the grocery store, in your eyes I’ve killed Grandma. I wonder why we don’t apply that logic everywhere.
The influenza vaccine is a flawed transmission control measure and is the best we have available, same as masks. We do blame people who skip those, although it doesn't show on their faces like the absence of a mask does.
The transmissibility of the flu is also much lower than that of COVID-19 (largely thanks to the vaccines), which is really why people never regarded masks as necessary for the flu.