Unrelated to the article - is it just me, or is this ScrapingBee product borderline nefarious? From the homepage:
> Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots!
> Scrapingbee helps us to retrieve information from sites that use very sophisticated mechanism to block unwanted traffic, we were struggling with those sites for some time now and I'm very glad that we found ScrapingBee.
It really depends. There are plenty of legitimate uses for scraping (for example, I've been involved with academic research that involved scraping Twitter search results), and collecting the amount of data you need is only really feasible with scraping plus paid proxies. That being said, there are also a number of nefarious paid proxy services that offer residential IPs (read: are usually botnets).
The Twitter API has very low rate limits (from a data collection perspective). While there may be good reasons for that, these limits also preclude doing public interest research of the type we were doing (how Twitter's various search filters influence the political leanings of search results). When companies have Twitter's level of societal influence, I think it's also possible to define "legitimate use" in terms of public interest, rather than simply "users" or "site owners."
It is definitely super annoying that companies are allowed to spy on us and do all kinds of crazy things with our data, all using computers and automation and "bots" and such, while individuals are increasingly not allowed to use automation to help ourselves out online. Seems rather one-sided. On the other hand, I get that abuse is a huge problem. I do wish that at least bots operating at roughly human request rates and daily totals were considered OK and universally allowed, without the risk of blocks or other difficulties that drive up maintenance costs (and so make the bots less valuable).
Sometimes the scraping situation gets kind of ironic. I worked at a large eRetailer/marketplace, and obviously we scraped our major competitors just as they scraped us (there are four major marketplaces here). So each company had a team to implement anti-scraping measures and defeat the competitors' defences. Instead of providing an API, everyone decided to spend time and money on this useless arms race.
Absent someone breaking really far away from the pack, that's a classic example of one type of "bullshit job" called out in Graeber's book... Bullshit Jobs: zero-sum, ever-escalating competition. Militaries are another obvious example (we'd all be better off if every country's military spending were far closer to zero, but no one country can risk lowering it unilaterally, and may even be inclined to increase its own in response to neighbors, which sometimes gets so insanely wasteful that you see something like the London Naval Treaty or SALT come about in response), but so is a great deal of advertising and marketing activity (you have to spend more only because your competitor started spending more; end result, status quo maintained, but more money spent all around).
I wonder how anyone in IT could take Graeber seriously. One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time, which is demonstrably false.
The presentation of that in the book, based on a message from someone in the industry, doesn't seem out of line with the overall tone and reliability level that Graeber explicitly sets out at the beginning: namely, that the book is not rigorous science, and that it's mainly concerned with considering why people would perceive their own jobs to be bullshit.
[EDIT]
> One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.
Further, I'm not even sure that's incorrect. It can both be true that most open source (that's actually used by anyone) is done by people who are paid to do it, and that most programmers have very little interesting or challenging to do at work unless they work on hobby projects—maybe open source—in their free time.
The overall letter as quoted in the book, and Graeber's commentary on it, actually makes some good points aside from all this. Things don't have to be perfect to be useful.
A lot of the data I provide to services is exposed to other individuals so that the service can function. That doesn't mean the data belongs to those people or that they can freely use it elsewhere.
Allowing unfettered scraping and repurposing of data would have a chilling effect on all types of services. For example I wouldn't necessarily want a bot to scrape my comment history on HN, doxx me, and share my identity and comments with others.
I believe that whenever the “no automation/scraping/bots” clauses in Ts&Cs have been tested in court, they have never held up. However, that’s not to say a service can’t just cancel your account if you are found to be using one.
Having run a site that had a bot get stuck in a loop and suddenly send 10,000x the normal request rate, I can say that when they go wrong it’s super annoying for the website owner. We ultimately just banned all of the AWS IP ranges.
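For anyone in the same spot: AWS publishes its current IP prefixes at https://ip-ranges.amazonaws.com/ip-ranges.json, so the ban list can be generated rather than maintained by hand. A rough sketch, assuming nginx is in front and you only want to block EC2 (adjust the filter or output format for your own setup):

```python
import json
import urllib.request

# AWS publishes its current IP ranges at this well-known URL.
AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    ranges = json.load(resp)

# Emit one nginx "deny" directive per EC2 prefix; drop the filter to block all of AWS.
for prefix in ranges["prefixes"]:
    if prefix["service"] == "EC2":
        print(f"deny {prefix['ip_prefix']};")
```

Pipe the output into an include file and reload nginx; the list changes over time, so regenerating it on a schedule is the easy way to keep it current.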
"Nefarious" is a strong word. Courts have repeatedly ruled that scraping data that is otherwise available publicly is legal. You may not personally agree with the ethics, but there are a lot of people who do.
It is very far from a DDoS tool. Scraping can be done from a single source, one request at a time, with self-imposed rate limits. Sure, it can overwhelm a server, but then so can a single user opening 10 tabs.
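To be concrete, the polite end of the spectrum is a few lines of stdlib code; the URLs, delay, and user-agent string below are placeholders, not anything a particular tool actually does:

```python
import time
import urllib.request

# Placeholder URLs; a real job would read these from a list or a sitemap.
URLS = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

DELAY_SECONDS = 5  # self-imposed pause between requests

for url in URLS:
    # Identify the bot honestly instead of pretending to be a browser.
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-research-bot/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    print(url, len(body), "bytes")
    time.sleep(DELAY_SECONDS)  # keep the request rate well below a human browsing session
```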
That's not what this tool does though. It allows you to distribute your scraping to a layer of proxies. So, the only difference is whether there is an intent to do harm to the target or merely collect data... which could be a form of doing harm as well?
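To illustrate the mechanical difference, routing each request through a different proxy from a pool is also only a few lines; the proxy addresses here are hypothetical stand-ins for whatever pool such a service supplies:

```python
import random
import urllib.request

# Hypothetical proxy pool; a proxy service would supply many more addresses.
PROXIES = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
]

def fetch_via_random_proxy(url: str) -> bytes:
    """Fetch a URL through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXIES)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url) as resp:
        return resp.read()

print(len(fetch_via_random_proxy("https://example.com/")), "bytes")
```

The code is the same either way; as the parent says, what changes is whether you use it to stay polite or to dodge someone's rate limits.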
There are plenty of tools like this where going up to the line is much different from crossing it. There's a vast difference between driving your car to an event and driving the few extra meters into the crowd at the event. You can cut down a tree with a chainsaw, or cut down a tree onto your neighbour's house.
There's definitely an argument that dangerous tools should be regulated to varying degrees. If we're arguing regulations in this specific area, you'd probably also be balancing them with regulations saying that sites can't close an account for reasonable-rate automated access, and that public research can use higher rates so long as they're not crippling the site.
The tree example is true and why I agree these things are very similar. The only significant difference is when you put it on your neighbor’s house on purpose.
I wouldn’t regulate this, but if you’re introducing regulations, why not just require the source to deliver the data in a neatly packaged format? The necessity for scraping, and with it the potential for DDoS and other nefarious behavior, basically goes away.
Based on another comment, and the wikipedia article they linked to, it looks like the Supreme Court vacated the decision and remanded the case for further review in June 2021 (probably after this article).[1] Unfortunately there is no citation for that sentence so I'm not entirely sure.
I think that means the jury is still out, as you mentioned, but it's leaning towards scraping being legal as long as the data is publicly available. IANAL
Surely you must be joking. Alphabet is the largest web scraper in the world. They would soon go out of business if robots.txt was the only data they scraped.
It’s not a web crawler. They are all web scrapers. And Alphabet/Google sells this data and makes profits from it.
It is not like it is trying to hide the fact that it is king web scraper.
Google has gotten in trouble from various publishers for this before. It is no secret there is a double standard in big tech.
Again if you are going to arrest a web scraper, then arrest the king of all web scrapers first to make it fair.
Data wants to be free. If it is publicly accessible then it is fair game.
So, no source? Your response is unrelated to the statement at hand.
Think about it: Google has every advantage in respecting robots.txt and nothing to gain by ignoring it.
E.g.:
1) If a media company doesn't want to get crawled: add it to robots.txt.
Then they realize their visitor numbers drop and they'll remove it again.
Ergo: publishers sue, because they want the advantages but without the scraping. That doesn't seem logical to me, since they currently give Google explicit permission to scrape their content.
2) If Google sometimes leaked personal documents protected by robots.txt, it could have a lot of lawsuits on its hands.
Robots.txt is a simple method to not get blamed (and it's trivial for a crawler to honor; see the sketch below).
Ignoring robots.txt could literally be a core business liability from my POV.
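For what it's worth, honoring robots.txt takes almost no effort on the crawler side, which makes ignoring it an even stranger trade-off. A minimal sketch with made-up rules and URLs, using Python's stdlib parser:

```python
from urllib import robotparser

# Hypothetical robots.txt content a publisher might serve.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before fetching each URL.
print(rp.can_fetch("Googlebot", "https://example.com/article/1"))    # True
print(rp.can_fetch("Googlebot", "https://example.com/private/doc"))  # False
```

In a real crawler you'd call rp.set_url(...) and rp.read() against the site's live robots.txt instead of parsing an inline string.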
---
So please: a source beyond gut feeling, as requested before, would be greatly appreciated.
My point is that Google scrapes web data. It is the king web scraper.
Robots.txt does not fit into this argument. I'm not sure why it was brought up. Google doesn't scrape URLs listed there? OK. And so? Am I to believe that just because Google says so?
Google scrapes what it wants. It does so for its shareholders. It couldn't care less about web standards.
If you know you have the flu (or COVID-19), surely you know you are likely contagious and can spread it to more vulnerable people. Yes, it is your fault and you deserve blame if you knowingly spread it to others through negligence. Stay home if you're sick.
I agree those who are symptomatic should stay home. But for COVID we go further and tell asymptomatic people to perform (flawed) transmission-control measures. Where do we do that with the flu?
The point is if I don’t know I have COVID and spread it to Grandma at the grocery store, in your eyes I’ve killed Grandma. I wonder why we don’t apply that logic everywhere.
The influenza vaccine is a flawed transmission control measure and is the best we have available, same as masks. We do blame people who skip those, although it doesn't show on their faces like the absence of a mask does.
The transmissibility of the flu is also much lower than that of COVID-19 (largely thanks to the vaccines), which is really why people never regarded masks as necessary for the flu.