Detecting Chrome headless (antoinevastel.github.io)
375 points by avastel on Aug 5, 2017 | 157 comments



Your solutions for detecting Chrome headless are good.

But someone who really wants to do web scraping or anything similar will use a real browser like Firefox or Chrome, run it through xvfb, and control it using webdriver, maybe exposing it through an API. I find these to be almost undetectable. The only way you can mitigate this is with more interesting techniques, like IP detection, captchas, etc.

edit: when I say real browser, I mean running the full browser process including extensions etc.


Yes, but it's slower and more expensive that way. If you want to run a few hundred parallel spider processes with full browser in each, it's neither easy nor cheap anymore. It takes a lot more resources than running it headless, plus adds a significant overhead to automate and control all that.

Ultimately, no protection is unbreakable; there's a workaround for almost everything. If your site has thousands of pages (e.g. big online stores, which are a common target for spidering), it's probably the best approach to make things as slow and complicated for the spider author as possible. That's exactly the same logic as with captchas: they can be broken fairly easily nowadays, but they'll still slow down the spidering rate and pump up the cost.


> If you want to run a few hundred parallel spider processes with full browser in each, it's neither easy nor cheap anymore. It takes a lot more resources than running it headless, plus adds a significant overhead to automate and control all that.

This is not true. I've seen it done in AWS for less than $2,000 monthly, and I can do it from my home for less than $500 per month (minus the costs of my workstation and networking gear, which you'd amortize over the expected lifetime of the project). I have a home server with 128GB of RAM and an i7-6900K, with 125 static IPs and a bunch of Ubiquiti networking gear. You don't even need that much memory or compute power, but I also use my workstation for other research projects. I use my own static IPs to parallelize without having to sacrifice latency or cede control to a shady proxy farm.

It's a pretty straightforward setup - each static IP is given its own route across the switches from the gateway, and the switches have link-aggregated connections for 40GbE bandwidth. You have pub/sub and queuing, and each scraping target is its own headless Chrome process. The requests to the target from each process are sent round-robin across the available interfaces. Then you've got parsing and a local database.

I'm not a particularly invasive scraper (I make my User Agent deliberately obvious with an explicit way to opt out), but it's really not true that it's prohibitively expensive. This is a pretty cheap setup; I don't even profit from the work, I use it for personal research. If someone was actively selling valuable data, this would absolutely be worth it.


Curious about your home setup, what ISP are you using at home that lets you have essentially a /25 block of public IPs, let alone 40GbE of bandwidth? Especially if this is costing you $500/month.


I have Verizon Fios set up as a business account to my home. However that bandwidth is for SAN to workstation data processing and analysis. There's only half a gigabit of external internet bandwidth, but I have hundreds of terabytes of data (and high tens of gigabytes are downloaded per day) in local storage.

With 128GB of RAM I can only load small amounts into memory for targeted analysis. Processing the rest of the data requires loading it directly from storage. To improve I/O performance I parallelize the transfer across link aggregated ethernet interfaces. Naturally that would cause disk reads to become a bottleneck; to take proper advantage of the network transfer speeds I hold all data in RAID 0 with 7200 RPM drives.


If your IPs are in a contiguous /25, they can all be blocked with a single firewall rule. And detection is also easy. It only helps with rate limits.

To avoid a block, you need a list of good socks5 proxies


> if your IPs are in a contiguous /25

They're not.

> To avoid a block, you need a list of good socks5 proxies

No; furthermore, that leaks your data to a third party and introduces significant latency.


Certainly curious about getting 40GbE bandwidth, but the IP addresses don't seem like much of a hurdle. I was looking at business broadband a couple of weeks ago, and getting 13 static IPs was only £5 extra a month. Going all the way up to a leased line included unlimited "subject to internet regulations".


I'd love to read more details about your hardware setup. Care to write a blog post or something similar?


It's a matter of personal perspective. If you're based in the US and have the luxury of working with enterprise clients, then sure. On the other hand, a few years ago when I was doing data scraping as a freelancer, a price tag of $2K just for the infrastructure would have been a huge showstopper for most of my average clients back then. We were charging them way less than that.


Would you be willing to chat more about this one on one for research (non-commercial) purposes?


Sure. I don't sell data but I'm happy to talk shop or help with interesting research projects.


Every spider process is a browser tab. People have hundreds of tabs open without any problem on ordinary computers. On a dedicated server with 128GB of RAM you can run thousands of those, and it will cost you $100-$200/month.


True, but the user is not interacting with all those hundreds of tabs at the same time, which an automated scraper would be.

Firefox + Chrome both lower the priority of background tabs, and may be doing other tricks so the background tabs can stick around and be switched to quickly.


Well, most of the time spiders are idle while waiting for I/O operations to complete. The main bottleneck is not CPU, but RAM and bandwidth.


It depends on what analysis is being done on each page.


What kind of analysis are you expecting from a crawler whose main purpose is to grab a webpage?


The main purpose is to render the page, crawl the DOM for the data, and then load the next page as fast as possible... so it's the equivalent of having a thousand bookmarks and opening them all at the same time, and as soon as each loads, running the scraper script and reloading it with the new link. Try it for yourself, it's easy to script, and check the load...


Presumably, there is some reason you are crawling.


Where can I get a dedicated server with 128GB of RAM for $100-$200/mo?


In addition to making things complicated, making them unpredictable makes programmatic scraping very difficult:

  * machine-generated HTML and CSS identifiers
  * randomly inserted, unused HTML elements
  * images instead of text


To be honest (and speaking from experience), the only thing that would actually make life hard for a scraper is the last suggestion. I've never personally dealt with that, which is the only reason why I concede it.

Machine generated HTML is a pain, but I wouldn't call that "very difficult" - it's more like "annoying" in the same way that having to parse HTML instead of finding a neat JSON endpoint is.

And I'm not sure how randomly inserted HTML elements would help - if you're already parsing the HTML, you can extract the relevant data. Unless you're trying to parse the HTML with regex, in which case: https://stackoverflow.com/questions/1732348/regex-match-open...


The first two bullets combined can make it hard to parse the page reliably, but they also make maintaining and using any 'good' JavaScript and/or CSS very hard. Usually not worth the trouble.


In addition to dsacco's reply, it should be mentioned that using images instead of text hurts accessibility and, depending on your industry and country, it may also run afoul of equal access laws.


Depending on a page and data it can make life a bit harder for the spider, but all that can be taken care of.

IMHO the best way to stop spiders is to control access to your pages. Require users to register and then log in each time to see the data, and then have some proactive monitoring & countermeasures in place, monitoring the patterns per user, per IP, and across the whole system for an unexpected increase in activity or bot-like behavior (moving too fast or at unusually uniform speed, following links too sequentially, etc.).

If that's not an option, the next best approach is to make it hard for the spider to fetch the whole dataset. Don't let users just browse all the pages; instead, force them to use search, which makes it much harder to cover everything (and it will not affect normal users too much). You can, for example, return just 100 products at once, and ask the user to refine the search if there are more products than that. Then you can create some monitoring system to watch for unusual search queries that look like dictionary attacks. I've actually built a system like this for one client and it worked fairly well (combined with a few more tricks they were already using).

Of course, all of this can be tricked too, it's all a game of cat & mouse, trying constantly to outsmart the other side.


And there goes your accessibility.


I don't believe Google's Recaptcha2 has been broken yet (at least not publicly). Though you can parcel it off to low-wage workers who click on captchas all day long.


Interesting. I've done a lot of scraping and I never did this. Do you have a treatise that explains how it's done? I've actually been experimenting with headless Chrome recently because I like it more than PhantomJS.

Practically speaking I don't think IP detection is useful at all these days, and the only captcha that can't be bypassed is Google's most recent version. The much more successful anti-scraping tactic is to use a sophisticated reverse proxy that analyzes all requests to identify patterns and unusual behavior, regardless of the source IP (because any large-scale scraping will come from many IPs anyway).


It's not terribly hard using Selenium; you can use one of their Docker images (e.g. selenium/standalone-chrome) and connect to that using Webdriver.
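
For instance, with the Node selenium-webdriver client pointed at a selenium/standalone-chrome container, the whole thing is roughly this (a minimal sketch, assuming the container exposes the default port 4444; nothing official):

  // npm install selenium-webdriver
  const { Builder } = require('selenium-webdriver');

  (async () => {
    // Connect to the WebDriver endpoint exposed by the Docker container.
    const driver = await new Builder()
      .forBrowser('chrome')
      .usingServer('http://localhost:4444/wd/hub')
      .build();
    try {
      await driver.get('https://example.com');
      console.log(await driver.getTitle()); // scrape whatever you need here
    } finally {
      await driver.quit();
    }
  })();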

Having said that, it has flaws compared to headless Chrome (more moving parts, terrible security, extra memory usage / dependencies for the JVM to run Selenium) so unless that camouflage is proving crucial, I'd avoid it where possible. I've recently migrated a project to headless Chrome and definitely prefer it.


I'm curious what makes the security worse. The rest, I can see. Though, I have to confess I have been more critical of the whole headless browser binary push than I'd care to acknowledge. I just feel like this is solving a problem that doesn't actually exist anymore. Selenium has worked solidly for quite a while. And you almost certainly should test in multiple browsers/scenarios.


The main security issues I know of so far are:

  - Neither Selenium nor Webdriver (at least the Python client, but I assume others too) supports HTTPS at all.
  - By default it opens browsers configured to accept any SSL certificate. I can see why that's useful for local testing, but it's a terrible choice to default to.
  - It logs way too much. It logs every keypress sent to it, including passwords.

At least the latter two are fixable, but not trivially so; better defaults with easy client-side options to opt into insecure mode would have been much better. The former is not fixable without patching both the client and the server, and in 2017 it's pretty poor to not even have an option for it.


Odd, I'm pretty sure both of the first two are wrong. We used to have to make sure we had legit certs for Selenium to work with HTTPS. That or configure the Firefox profile to have already accepted the self-signed one.

And, really, the first two points are clearly contradictory. I'm guessing I just misunderstand what you mean?

Do you mean the communication with the driver? I'm curious why SSL would be important there. It should just be done locally, and a standard SSH tunnel can help with any remote encryption you might want.


The first two are not contradictory: the first refers to communication between the client and server, and the second to the browser's communication with a remote server. Maybe that wasn't clear from how I wrote it.

On further investigation, I think I was wrong on the second though; it might be that it's built into Chromedriver, not Selenium, which would explain why you didn't have the same issue with Firefox.

SSL between the client & Selenium server would help keep a bad actor on the network from mounting man-in-the-middle attacks on it. I'm not familiar with how I'd set up an SSH tunnel for that; I'm happy to believe it can be done, but it'd be a lot easier if it just supported HTTPS to begin with.


Apologies, I should have relaxed more of my verbiage. I was writing on my phone, and only about halfway through my post did it "click" what you meant.

HTTPS would be an odd choice, if only because I don't want to do any cert management for my Selenium test runner. An encrypted channel makes some sense, but I'm not sure what the best mechanism for the shared secret would be. SSH kind of gets you there, but is not as straightforward, as you noted.

Instead, I'd urge keeping the communication on localhost (or tunneled over SSH, at worst). Preferably with a locked-down security model on the network so that you will see any traffic leaving it.


We both must be misunderstanding, because those first two points are blatantly false as far as I can tell. Selenium can handle(?) invalid SSL certs but the defaults certainly don't freely accept them.

As for the third point... that's why we have DMZs.


> and the only Captcha that can't be bypassed is Google's most recent version

That can also be easily bypassed (I’ve done that, without actually planning to do so) if you can make your behaviour seem completely human.

So if you have a bot that can browse in a way that seems statistically human you can actually get around that. You can still scrape the same datasets – you just need to keep instances of bots, with all cookies around, and have them browse in certain ways. Classify what category a site might likely belong to, put them into buckets of queues, and have bots pull from queues they’re likely to browse, or search for certain terms they’re likely to search for.

In my case, this all happened by accident – I had IRC bots for a few channels; each kept a perpetual session, would visit every site that was linked (to get the title), and would be able to search Google by scraping. One day I was accessing a bot remotely, and told it to access a page that was NoCaptcha protected (because the site wasn't working on my home system), yet it passed the captchas perfectly fine. Tried a few more times, always worked. So I tried figuring out why it worked.


Were the IRC bots able to render JavaScript, or were they using a CLI lib like Python requests?

I thought that a client which doesn't run JavaScript would be a huge red flag.


They used Firefox with off-screen rendering, as otherwise they wouldn’t be able to get the title of many pages – even many blogspot blogs can’t be read without JS anymore.


If you know how to interface with Chrome headless then you're 95% there already. Just run xvfb, start a Chrome instance on the DISPLAY with the appropriate flags (for exposing the debugging port) and you're done.


IP detection, especially IPv4, should still work fine because the cost to get a new IP will often outweigh the benefit of the scraping.

Having said that, just let them scrape?


There exists an entire industry that provides rotating IPs across fleets of proxies for very cheap prices. Proxy services like this make it pretty much impossible to prevent scraping.


>Having said that, just let them scrape?

Depends on who you are. Hotel and airline websites, for example, are pummeled with scrapers wanting pricing and availability info. Letting them scrape with no limits is costly.


And why is that a good thing for them? I mean, if they provided an API to get their pricing, wouldn't it be in their interest? Does anyone buy airline tickets from the airline website ever? I don't. I use sites like Expedia and Priceline. And if an airline is not listed or has shitty prices, I don't buy it. Wouldn't it be in their interest to be listed on there? Moreover, wouldn't they want the other companies to have similar APIs for dynamic pricing?

I get that at one point not having this info out there was a good thing for them. But now the cat is out of the bag. You can't keep pretending that booking sites don't exist.


Lots of wannabe sites with heavy scraping, but no sales.

There are several airlines where 50+% of all sales are on their own website. Lowest distribution cost.


> Does anyone buy airline tickets from the airline website ever?

All other things being equal, I will strongly prefer buying directly from the airline, because I've personally experienced the shifting of responsibility if things go wrong during a flight.

I've also been denied boarding (not in my home country) because the airline claimed that the OTA (Netflights) didn't pay for my ticket, leaving me to spend the night in the airport and having to book a one-way flight with a different airline the next morning.

That was a horrible experience I'm determined to not repeat.


Adding to tyingq's comment, not all scrapers are efficient, either.

Also, if you don't have the rock-bottom cheapest prices, you may not want to cater to aggregators that promote those.


People who have miles or points buy airline tickets directly from the airline.


> Hotel and airline websites, for example, are pummeled with scrapers wanting pricing and availability info.

Yes, and they make it very cumbersome to discourage it. I've successfully written scrapers for airlines. It's significantly more difficult than crawling other websites for a few reasons:

1. Session management is wacky - they really like to manage state entirely through cookies, and you typically need to visit a specific set of pages in a specific sequence before you can access the resources you want, like the number of seats available or their prices.

2. Sessions have time limits because anyone who looks at the seats initiates a "soft" reservation on them (this works in a similar way for theatre, concert and movie seating).

3. You don't usually have nice JSON endpoints, so you'll be doing a lot of HTML parsing (which, given the type of HTML you encounter, can be hell).


>You don't usually have nice JSON endpoints, so you'll be doing a lot of HTML parsing

That is changing. Most scrapers haven't caught on, but the more modern things airlines are pushing out (their mobile sites and native mobile apps) often have really nice REST/JSON API interfaces. The scrapers are often still scraping the old desktop site, which will be the last to get that underpinning.


Yes, whenever I'm looking for a source to crawl I prefer mobile applications for precisely this reason. Request signing and certificate pinning are an upfront annoyance, but the maintainability is far higher.


There are multiple ways to hide your IP for free: public proxies, VPNs which are not so well-known, and lastly a wealth of browser extensions which aim to hide your IP to some extent. This would be one example where your traffic will exit with actual users on consumer ISP lines: http://hola.org/


I'd say don't use Hola. They've had some issues, like selling their users' bandwidth for botnets[1].

If you do some googling, you will see many reasons not to use that extension.

[1]http://www.theverge.com/2015/5/29/8685251/hola-vpn-botnet-se...


What about Tor?


Tor is an easy one to detect. Case in point: try to browse the web with Tor and see how many CloudFlare captchas you have to solve in the first 10 minutes ;)

Also, Tor has public lists of exit nodes (https://torstatus.blutmagie.de/), and unless you exit via a non-listed one you're trivially identifiable.

Lastly, Tor is rather slow, which prohibits using it for large-scale scraping tasks. Plus you get switched around different exits and countries, which might break your scraping logic.


I believe you could use the ExitNodes option to only use a specific exit.

There's no such thing as an unlisted exit node, only unlisted entry nodes (bridges). I agree that Tor isn't a good choice for scraping.


Tor would introduce pretty intolerable latency into a professional data crawling setup.


I'm thinking OP meant if you want to never be detected, but like you said, you usually never get detected anyway.

Bypassing captcha automatically or with captcha bypass services where you pay per captcha completion?


The simplest and most effective trick is to have a link that is not visible to humans, only to crawlers. When an IP ends up on that link, you know it's a crawler. You can then proceed to block it. A honeypot, in other words.

Typically display: none will do, but you can also have white-on-white text with no tabindex, off-screen absolute positioning, etc.
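
A minimal sketch of such a trap (plain DOM JS; the /bot-trap path and the server-side rule are just illustrative, not anything standard):

  // Inject a link no human will ever see or tab to, but that a naive
  // crawler following every href will happily request.
  const trap = document.createElement('a');
  trap.href = '/bot-trap';               // also Disallow this path in robots.txt
  trap.textContent = 'ignore this link';
  trap.tabIndex = -1;                    // keep it out of keyboard navigation
  trap.style.position = 'absolute';
  trap.style.left = '-9999px';           // off-screen, as suggested above
  document.body.appendChild(trap);
  // Server side: flag or block any IP that requests /bot-trap.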


This is harder to do than you think. An industrious developer who works on these things will quickly notice that their distributed crawler has started failing, look through logs and ultimately identify the problem. Then they'll switch to another set of IPs and continue, this time without requesting that link. The other issue is that you need to set an explicit rule that is aware of how each and every API endpoint should be accessed, and whether or not it should logically be directly accessible.

I say this as someone who has had to combat this specific technique - I'd suggest that if you believe it works, it's probably because you saw obvious scraping activity stop when you did it, but you were never aware of the more professional scraping that adapted to it or was never caught by it in the first place.

Whenever I've scraped a website, I am extremely careful not to request more pages than I need. A better method for blocking them is to flag requests for resources that do not proceed in a logical manner. For example, if you have an API endpoint that displays the information scrapers want, that endpoint should have a specific "route" through the user interface. If you find requests directly to that resource without first proceeding through the typical UI flow, that is more accurate for identifying a scraper.

This is still not foolproof, because the scraper can just script requests to the required series of pages in order. But it's a good start for getting rid of most scrapers. The most effective method for getting rid of scrapers is IP-agnostic behavior analysis, because it can catch e.g. scrapers trying to parallelize requests that increment across a proxy farm, or requests that don't obey typical behavior constraints in the UI.
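
A rough sketch of that kind of flow check (Node/Express with express-session; the route names are made up for illustration, and as noted above it can still be scripted around):

  const express = require('express');
  const session = require('express-session');

  const app = express();
  app.use(session({ secret: 'dev-only', resave: false, saveUninitialized: true }));

  // The normal UI page leaves a breadcrumb in the session.
  app.get('/products', (req, res) => {
    req.session.visitedListing = Date.now();
    res.send('<html>...listing UI...</html>');
  });

  // Direct hits on the data endpoint without that breadcrumb get flagged.
  app.get('/api/prices', (req, res) => {
    if (!req.session.visitedListing) {
      return res.status(403).send('Unusual access pattern');
    }
    res.json({ prices: [] });
  });

  app.listen(3000);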


Be careful with that. Users with screen readers may fall into your trap.


Could you have the content of the page explain that it's a bot trap (so screen reader users know it's not worth visiting) and only block if, say, an IP visits it multiple times in a short window?

It wouldn't block specialised scrapers, as they would know to avoid that URL/link (though they'd probably work it out anyway), but it would still limit broader crawlers.


I do something similar to this on PPC campaigns, but not for the purpose of blocking IPs. Bots tend to come in waves from a given site. If more than X% of clients from a given referrer wind up on the fake link (or other techniques show that the clients are bots), my code can automatically suspend the display of ads on the offending site for a period of time. If it continually happens, then the site's owner is the likely culprit, so I have a threshold at which that site is automatically and permanently blacklisted from all of my campaigns.

I have found that advanced bot mitigation is the single most significant determining factor of PPC ROI, which is a sad commentary on the current state of the paid advertising ecosystem.


Can you comment more on the results? Like, how many of your clicks come from bots?


Depends entirely on the niche and traffic source. I have abandoned some niches because even with automated, realtime suspension of campaigns/referrers, there were so many bots (more than 50%) that I couldn't make the niche profitable. There are two primary sources for bots that click on PPC ads: competitors looking to drain budgets, and site owners looking to profit by having bots click on their ads. The latter is the easiest to defend against, since you can simply blacklist their site(s), plus any other sites that are likely on the same server (fortunately, most criminals are pretty lazy/cheap in this regard). I know one marketer that auto-blacklists all domains that have private Whois information after 1 click, and he has done very well with that.

To answer your question, based on my personal experience, for display campaigns (specifically not referring to Google search ads, which tend to have a lower percentage of bot traffic - while Bing is the Wild Wild West)...I'd say overall it's somewhere in the 30% neighborhood. Not all of those have malicious intent, but in PPC, every non-human click on your ads is malicious.

Regardless of niche, realtime bot mitigation is probably the best competitive advantage that one can have in the PPC arena.


Jesus, that's a lot! Did you try with FB ads?


That sounds like a great trick, but I'd want a solution against someone getting large swathes of IPs banned because the bot crawled my site and posted the link somewhere.


Determining link visibility is a standard crawler thing. So you'd need to be way more clever than that to succeed against someone who is moderately clueful.


Safari has navigator.webdriver, and others will soon follow. The idea was first introduced as a Firefox bug, but hasn't gone anywhere.

https://bugzilla.mozilla.org/show_bug.cgi?id=1169290

There's a webdriver protocol mailing list with a long thread that describes how it should be implemented and how people could avoid it; e.g. recompile without the feature. I just can't find it right this minute.
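
The check itself is a one-liner, which is also why it only catches automation that hasn't bothered to patch it out:

  if (navigator.webdriver) {
    // The browser admits it is being driven by WebDriver.
    console.log('automation flag set');
  }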


Yes, it will do nothing against sophisticated botters.

It has been a standard tactic to run browsers in purpose-built VMs. The last bastion was smarter, more disruptive captchas and rendering performance profiling, but even that has been ground down by now.

The situation is even more grave with mobile ads, as you have zero opportunity to shove in captchas, or to rely on a single unique per-system ID or cookie that was vetted outside of the embedded WebKit environment.


I've done similar detections. Basically just use keyboard dynamics: by measuring the precise timing of keystrokes (key down, key up) you can tell if it is likely typed by a user manually.

I haven't done it but I think mouse movements could be used too. Add onMouseMove event listeners all over the page and see how they are triggered.
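
Roughly, the keystroke side looks like this (plain browser JS; the variance threshold is a made-up illustration, not a tuned value):

  const holdTimes = [];
  let lastKeyDown = null;

  document.addEventListener('keydown', () => { lastKeyDown = performance.now(); });
  document.addEventListener('keyup', () => {
    if (lastKeyDown !== null) holdTimes.push(performance.now() - lastKeyDown);
  });

  // Synthetic input tends to show implausibly uniform (or zero) hold times.
  function looksAutomated() {
    if (holdTimes.length < 10) return false;
    const mean = holdTimes.reduce((a, b) => a + b, 0) / holdTimes.length;
    const variance = holdTimes.reduce((a, b) => a + (b - mean) ** 2, 0) / holdTimes.length;
    return variance < 1;
  }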


Yeah, that's why I use Math.random() * 5 and similar stuff in my scrapers.
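
For reference, the jitter bit is just something like this (a minimal sketch; the surrounding scraper loop is assumed):

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  // Wait somewhere between 1 and 6 seconds between actions, never a fixed interval.
  async function humanishDelay() {
    await sleep(1000 + Math.random() * 5000);
  }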

A lot of people access the web through smartphones so only allowing people with mouse movements seems a bit overkill.


Do you also use OCR?

Because I guess you could defeat web scrapers by making the HTML structurally totally different from what is visually present in the browser window.


As has been mentioned elsewhere in the thread, you need to be careful about accessibility and screen reader access.


There are services for solving captchas. Death By Captcha is one.


A long time ago we used Crowbar with xvfb. Nice to see the same method still works.

http://web.archive.org/web/20121127001253/http://simile.mit....


One can also use a microcontroller such as Teensy to script timed keyboard and mouse action.

Demonstrations have usually been by security researchers, but these tiny boards can be used wherever one wants to avoid the labor of repetitive navigating, typing and mouse clicking.

http://samy.pl/usbdriveby


> ... to automate malicious tasks. The most common cases are web scraping...

I really don't think scraping should fall on that list.

There isn't even a consensus in the IT world on whether scraping should be able to be legally restricted.


I came here to say that too. Scraping has a great many legitimate uses. Search engines, scientific research, trying to use publicly available data that doesn't have an API. I've had to scrape government websites quite frequently because they often make public information hard to read by other means.

That last one is an interesting one. I think one of the most effective ways to deter a scraper might be to just provide an API!

Now, if you were using the scraped data to republish it (copyright infringement) or to gain a competitive advantage (re-pricing in eCommerce comes to mind), that is a different story.


This is actually an interesting point. If you implemented an effective scraper-detection API, you'd run the risk of locking out search engines too.

(Though I guess the real-life solution would be both simple and depressing: make an exception for googlebot and don't care about anyone else)


Yeah, I've seen a lot of sites that explicitly state that, including in their robots.txt:

All robots forbidden, except googlebot


... which rule would be obeyed only by legitimate, robots.txt-honoring crawlers. This reminds me of the anti-piracy messages shown (solely) to viewers of legally-purchased media. Similar "logic", similar (counter-productive) "effectiveness".


Which is why some search engines simply ended up following only the directives given to googlebot, and ignoring the rest.


> All robots forbidden, except googlebot

After which people keep claiming google search is better, not just given special treatment.


> re-pricing in eCommerce comes to mind

what's re-pricing?


Re-pricing is the practice of scraping your competitors and pricing your product just a little lower so that in price comparison searches you always show up at the top.


Google also does it all the time...

The author has certainly made his position clear, and I disagree with him too.


You can control Google's scraping by specifying a robots.txt, which most people won't do because they want Google to index them. The article means parties who don't care about what you want.


Google is allowed to do it.

There are laws that allow for indexing of content.


There are? Can you point us to an example?


There aren't. At least, certainly not beyond any particular government-operated websites, and even then. There are contracts and terms of use, though. You can block anyone you want from accessing your site, and tell Google they have to pay for the privilege. Think Twitter a few years back.


I don't think he made a value judgement. He didn't say all scraping is malicious. He just suggested that some scrapers can be malicious.


So again someone wants to punish all the legitimate people using a web site to get some marginal benefit from detecting the remaining <1%. The inevitable false positives don't affect the "malicious" users. Only the legitimate ones. And how much will this bloat the page load by? Adding more code to an already overly large page isn't helping anyone.

Just let the web be the web, and stop trying to control it.


Mentioned this in another comment, but for some websites, the scraping problem has real costs. Airline, hotel, stock prices, etc. For some spaces, scaling and paying bandwidth for unconstrained scraping is costly. And not restricting it hurts the legitimate users, because the performance sucks.

There are also the scrapers blindly looking for vulnerabilities or other unsavory tactics.


While I agree with you to a degree, most airlines and hotels have APIs that can be consumed to get pricing information; there are just restrictions on what you can do with that information.

Not sure about stock prices (I think it's pretty common to pay for real time data there?).

But I can certainly see sites that have a lot of data for their users facing major bandwidth costs if a lot of people were scraping their data. This type of detection isn't really an answer for that, though, as it's easy for a scraper to mitigate.


Also, what I've learned is how little regard for your site your scrapers often have, scraping as aggressively as possible.

You're just not always in a place to scale to the abuse or build something more complex than some simple heuristic filters.


> what I've learned is how little regard for your site your scrapers often have, scraping as aggressively as possible.

Often? Based on what data?

I find it much more likely that you only notice the aggressive scrapers. That, however, tells you nothing about the behavior of the average web scraper, or web scrapers in general.


The system encourages it. Ingress data is cheap, and so many scrapers just default to high frequency.


You're not taking into consideration the economic consequences some bots have on companies. Some bots are designed to make payments with fake or stolen credit cards. Some bots impact people who need to manually check submissions or takedown notices. Obviously I agree that there are legitimate uses for bots and scrapers, but that admittedly low percentage of fraudulent use cases does cause a lot of harm.


Some bots can and will use a real browser, in a real window, opening many sessions, randomising usage patterns to look more human, and continue doing what they already do.


If a human is allowed to do something on a site, it stands to reason that a bot should be allowed to as well (granted, using the same access frequency as a human).

Blocking scraping is like DRM. Don't do it. Use a legal mechanism to deal with copyright infringement, and use an acceptable usage policy to deal with heavy users that are using more than their "fair" share of bandwidth.


Some people just hire actual humans to do this, for cheap.


This looks like a list of bugs that need fixing; ideally, headless Chrome should be completely indistinguishable from ordinary Chrome, so that it gets an identical view of the web.


It depends on the target audience. For Google (and for most people) the goal of Headless Chrome is to offer an easy and feature-complete way of automatically testing websites, e.g. for performance (PWAs are all the rage) and bugs. For those folks, it doesn't matter that you can detect the headless browser; it only matters that it works like the regular one 99% of the time. This is a huge step up from previous technologies like PhantomJS or laborious solutions involving webdriver and many moving components.

In some cases they don't even want it to behave exactly like the regular browser. As soon as your website uses any client-side state (cookies, IndexedDB, HTTP caching, service workers, local storage), you want to have an easy "give me a clean and isolated browsing session" switch like Headless offers.

People scraping the web are not the target audience of this.


But to automate testing, you probably need working locale support, and APIs for images to function. Those items will probably stop working for detection purposes when they fix them.

I wouldn't be that surprised if they added webgl support later as well.


I tend to disagree with your stance on Google's incentives. I'm pretty confident (and have zero factual grounds for this) that it was exactly the prospect of taking over the bot industry with a powerful headless browser that pushed them to release headless Chrome.

Mind you, Google is extremely interested in bots...


And, indeed, even if it wasn't, you could always do your malicious scraping et al using regular Chrome with a remote-control extension or OS Accessibility API-based automation. It's kind of pointless to detect headless browsers specifically, if you'll still have the same problems from automated headed browsers.

Still, I agree that if people are going to try detecting headless Chrome, Chrome should strive to thwart that. The attacks in the OP seem like low-hanging fruit; I was expecting something more akin to timing attacks on how long Ready events take to fire given delays from actual rendering. Writing the code to imitate that would be a fun week.


Leaving aside for a moment that many "malicious" use cases are actually fairly common and totally legitimate.

Headless Chrome is awesome and such a step up from previous automation tools.

The Chromeless project provides a nice abstraction and received 8k stars in its first two weeks on GitHub: https://github.com/graphcool/chromeless


> Beyond the two harmless use cases given previously, a headless browser can also be used to automate malicious tasks. The most common cases are web scraping

I guess I disagree with the premise of this article.

How is web scraping fundamentally malicious?

What rights/expectations can you have that a publicly accessible website you create must be used by humans only?


It puts a load on your server when bots go wild on your site, which in turn affects the experience of legitimate human users of the site.


Since when is web scraping a "malicious task"?


Read the ToS of most websites. :)


It's not as clear cut, and the law doesn't seem to be entirely clear either [0]. Many in the tech world don't agree that it should be [1].

I'd say this would be one of the most controversial things to refer to as 'malicious'.

[0] https://arstechnica.com/tech-policy/2017/07/linkedin-its-ill...

[1] https://news.ycombinator.com/item?id=14891301


There is a difference between reading a ToS and accepting it. I am not obligated to anything just by downloading a webpage from the internet.


Most ToS don't forbid scraping for "malicious tasks", they just don't allow it for any task.


At what point does a ToS stop being enforceable? Would something like "by visiting, you grant all copyright on any materials the visitor has created to this site" hold up? Could you demand 10% of yearly revenue from any business that visits and be legally able to retrieve the funds?


I don't need to accept those to use the site. (in most cases)


If someone wants to scrape your site, they will do it and just find workarounds for your "protection". It is impossible to tell the difference between a real user and an automated scrape request; you can only make their job a bit harder.


True. Then again, the cost for the scraper can be raised significantly by changing your obfuscation / anti-scraping methods frequently. All of a sudden a scraper will need close monitoring to ensure the scripts/regexes are still working, and will likely need a person dedicated to implementing new workarounds as soon as the sites being scraped push out a new obfuscation method.


In which case, why not just provide a paid API? The content provider will then make extra revenue that would otherwise go to the endless arms race.

As others have mentioned, there is nothing (that I know of) that can thwart a motivated and resourceful scraper.


Some services have actually gone down that exact road. pastebin.com is one of them.

But in other cases you simply don't want anyone to be able to extract your info automatically. A good example would be e-commerce sites which don't want anyone to be able to scrape their pricing information at large scale and in real time.


Maybe, but anti-scraping is already a multi-million-dollar industry with active development from startups and bigger names like Akamai.

There are plenty of websites that want scrapers to disappear.


I wonder how many of these were deliberate, and how many were missed. Google has a vested interest in bot detection.

And by releasing headless chrome, they killed off some of the competition. (https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuN...)


Google also has an interest in making an undetectable bot - for their search engine. Undetectable in the sense that GoogleBot should see the exact same page that humans see. They have been using chromium based code, at least on certain sites, for a while now. I wonder if this person is damaging his rankings with these techniques...

Of course, Google announces itself as GoogleBot. It wouldn't surprise me if they did a second stealthy crawl to detect cloaking. (But I think they are honest when they say they don't, and just throw cheap human labor at it instead by having people browse suspect sites.)


> Of course, Google announces itself as GoogleBot. It wouldn't surprise me if they did a second stealthy crawl to detect cloaking. (But I think they are honest when they say they don't, and just throw cheap human labor at it instead by having people browse suspect sites.)

They actually run secondary tests that aren't GoogleBot. They're quite easy to detect on very low traffic sites. If you only have a few hundred users, all of whom you know personally, and suddenly, over the range of a few hours, a few users using Chrome visit the page while it's not linked or findable in any search engine, just shortly after users using the googlebot UA visited it, and with certain usage patterns – it's quite obvious.

Detecting the Android Bouncer's VM is equally easy, although I only found that by accident: due to an automated action, my app crashed in it and submitted an unusual crash report, and I managed to extract parameters that allow detecting it (similarly with other Android virus scanners). I only cared about that so I could split those "devices" into a separate category in the crash tracker (all my apps are GPL licensed, and don't do anything evil anyway).


I don't want to start an argument here, but can someone explain why web scraping is considered malicious?


I don't believe that responsible web scraping is malicious, but my belief relies on the assumption that the web is meant to be open. That is, information put on the internet that is publicly available is considered free to access, store, and later retransmit. The original web was designed for researchers to share their work, while the modern internet built on top of that platform has other moral systems that don't necessarily agree.

Anyway, on a purely technical level, scraping of publicly available content isn't inherently bad unless you're asked to stop, or are scraping so quickly as to cause a service disruption by tying up the target systems. There is nothing malicious about generating normal traffic at the rate of a regular user. The animosity arises from what you plan to do with the data, and whether the entity you're scraping agrees with your usage.


Thanks, that's what seems obvious to me too: it's just public data, and it's possible to collect it without overwhelming the server with requests. I just don't get why someone wouldn't want you to look at their website.


Many websites have ToS's that forbid scraping.


How many of these can be faked with some additional code in Chrome headless?

Regardless, as others are saying, using a complete Chrome or Firefox with webdriver solves all of these, right? Is there a way to detect the webdriver extension? I think that's the only difference from a normal browser.


> How many of these can be faked with some additional code with Chrome headless?

All of them. As soon as you can run some JS code before the page does, every single difference can be monkey-patched. There's no way to distinguish native APIs from fake APIs made by someone that knows all ways of detecting them.


Right yeah, of course.


Or, you know, run Chrome in a real desktop environment. Sure, it might take more resources, but that's definitely cheaper than tracking down the places in Chromium where you need to make the changes.


Is it? I was talking about doing it in a more collective fashion, pushed into the code or done in a way where it would be easier to integrate into Chrome headless.

I personally already do a lot of Chrome and Firefox in a real desktop environment. I love doing it this way. I know I can mimic a real user. It gives me great comfort.

Still, it's only cheaper if you're doing all the tracking down yourself. And I never meant for that to be my point.


> var body = document.getElementsByTagName("body")[0];

You can just use document.body.

I also suggest using a data URL instead. E.g. "data:," is an empty plain text file, which, as you can imagine, won't be interpreted as a valid image.

  let image = new Image();
  image.onerror = () => {
    console.log(image.width); // 0 -> headless
  };
  document.body.appendChild(image);
  image.src = 'data:,';

> In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser

The zoom doesn't affect this. It's always in CSS "pixels".


Shouldn't the first block of code have "HeadlessChrome" instead of just "Chrome" as the search term?


You're right, I changed the code.
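
For reference, the corrected check is along these lines:

  if (/HeadlessChrome/.test(navigator.userAgent)) {
    console.log('Chrome headless detected via the user agent');
  }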


I do hope that these methods get patched. I tend to archive my bookmark collection with Chrome headless to prevent losing content when a site goes offline. I hate it when a website requires me to play special snowflake to scrape it for this purpose.


Dumb question from someone who's written a ton of scrapers and scraping-based "products" for fun:

at what point does it make more sense for companies to just start offering open APIs or data exports? Obviously it would never make sense for a company whose value IS its data, but for retail platforms, auction sites, forum platforms, etc. that have a scraper problem, it seems like just providing their useful data through a more controlled, and optimized, avenue could be worth it.

The answer is probably "never", it's just something that comes to mind sometimes.


The irony of using JavaScript to detect scraping or bots, when the majority of them (those not used to trick ads) never execute any of it because they are a better curl.


Well, if you're determined to prevent scraping, it's rather easy to hide content from non-JS bots: simply pull in the content via Ajax, or "encrypt" it and perform the decryption via JS.

So thinking about how to ward off bots that do go the extra mile makes sense. (From a scrape-protection POV at least)
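
A toy example of the Ajax variant (the endpoint and element names are made up, and base64 stands in for whatever obfuscation you'd actually use):

  document.addEventListener('DOMContentLoaded', async () => {
    const res = await fetch('/api/article-body'); // content never appears in the initial HTML
    const encoded = await res.text();
    document.querySelector('#content').innerHTML = atob(encoded);
  });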


And it's actually getting easier with every new shiny web API. Want to make sure only the latest Chrome can retrieve the content of your website? Why not run a WebAssembly computation that will yield the correct URL to fetch. Or what about a Web Worker? There are endless possibilities, and the only sane way to scrape / index the web in 2017 is a full-fledged browser.


If you try too hard, then you can accidentally hide content from search engines too.


All of these could quite easily be overcome by compiling your own headless chrome. It wouldn't surprise me if there is a fork to this effect soon.


Those who want a more "authentic" experience would do better to use a real normal browser, and control it from outside.


I'd be willing to bet that the missing image size variance is more of a bug or oversight, and is something that will be fixed.


"Beyond the two harmless use cases given previously, a headless browser can also be used to automate malicious tasks. The most common cases are web scraping, increase advertisement impressions or look for vulnerabilities on a website."

Cheating an advertiser I'll grant you, but the other two are 100% legitimate.


"... a headless browser can also be used to automate malicious tasks. The most common cases are web scraping... "

Since when is web scraping considered malicious? Companies like Google make billions because they use web scraping.


What about mining cryptocurrency on a page load as a solution against scrapers?


That's like the people pushing Github PRs which will mine $coin in the CI process. But seriously, you can do that, but scrapers will have short timeouts anyway before they abandon the page or consider it loaded, so there's probably not much to be made in terms of profit.


There are companies that use PoW functions as a deterrent against scrapers. Similar but not mining Bitcoin exactly.


Isn't it possible to detect a bot by tracking events like random mouse movement, scrolling, clicking, etc.? Why aren't these kinds of detection tried in place of captchas, for example?


Because they are easily faked.

Google’s current captcha system does track a few of these, but it mostly takes your browsing history, and, if that seems normal, will accept you.

I’ve run a few IRC bots that allowed people to submit Google searches, and would return the first resulting link. They also fetch any link mentioned in IRC channels, execute the JS, and after a timeout of 400ms respond with the current page title.

Both combined – a normal search history, reading a few hundred pages and videos a day per user – apparently are enough that they seem "human", and can pass NoCaptcha.


Can you guys shut up already?


[flagged]


No, downvotes confirm that you are stigmatizing a good chunk of the readership of this website who will not care to read further into your comment than the first 7 words.


He lacks tact, but has a point.


Thanks. Just to clarify, I believe that if someone posts code snippets like this, purporting to do thing X by doing roundabout and overly simplistic thing Y, you really have to own it and assume tons of responsibility for misleading fellow developers who may be unaware of possible edge cases. When somebody copy-pastes a quick solution and it doesn't work for literally everybody, that translates to real people getting a bad deal when they hit your bugs, something we should all as responsible developers seek to avoid.

And yeah, I will not waste time being tactful when describing this situation, because I have seen it before and it kind of sucks.



