> I will often get "fingerprinted" with a very valid browser, then continue my work after treating their site as an API; ie: talking directly to the backend (or rendered HTML page) and no one else.

This seems like just the kind of behavior an ML approach could easily identify. Even feeding a trained model your basic request-log data would show you as quite a different kind of user: you're not fetching images, JavaScript, etc., and would have a substantially different traffic profile. Obviously you could get around that by scripting a browser, but that just kicks the can down the road: a scripted browser will still likely behave in measurably different ways than a human-driven one, and the specifics of those differences are unlikely to be found by intuition but rather by ML, which can home in on things we wouldn't think of. For example, the time spent on certain pages/activities, or the position of the cursor on links being clicked (when you tell a headless browser to click a link, do the mouse coordinates in the click event look normal, or are they at the upper-left corner of the link's position?). The more data you feed such an approach, the more details it can use to find anomalies that differentiate humans from bots.
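To illustrate the request-log point: even before any ML, a single hand-picked feature separates the two traffic profiles described above. A real browser fetches the HTML plus its images/JS/CSS, while a client talking straight to the backend fetches almost nothing but HTML or JSON. A minimal sketch (the log data and client names here are made up for illustration):

```python
# Hypothetical sketch: per-client "asset ratio" from a basic request log.
# In practice this would be one feature among many fed to a trained model.

ASSET_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif", ".svg", ".woff2")

def asset_ratio(requests_by_client):
    """Return, per client, the fraction of requests for static assets."""
    ratios = {}
    for client, paths in requests_by_client.items():
        assets = sum(1 for p in paths if p.lower().endswith(ASSET_EXTENSIONS))
        ratios[client] = assets / len(paths)
    return ratios

# Illustrative log: a browser-driven user vs. a backend-only scraper.
log = {
    "browser-user": ["/home", "/app.js", "/style.css", "/logo.png", "/about"],
    "api-scraper":  ["/home", "/about", "/products", "/products?page=2"],
}
print(asset_ratio(log))  # browser-user: 0.6, api-scraper: 0.0
```

A model trained on many such features (timing between requests, click coordinates, scroll behavior) would catch anomalies no single threshold could.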



Totally agree, but the thing is, it takes a TON of sophistication to pull that off and block proactively, to the point that I'd say the only people I've encountered who employ methods like that are Amazon. Typically you can just spoof a crawler's user agent and it gets you by just fine.
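Spoofing a crawler as described above usually amounts to sending a crawler-style User-Agent header. A minimal sketch using only the standard library (the UA string is Googlebot's published one; note that careful sites verify real Googlebot via reverse DNS, so this only passes naive checks):

```python
import urllib.request

# Googlebot's published desktop User-Agent string.
CRAWLER_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")

def fetch_as_crawler(url):
    """Fetch a URL while presenting a crawler User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": CRAWLER_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Building the request (no network needed to inspect the header):
req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": CRAWLER_UA})
print(req.get_header("User-agent"))
```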



