
At least it's not mice for once.


Not the same thing at all if atop runs as root and you are a user on that system who has no root access. With a well-prepared exploit you could achieve code execution as root. That's a bit more than a simple denial of service by filling up the disk.


To be honest, it's in good company with real humans there: https://www.behance.net/gallery/35437979/Velocipedia

Maybe it learned from Gianluca's gallery!


We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
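For illustration, that kind of user-agent cutoff can be as simple as the following Python sketch (the Chrome version threshold and the function name are arbitrary picks for the example, not what we actually run):

    import re

    # Arbitrary cutoff for the example: anything claiming Chrome older than v100 is suspect.
    MIN_CHROME_MAJOR = 100
    CHROME_RE = re.compile(r"Chrome/(\d+)\.")

    def claims_stale_chrome(user_agent: str) -> bool:
        """True if the UA string claims an implausibly old Chrome version."""
        match = CHROME_RE.search(user_agent or "")
        if not match:
            return False  # not claiming to be Chrome at all
        return int(match.group(1)) < MIN_CHROME_MAJOR

    # claims_stale_chrome("Mozilla/5.0 ... Chrome/86.0.4240.75 Safari/537.36") -> True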


> dozens of IPs, so every IP just makes 1-2 requests in total

Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.


I'm also affected. I presume that this is per day, not just once, yet it's still fewer requests per IP than a human would often make, so you cannot block it on that basis. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. This is not including several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note, I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)

That's up to 37e6/24/60/60 ≈ 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more; some of them do a few thousand requests per year, some a few dozen. Thankfully they don't unleash the whole IP range on me at once; it occasionally rotates through to new ranges to bypass blocks.
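For reference, the range matching itself is simple to do with the Python stdlib; a minimal sketch using the ipaddress module (the ranges below are placeholder documentation networks, not the actual blocked ranges):

    import ipaddress

    # Placeholder documentation ranges (RFC 5737), not the real blocked ranges.
    BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
        "192.0.2.0/24",
        "198.51.100.0/24",
        "203.0.113.0/24",
    )]

    def is_blocked(ip: str) -> bool:
        """Check whether a request IP falls into any blocked range."""
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in BLOCKED_NETS)

    # Sanity check on the numbers above: 37 million addresses at one request
    # per day each works out to 37_000_000 / 86_400, roughly 430 requests/second.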


Parent probably meant hundreds or thousands of IPs.

Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.

Edit: LOL, didn't read the article until after posting. They mention the Fedora Pagure server getting this traffic from Brazil last week too!

Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.

Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?

I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
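The per-subnet bookkeeping itself is conceptually simple, though; a rough Python sketch (the /24 grouping, the window, and the limit are arbitrary numbers for illustration; in practice this is nginx limit_req/geo territory):

    import time
    from collections import defaultdict

    # Arbitrary numbers for the sketch: at most 60 requests per /24 per minute.
    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_SUBNET = 60

    _hits = defaultdict(list)  # "a.b.c.0/24" -> recent request timestamps

    def subnet_key(ip: str) -> str:
        """Group IPv4 addresses by their /24; IPv6 is left out of this sketch."""
        return ".".join(ip.split(".")[:3]) + ".0/24"

    def allow(ip: str) -> bool:
        """Sliding-window rate limit keyed on the client's /24 instead of the single IP."""
        now = time.monotonic()
        key = subnet_key(ip)
        recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _hits[key] = recent
        return len(recent) <= MAX_REQUESTS_PER_SUBNET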


If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.


That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.


That would allow you to specifically lock out that bot based on its user-agent string. That's the main problem with AI scrapers: many of them use user agents that cannot be easily blocked, so other means have to be found to keep them off your grounds.


There are source repository browsers (git/svn) that are way, way leaner than GitLab and have the same issues. Any repo browser offering a blame view for files can be brought down by those bots' traffic patterns. I have been hosting such repository browsers for 10+ years, and it was never an issue until the arrival of these bots.


Indeed. It's really exposing a major downside to running applications in a browser context. It never really made sense. These applications really don't want public traffic like actual websites do. They should remain applications and stay off the web. But more likely the web will be destroyed to fit the requirements of the applications. Like what Cloudflare, etc., and all this anti-bot social hysteria are doing.


> They're looking for commits because it's nicely chunked, I'm taking a guess.

They're not looking for anything specifically, from what I can tell. If that were the case, they would just clone the git repository, as that would be the easiest way to ingest such information. Instead, they just want to guzzle every single URL they can get hold of. And a web frontend for git generates thousands of those: every file in a repository results in dozens, if not hundreds, of unique links for file revisions, blame views, etc., and many of those are expensive to serve. Which is why they are often put in robots.txt, so everything was fine until the LLM crawlers came along and ignored robots.txt.
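For reference, the kind of robots.txt rules that used to be enough look roughly like this (the paths are made up for the example; the actual layout depends on the frontend, e.g. cgit vs. gitweb):

    # Keep crawlers away from the expensive per-revision views
    User-agent: *
    Disallow: /blame/
    Disallow: /diff/
    Disallow: /commit/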


As someone who is also affected by this: we've seen a manifold increase in requests since this LLM crap started. Many of these IPs come from companies that obviously work with LLM technology, but the problem is that it's hundreds of IPs each doing 1 request, not 1 IP doing hundreds of requests. It's just extremely unlikely that anyone else is responsible for this.


> IPs come from companies that obviously work with LLM technology

Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

> is that it's 100s of IPs doing 1 request

Are all of those IPs within the same ranges or scattered?

Thanks a lot for taking the time to talk about your experience, btw. As someone who hasn't been hit by this, it's interesting to have more details about it before it eventually happens.


> Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

Those are the ones that make it obvious, yes. It's not exclusively that, but it's enough to connect the dots.

> Are all of those IPs within the same ranges or scattered?

The IP ranges are all over the place. Alibaba seems to have tons of small ASNs, for instance.


We have seen competitors (big, well-known apps) do things on iOS that most definitely are not possible with public APIs. Either Apple willingly provides access to these APIs to a select few companies, or they don't care that they reverse-engineer private APIs and then use them. If it's the latter, the competitor app was probably too big to be banned from the App Store for this. Apple was unwilling to comment on the situation when we asked them.


> Its funny how "bad" we are as an industry, making rational choices. Like WHY do NVME SSD not implement TRIM? What is it about them that "TRIM" didn't make sense?

Maybe CrystalDiskInfo is simplifying things (combining TRIM and the mentioned DEALLOCATE command?), but all my NVMe SSDs support TRIM according to it. And it would be really strange if NVMe didn't support any sort of trimming, as SSD performance and health heavily rely on it.

It could also be that what the author is observing is specific to macOS?


RE: TRIM vs DEALLOCATE, that seems to be a naming thing. Wikipedia confirms that it's indeed technically DEALLOCATE on NVMe.

https://en.wikipedia.org/wiki/Trim_(computing)#NVM_Express


The controllers on many USB SATA adapters don’t support TRIM.


That I'm aware of, but the claim is that TRIM does not exist for NVMe drives.

