
From reading Drew DeVault's angry post from earlier this week, my take is that it's not only poorly implemented crawlers; it's also that it's cheaper to scrape than to keep copies on hand. Effectively, these companies are outsourcing the storage of "their" training data to everyone on the internet.

Ideally a site would get scraped once, and then the scraper would check whether the content has changed, e.g. via ETag, while also learning how frequently the content changes. So rather than hammering some poor personal git repo over and over, it would learn that Monday is a good time to check for changes and then back off for a week.
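
A rough sketch of that kind of adaptive revisiting, assuming Python; the interval numbers are arbitrary, and the "changed" flag would come from an ordinary conditional GET (ETag / If-None-Match), not anything a real crawler is known to do:

    from datetime import datetime, timedelta, timezone

    # Per-URL crawl state: revisit interval plus the time of the next check.
    state = {}  # url -> {"interval_days": ..., "next_check": ...}

    def schedule_next_check(url, changed):
        """Widen the revisit interval when nothing changed, tighten it when it did."""
        s = state.setdefault(url, {"interval_days": 1})
        if changed:
            s["interval_days"] = max(1, s["interval_days"] // 2)
        else:
            s["interval_days"] = min(28, s["interval_days"] * 2)
        s["next_check"] = datetime.now(timezone.utc) + timedelta(days=s["interval_days"])
        return s["next_check"]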




That seems crazy - millions of dollars on GPUs, but they can't afford some cheap storage? And scraping directly over the network seems super high latency. Although I guess a massive pretraining run might cycle through the corpus very slowly. Dunno, sounds fishy.


I see ChatGPT's bots pull down all of my Python wheels every couple of weeks.

Wheels that haven't changed in years, with a "Last-Modified" and "ETag" that haven't changed.

The only thing that makes sense to me is that it's cheaper for them to re-pull and re-analyze the data than to develop a cache.


Or could it be, just possibly (gasp), that some of the devs at these "hotshot" AI companies are ignorant or lazy or pressured enough that they don't do such normal checks? Wouldn't be surprised if so.


You think they do cache the data but don't use it?

For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.


>You think they do cache the data but don't use it?

That's not what I meant.

And it's not "they", it's "it":

i.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. All you have to do is check it and act accordingly, i.e. download the changed resource or skip the unchanged one.

Google:

http header last modified

and look for the ETag link too.
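
A minimal sketch of that check with Python's standard library; the validators are whatever you saved from the previous fetch, and this is just illustrative, not how any particular bot works:

    import urllib.request, urllib.error

    def fetch_if_changed(url, etag=None, last_modified=None):
        """Conditional GET: only download if the server says the resource changed."""
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)               # ETag from the last fetch
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)  # Last-Modified from the last fetch
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
                return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:  # Not Modified: keep the cached copy
                return None, etag, last_modified
            raise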



It's not that they can't afford storage.

It's that not paying for it means they can increase their profit numbers just a skosh more.

And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.



