
From reading Drew DeVault's angry post from earlier this week, my take is that it's not only poorly implemented crawlers; it's also that it's cheaper to scrape than to keep copies on hand. Effectively, these companies are outsourcing the storage of "their" training data to everyone on the internet.

Ideally a site would get scraped once, and then the scraper would check whether the content has changed, e.g. via ETag, while also learning how frequently the content changes. So rather than hammering some poor personal git repo over and over, it would learn that Monday is a good time to check for changes and then back off for a week.
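
A rough sketch of that kind of adaptive revisiting, assuming Python; the interval numbers are arbitrary, and the "changed" flag would come from an ordinary conditional GET (ETag / If-None-Match), not anything a real crawler is known to do:

    from datetime import datetime, timedelta, timezone

    # Per-URL crawl state: revisit interval plus the time of the next check.
    state = {}  # url -> {"interval_days": ..., "next_check": ...}

    def schedule_next_check(url, changed):
        """Widen the revisit interval when nothing changed, tighten it when it did."""
        s = state.setdefault(url, {"interval_days": 1})
        if changed:
            s["interval_days"] = max(1, s["interval_days"] // 2)
        else:
            s["interval_days"] = min(28, s["interval_days"] * 2)
        s["next_check"] = datetime.now(timezone.utc) + timedelta(days=s["interval_days"])
        return s["next_check"]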




That seems crazy - millions of dollars on GPUs, but they can't afford some cheap storage? And scraping directly over the network seems super high latency. Although I guess a massive pretraining run might cycle through the corpus very slowly. Dunno, sounds fishy.


I see ChatGPT's bots pull down all of my Python wheels every couple of weeks.

Wheels that haven't changed in years, with a "Last-Modified" and "ETag" that haven't changed.

The only thing that makes sense to me is that it's cheaper for them to re-pull and re-analyze the data than to develop a cache.


Or could it be, just possibly (gasp), that some of the devs at these "hotshot" AI companies are ignorant or lazy or pressured enough that they don't do such normal checks? Wouldn't be surprised if so.


You think they do cache the data but don't use it?

For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.


>You think they do cache the data but don't use it?

That's not what I meant.

And it's not "they", it's "it":

i.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. All you have to do is check it and act accordingly, i.e. download the changed resource or skip the unchanged one.

Google:

http header last modified

and look for the ETag link too.
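
A minimal sketch of that check with Python's standard library; the validators are whatever you saved from the previous fetch, and this is just illustrative, not how any particular bot works:

    import urllib.request, urllib.error

    def fetch_if_changed(url, etag=None, last_modified=None):
        """Conditional GET: only download if the server says the resource changed."""
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)               # ETag from the last fetch
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)  # Last-Modified from the last fetch
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
                return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:  # Not Modified: keep the cached copy
                return None, etag, last_modified
            raise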



It's not that they can't afford storage.

It's that not paying for it means they can increase their profit numbers just a skosh more.

And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.



