
This should be made very clear to Cloudflare users, ideally a warning next to the Always Online checkbox.

"Always Online" now can mean "Archive Forever" - even when a site is pre-launch.




From the blog post, an image of the checkbox: https://lh6.googleusercontent.com/J42AtNZv8xNcyQPPefVywiAGEh...


It always has. If your site is publicly available and you don't disallow bots through robots.txt, they can crawl it at any time. Even if the site is "pre-launch", because that doesn't mean anything on its own.


And of course, remember that robots.txt is only a signal to benevolent bots which respect it. If you have secrets to keep, don't put them online in the first place.
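For reference, a robots.txt that asks all crawlers to stay away from the whole site is just two lines served at the site root (this is the standard exclusion syntax; as the comment notes, only well-behaved bots honor it):

```
# /robots.txt - request that all crawlers skip the entire site
User-agent: *
Disallow: /
```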


Looking at my Splunk logs and then asking a lot of questions, I have learned that there are a LOT of not so benevolent bots that must be tolerated anyway.

Benevolence is a continuum.


In fact robots.txt is a list of things a nefarious crawler will absolutely want to examine - no need to guess those paths when they're all laid out for you!


Or properly authenticate (and audit) access.


The Web Archive has completely ignored robots.txt for a few years now. They did it on purpose.


I have serious issues with this and the fact that site owners have to email a human support team in archive.org to be excluded.


I'd just add that, while major players like the Internet Archive do respect robots.txt, it's essentially just a flag that depends on people voluntarily respecting it. If a site is publicly available but you don't want people to find it, you're just relying on security through obscurity.


The Internet Archive stopped respecting robots.txt in 2017. See https://boingboing.net/2017/04/22/internet-archive-to-ignore...


Yeah, I definitely expect this to bite some people, if I'm understanding correctly. A plausible scenario (among many) would be: soft launch a site, show it to some early stakeholders, have Wayback archive everything via Always Online, fix embarrassing screwups or oversharing in soft-launched version, publicize site more broadly, everyone in the world can rewind to version zero, regrets. I don't think the existing warnings really make clear that a soft launch is now a forever launch.


The solution to this is... robots.txt. Otherwise your site might turn up in Google etc. Since it's archive.org that's doing the crawling and they respect robots.txt it won't get archived.


Archive.org does not respect robots.txt IIRC. I’ve run into this problem before with them. Ironically, I ended up blocking Internet Archive’s ASN using Cloudflare.

EDIT: Internet Archive started ignoring robots.txt in 2017: https://www.digitaltrends.com/computing/internet-archive-rob...
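Blocking by ASN as described can be done with a Cloudflare firewall rule expression on the `ip.geoip.asnum` field. The ASN below is the one commonly attributed to the Internet Archive, but verify the current number before deploying - this is a sketch, not a tested rule:

```
(ip.geoip.asnum eq 7941)
```

Pairing this with a "Block" action in a Cloudflare firewall rule would drop requests originating from that network.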


They only started ignoring robots.txt on US government websites (as that article also says)


That is not what the article says.

It says Internet Archive had already started ignoring robots.txt on US government websites.

Now (since 2017) they ignore it on all websites.


I think that's fine. The reason we fix screwups is so the next people who arrive don't see them. We don't fix screwups to hide that sometimes we fuck up. If someone goes out of their way to find old screwups, then so be it. As long as the majority of people don't see it, we're mostly fine.


Most people password-protect this. It's very common. If you contract a webdev for something, they will recommend it 100%. Not the full basic auth thing, just a shared secret. Something trivial.
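A minimal sketch of the "shared secret" gate described above - the names here (`LAUNCH_SECRET`, `is_allowed`) are illustrative, not any particular framework's API; in practice the secret would live in configuration and the token would arrive via a cookie or query parameter:

```python
import hmac

# Hypothetical pre-launch secret; store in config, never in source.
LAUNCH_SECRET = "not-a-real-secret"

def is_allowed(provided_token: str) -> bool:
    """Gate access behind a single shared token.

    hmac.compare_digest does a constant-time comparison, so response
    timing doesn't leak how much of the token matched.
    """
    return hmac.compare_digest(provided_token, LAUNCH_SECRET)
```

Anything behind this gate is still publicly reachable if the token leaks, but it keeps crawlers (Wayback included) from archiving a soft-launched site by accident.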




