
This should be made very clear to Cloudflare users, ideally a warning next to the Always Online checkbox.

"Always Online" now can mean "Archive Forever" - even when a site is pre-launch.




From the blog post, an image of the checkbox: https://lh6.googleusercontent.com/J42AtNZv8xNcyQPPefVywiAGEh...


It always has. If your site is publicly available and you don't disallow bots through robots.txt, they can crawl it at any time. Even if the site is "pre-launch", because that doesn't mean anything on its own.


And of course, remember that robots.txt is only a signal to benevolent bots which respect it. If you have secrets to keep, don't put them online in the first place.
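For reference, a robots.txt that asks all crawlers to stay away from the whole site is just two lines served at the site root (this is the standard exclusion syntax; as the comment notes, only well-behaved bots honor it):

```
# /robots.txt - request that all crawlers skip the entire site
User-agent: *
Disallow: /
```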


Looking at my Splunk logs and then asking a lot of questions, I have learned that there are a LOT of not so benevolent bots that must be tolerated anyway.

Benevolence is a continuum.


In fact robots.txt is a list of things a nefarious crawler will absolutely want to examine - no need to guess those paths when they're all laid out for you!


Or properly authenticate (and audit) access.


The Web Archive has completely ignored robots.txt for a few years now. They did it on purpose.


I have serious issues with this and the fact that site owners have to email a human support team in archive.org to be excluded.


I'd just add that, while major players like the Internet Archive do respect robots.txt, it's essentially just a flag that depends on people voluntarily respecting it. If a site is publicly available but you don't want people to find it, you're just relying on security through obscurity.


The Internet Archive stopped respecting robots.txt in 2017. See https://boingboing.net/2017/04/22/internet-archive-to-ignore...


Yeah, I definitely expect this to bite some people, if I'm understanding correctly. A plausible scenario (among many) would be: soft launch a site, show it to some early stakeholders, have Wayback archive everything via Always Online, fix embarrassing screwups or oversharing in soft-launched version, publicize site more broadly, everyone in the world can rewind to version zero, regrets. I don't think the existing warnings really make clear that a soft launch is now a forever launch.


The solution to this is... robots.txt. Otherwise your site might turn up in Google etc. Since it's archive.org that's doing the crawling and they respect robots.txt it won't get archived.


Archive.org does not respect robots.txt IIRC. I’ve run into this problem before with them. Ironically, I ended up blocking Internet Archive’s ASN using Cloudflare.

EDIT: Internet Archive started ignoring robots.txt in 2017: https://www.digitaltrends.com/computing/internet-archive-rob...
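Blocking by ASN as described can be done with a Cloudflare firewall rule expression on the `ip.geoip.asnum` field. The ASN below is the one commonly attributed to the Internet Archive, but verify the current number before deploying - this is a sketch, not a tested rule:

```
(ip.geoip.asnum eq 7941)
```

Pairing this with a "Block" action in a Cloudflare firewall rule would drop requests originating from that network.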


They only started ignoring robots.txt on US government websites (as that article also says)


That is not what the article says.

It says Internet Archive had already started ignoring robots.txt on US government websites.

Now (since 2017) they ignore it on all websites.


I think that's fine. The reason we fix screwups is so the next people who arrive don't see them. We don't fix screwups to hide that sometimes we fuck up. If someone goes out of their way to find old screwups, then so be it. As long as the majority of people don't see it, we're mostly fine.


Most people password-protect this. It's very common. If you contract a webdev for something, they will recommend it 100%. Not the full basic auth thing, just a shared secret. Something trivial.
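A minimal sketch of the "shared secret" gate described above - the names here (`LAUNCH_SECRET`, `is_allowed`) are illustrative, not any particular framework's API; in practice the secret would live in configuration and the token would arrive via a cookie or query parameter:

```python
import hmac

# Hypothetical pre-launch secret; store in config, never in source.
LAUNCH_SECRET = "not-a-real-secret"

def is_allowed(provided_token: str) -> bool:
    """Gate access behind a single shared token.

    hmac.compare_digest does a constant-time comparison, so response
    timing doesn't leak how much of the token matched.
    """
    return hmac.compare_digest(provided_token, LAUNCH_SECRET)
```

Anything behind this gate is still publicly reachable if the token leaks, but it keeps crawlers (Wayback included) from archiving a soft-launched site by accident.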




