It always has. If your site is publicly available and you don't disallow bots through robots.txt, they can crawl it at any time. That's true even if the site is "pre-launch", because that label doesn't mean anything on its own.
And of course, remember that robots.txt is only a signal to benevolent bots which respect it. If you have secrets to keep, don't put them online in the first place.
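To make the "only a signal" point concrete, here is a minimal sketch of the check a well-behaved crawler runs before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the URL are made up for illustration. The key thing is that the crawler itself makes this call; a bot that doesn't care just skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a pre-launch site that asks all bots to stay out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def polite_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a robots.txt-respecting crawler would decide for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


# A compliant bot asks first and backs off; nothing on the server enforces this.
print(polite_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/draft"))
# → False
```

Nothing here runs on the server side, which is why robots.txt can't protect anything that actually needs to stay private.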
Looking at my Splunk logs and then asking a lot of questions, I have learned that there are a LOT of not-so-benevolent bots that must be tolerated anyway.
In fact, robots.txt is a list of things a nefarious crawler will absolutely want to examine - no need to guess those paths when they're all laid out for you!
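For a sense of how little effort that takes, here is a rough sketch that pulls every `Disallow` path out of a robots.txt file. The sample file and its paths are invented for illustration, and the parsing is deliberately simplistic (it ignores which `User-agent` group a rule belongs to).

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, regardless of user-agent group."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt; these paths are made up.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/   # not ready yet
Disallow: /backups/
"""

print(disallowed_paths(SAMPLE))
# → ['/admin/', '/staging/', '/backups/']
```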
I'd just add that, while major players like the Internet Archive do respect robots.txt, it's essentially just a flag that depends on people voluntarily respecting it. If a site is publicly available but you don't want people to find it, you're just depending on security through obscurity.
Yeah, I definitely expect this to bite some people, if I'm understanding correctly. A plausible scenario (among many) would be: soft launch a site, show it to some early stakeholders, have Wayback archive everything via Always Online, fix embarrassing screwups or oversharing in soft-launched version, publicize site more broadly, everyone in the world can rewind to version zero, regrets. I don't think the existing warnings really make clear that a soft launch is now a forever launch.
The solution to this is... robots.txt. Otherwise your site might turn up in Google etc. Since it's archive.org that's doing the crawling and they respect robots.txt it won't get archived.
Archive.org does not respect robots.txt IIRC. I’ve run into this problem before with them. Ironically, I ended up blocking Internet Archive’s ASN using Cloudflare.
I think that's fine. The reason we fix screwups is so the next people who arrive don't see them. We don't fix screwups to hide that sometimes we fuck up. If someone goes out of their way to find old screwups, then so be it. As long as most people don't see them, we're mostly fine.
Most people password-protect this. It's very common. If you contract a webdev for something, he will recommend it for you 100%. Not the basic auth thing, just a shared secret. Something trivial.
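A shared-secret gate like that can be sketched in a few lines. This is only an illustration under assumptions: the `key` query parameter, the `SHARED_SECRET` value, and the `require_secret` wrapper are all hypothetical names, and a real setup might use a cookie or a header instead so the secret doesn't end up in URLs and logs.

```python
import hmac
from urllib.parse import parse_qs

SHARED_SECRET = "change-me"  # hypothetical value; share it with stakeholders only


def require_secret(app, secret=SHARED_SECRET):
    """Wrap a WSGI app so requests without ?key=<secret> get a 403."""
    def gated(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        supplied = query.get("key", [""])[0]
        # Constant-time comparison avoids leaking the secret through timing.
        if hmac.compare_digest(supplied.encode(), secret.encode()):
            return app(environ, start_response)
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    return gated


def site(environ, start_response):
    """Stand-in for the real pre-launch site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"pre-launch page\n"]


gated_site = require_secret(site)
```

Unlike robots.txt, this is enforced by the server, so a crawler or archiver that ignores the rules still only ever sees the 403.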
"Always Online" now can mean "Archive Forever" - even when a site is pre-launch.