Yeah - there are a ton of immutable URLs on the web. All of CDNJS/JSDelivr/Google Hosted Libraries, most of raw.github.com, all Imgur & Instagram images, all YouTube video streams (excluding annotations & subtitles), and all torrent file caching sites (like itorrents.org). There are probably some large ones I'm forgetting, but just mapping immutable URLs to IPFS could plausibly cover a third of Internet traffic.
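To make the mapping concrete, here's a rough Python sketch of the idea (assumptions: the go-ipfs CLI is on PATH, and the JSON file is just a stand-in for a real URL-to-CID index, not any existing gateway's design):

    import json
    import os
    import subprocess
    import tempfile
    import urllib.request

    def map_immutable_url(url, mapping_path="url_to_cid.json"):
        """Fetch an immutable asset once and record its URL -> CID mapping."""
        data = urllib.request.urlopen(url).read()  # one-time fetch over HTTP
        # Write to a temp file and add it with the go-ipfs CLI;
        # -Q ("quieter") prints only the resulting CID.
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(data)
            tmp_path = tmp.name
        try:
            result = subprocess.run(
                ["ipfs", "add", "-Q", tmp_path],
                capture_output=True, text=True, check=True,
            )
        finally:
            os.unlink(tmp_path)
        cid = result.stdout.strip()
        try:
            with open(mapping_path) as f:
                mapping = json.load(f)
        except FileNotFoundError:
            mapping = {}
        mapping[url] = cid  # placeholder URL -> CID lookup table
        with open(mapping_path, "w") as f:
            json.dump(mapping, f, indent=2)
        return cid

Anyone who trusts the lookup table can then fetch the CID from any IPFS node instead of hitting the origin server.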
IPFS archives is the ongoing effort to archive sites. Eventually there will be a system for automatically scraping and re-publishing content on IPFS.
Right now storage space and inefficiencies in the reference IPFS implementation are the biggest problems I've hit. Downloading sites is easy enough with grab-site, but my 24TB storage server is getting pretty full :( ... Gotta get more disks.
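For what it's worth, the pipeline is roughly this (just a sketch: grab-site normally names its output directory from the URL plus a timestamp, so passing crawl_dir in explicitly is an assumption on my part):

    import subprocess

    def archive_and_publish(url, crawl_dir):
        """Crawl a site with grab-site, then publish the crawl directory over IPFS."""
        # grab-site blocks until the crawl finishes and writes WARCs + logs
        # into a per-crawl directory.
        subprocess.run(["grab-site", url], check=True)
        # Recursively add the crawl output; -Q prints only the root CID.
        result = subprocess.run(
            ["ipfs", "add", "-r", "-Q", crawl_dir],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()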
Say you grab a site. How do you announce that fact, verify that it is an unmodified copy, sync/merge/update copies and deduplicate assets between different snapshots?
Really?
An IPFS user will scrape HTTP content, republish it over IPFS, and maintain a directory of where that non-IPFS content can now be found on IPFS?
I must have missed that part.