Come to think of it, we've moved offices 3 times since then, so it must've been 8-10 years ago. I don't think I had to do any special trickery; I spent only an afternoon or so writing and testing the code. I didn't realize such a thing would be impossible now - what a shame. I downloaded several gigabytes iirc - a large amount at the time.
Though nowadays you could use Common Crawl to get the dataset and use existing tools to extract such files, right? (I've no idea if that's a practical thing to do or not.)
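For what it's worth, here is a rough sketch of what the filtering step might look like. It assumes you already have records from the Common Crawl URL index as JSON lines, with string fields `url`, `status`, `filename`, `offset` and `length`; the function name and the `.xyz` extension are hypothetical placeholders, not anything from the original project.

```python
import json


def filter_index_records(lines, extension):
    """Yield (url, filename, offset, length) for successful captures
    whose URL path ends with the given extension.

    Assumes each line is one JSON index record with string-valued
    "status", "offset" and "length" fields (an assumption about the
    index output format).
    """
    wanted = extension.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip redirects, errors, and anything else that is not a plain 200.
        if record.get("status") != "200":
            continue
        url = record.get("url", "")
        # Ignore query string and fragment when checking the extension.
        path = url.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(wanted):
            yield url, record["filename"], int(record["offset"]), int(record["length"])
```

Each hit gives the archive file plus a byte offset and length, so in principle only that slice needs to be downloaded rather than the whole archive.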
I guess so, if they "look" at the web the same way Google does (respecting robots.txt, nofollow etc - which Wikipedia says they do). But the interesting things are found in nooks and crannies where nobody else has thought of looking before - so relying on someone else to do the heavy lifting is probably the wrong way to go about it...