I'm not sure you would like the results of what you suggest - if you are really going to crawl everything indiscriminately, you will end up with a lot of rubbish. Just check out Common Crawl if you want to get an idea of what it would look like.
> Just check out Common Crawl if you want to get an idea of what it would look like.
It has a lot of rubbish, sure, but the reason that matters with Common Crawl is that Common Crawl isn't a continuous stream; it's a big monthly 100TB incremental deliverable that makes up part of an even larger multi-petabyte whole dataset, where "using the Common Crawl dataset" mostly means relying on one of a few IaaS providers who've grabbed the whole thing and unpacked it into their serverless-data-warehouse cluster that you can run map-reduce jobs against.
A given consumer of this hypothetical web-scraping-results "firehose via a data lake" API, meanwhile, wouldn't need to drink from the entire firehose in order to "follow" live data. For many purposes, they could instead just drink from the much-lower-pressure URLs queue, to discover what has been scraped; and then schedule fetching just those things [or rather, the domain-and-time-bucketed archive-chunks that contain just those things].
Which, for many consumers, might end up a low-bandwidth-enough affair that the data could be delivered to them over the regular public Internet, without needing them to "move compute to data."
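To make that concrete, here's a rough sketch of what such a queue-follower could look like. Every endpoint, field name, and chunk key below is invented for illustration; the point is just the shape of the thing, i.e. that you only pull the archive-chunks whose domains you actually care about:

    import time
    import requests

    QUEUE_URL = "https://scrape-provider.example/v1/discovered"   # hypothetical endpoint
    CHUNK_URL = "https://scrape-provider.example/v1/chunks/{key}"  # hypothetical endpoint
    DOMAINS_I_CARE_ABOUT = {"example.org", "example.net"}

    cursor = None
    while True:
        # Poll the low-pressure discovery queue; each entry says "this URL was
        # scraped, and its archived response lives in this domain-and-time-bucketed chunk".
        resp = requests.get(QUEUE_URL, params={"cursor": cursor}, timeout=30)
        resp.raise_for_status()
        page = resp.json()

        wanted_chunks = {
            entry["chunk_key"]
            for entry in page["entries"]
            if entry["domain"] in DOMAINS_I_CARE_ABOUT
        }

        # Fetch only those chunks, over the plain public Internet.
        for key in wanted_chunks:
            chunk = requests.get(CHUNK_URL.format(key=key), timeout=300)
            chunk.raise_for_status()
            with open(key.replace("/", "_"), "wb") as f:
                f.write(chunk.content)

        cursor = page.get("next_cursor")
        if not page["entries"]:
            time.sleep(60)  # nothing new yet; check back later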
Consumers might still need a copy of the entire dataset to backfill their indexing system initially — and this might still require doing the "colocate to the IaaS cluster where the dataset is, and run a map-reduce job" thing — but that'd be a one-time bootstrapping process, not a periodic job that needs to be reliable.
(In fact, since it's so rare, the scraping-service provider could even take responsibility for running these jobs themselves, as a sort of single-shot PaaS. "Subscribe to the firehose and we'll help you to do a one-time map-reduce over our dataset to backfill your index, all costs on us. Just define a job using this here SDK and upload it to our dashboard; it'll be queued to run on our infra; and when it's done, you'll get emailed a link to an object-store snapshot of the ephemeral data warehouse the job populated.")
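For illustration, the backfill job itself could be as small as the sketch below. The record shape is assumed, and the toy local driver is just a stand-in for whatever the provider's hypothetical SDK and infra would actually do with your map/reduce functions:

    import json
    from collections import defaultdict

    def map_record(record):
        # record: one archived fetch, assumed to look like
        # {"url": ..., "domain": ..., "content_type": ..., "body": ...}
        if record.get("content_type", "").startswith("text/html"):
            yield record["domain"], 1  # e.g. seed a pages-per-domain index

    def reduce_records(domain, counts):
        return domain, sum(counts)

    def run_backfill(chunk_paths):
        """Toy driver: map over every record in every chunk, then reduce by key."""
        grouped = defaultdict(list)
        for path in chunk_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:  # assume one JSON-serialised record per line
                    for key, value in map_record(json.loads(line)):
                        grouped[key].append(value)
        return dict(reduce_records(k, v) for k, v in grouped.items())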
---
Also, to be clear, I wasn't intending to describe an infrastructure whose output could be used directly as the index of a search engine. It'd be quite useless for that, just as Common Crawl would be. Such a dataset still needs curation.
It's just that, as with Common Crawl, the curation step should rightfully be the (direct, B2B) consumer's responsibility — because there are many different use-cases such data can be put to, each requiring a different curation strategy:
• general whole-web search engines (obviously)
• site-specific search engines
• vertical-specific search engines (think: Google Scholar; FrogFind)
• format-specific global aggregators (e.g. a PubSubHubbub gateway that pre-discovers RSS feeds; a Matrix server that discovers and suggests other Matrix servers; or that old idea of an "Internet Yellow Pages" built out of people's VCard-RDF-microformat contact data embedded in XHTML — but now extended to the proprietary pseudo-microformats of various "about me" landing-page services)
• "see previous versions" services like the Wayback Machine (taking advantage of the immutability of the historical HARs in the data stream)
• a Shodan-like deep-web "discover what doesn't want to be discovered" service, surfacing websites with "Disallow: *" robots.txt rules
• web analytics (like you can do with Common Crawl, but live, using scalable OLTP methods)
• continuous updating of ML models with "up to date" knowledge of the world (at least, once we figure out how to continuously train ML models)
There's really a lot you can do with what's essentially a periodic high-level packet dump of the result of poking every URL you can find, as often as it is willing to let you.
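To show how different those curation strategies can be, here's one tiny consumer-side pass over the assumed HAR-like records: picking out RSS/Atom feeds for the format-specific-aggregator use-case. (The record shape is made up, not anything a real provider emits; a whole-web search engine or a Wayback-style service would do something completely different over the same stream.)

    import json

    def looks_like_feed(record):
        ctype = record.get("response_content_type", "")
        return ctype.startswith(("application/rss+xml", "application/atom+xml"))

    def extract_feed_urls(archive_chunk_path):
        """Yield URLs in one archive chunk whose responses were RSS/Atom feeds."""
        with open(archive_chunk_path, encoding="utf-8") as f:
            for line in f:  # assume one JSON-serialised record per line
                record = json.loads(line)
                if looks_like_feed(record):
                    yield record["url"]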
I took the page down as it was attracting the wrong sort of attention. As some commenters surmised, the goal was to promote the search engine, but it wasn't working out that way...
Here's an alternative for streaming services that I've been trying out: buy second hand DVDs, rip them, then serve them with Plex. You can get 5 DVDs for £10 which is more than I can watch in a month, and less than I'd pay for Netflix, and the choice is huge, even if I restrict myself to these cheap ones.
Hi HN! Would love to get your feedback on this idea and the feature. It is super early and there are lots of issues with it, but the basic idea is there.
I worked on an idea some years ago for a couple of months before putting it up on a shelf (it was beyond my capabilities), once it became clear the workable way forward was for sites themselves to identify which labels best covered each page.
Nonetheless I slowly deduced that, clear spam aside, people would be saved a lot of time in searches if two main types of site could easily be identified in search results, so that they could either include or exclude those results depending on the nature of their search.
The first is the billboard or banner type, where a business has thrown up a large-looking site but really has no working data apart from address, contact, and about info; really it's just a quick summary of their organisation or company.
The second is what I refer to as redirection-type sites: sites that don't actually have any (or much) data of their own, and are just coasting on already-existing services [this might have caught people who refashion Google Maps with additional overlays, but so many now do not], or are an indirect way for parent services to push themselves out through child sites. I'm one who'd search excluding both if I'm after hard information. Generally people can use regular searches to get addresses and contact phone numbers for physical sales and service outlets.
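To give a feel for what I mean, the labelling could start from heuristics as crude as the sketch below; the page counts, word counts, and link thresholds are purely my own guesses, not anything a real search engine exposes:

    def classify_site(pages):
        """pages: one crawled site's pages, each {"text": str, "external_links": int}."""
        total_words = sum(len(p["text"].split()) for p in pages)
        external_links = sum(p["external_links"] for p in pages)

        # Billboard/banner type: a handful of thin pages (address, contact, about).
        if len(pages) <= 5 and total_words < 1500:
            return "billboard"

        # Redirection type: little text of its own, mostly pointing at a parent
        # service or at embedded third-party data (maps, listings, etc.).
        if total_words < 3000 and external_links > 10 * len(pages):
            return "redirection"

        return "other"

A search UI could then offer "exclude billboard/redirection results" as a toggle for people after hard information.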