Exactly. The database syntax is the same. You're still writing SQL queries against the SQLite engine. I'm just replacing disk-access with network-piece-access.
TorrentPeek goes one step further and puts this idea to use in sites, where interactions on an HTML page can be programmed to perform queries. Instead of AJAX calls, you get these sqltorrent calls.
The main reason that there are no binaries for Windows/Linux it's because I don't have these machines to compile the project against. I will eventually create binaries for all platforms. Letting it compile by node (node-gyp) would require me to distribute all of libtorrent/sqlite/boost sources which could be feasible - I'll investigate on that.
It needs a few libraries (libtorrent, sqlite) which seem to be packaged as binaries for macOS only - build them anew for your platform and it will probably run.
Is there indication when the result list for a given query is 'complete'?
Otherwise if the search result don't look like what the person wanted, how long should they wait to see if it pops in? Until the whole DB is downloaded?
Author here. Yes the SQLite engine returns an SQLITE_OK value when the query is complete, just like if it were a database file on disk.
The insight here is that we can tell SQLite (via VFS) to talk torrent-pieces rather than disk-pages, so we can essentially query torrents (torrents containing an SQLite database) on demand.
I'm guessing, since the DB is just a torrent, you can check if it's complete.
In fact if you check the gif in the post, you can see after the DB's name, it shows how much % it downloaded. So, there's your answer. When you get to 3 digits, you could be sure that no new results are going to show up.
Do you think something like this could be used for some form of a distributed blob store that you can query? Things like package managers could use this effectively.
Yep. The SQLite database can virtually contain anything. TorrentPeek is about sites, but there's nothing stopping you from using it to store information about software packages.
It's a perfect idea for package managers. And they could expand on it, like forcing clients to download a certain number of random pieces depending on the amount of seeds to make sure at least a few other clients have each piece, keeping the whole dataset healthy and reachable. And if the dataset points to other packages - all packages from those random pieces could also be downloaded and distributed.
I know there are a ton of technical reasons why this is difficult, but web-scraping and the like always seemed like a problem that could be done fairly elegantly with distributed systems.
AFAIK YaCy "peers" are just instances of solr doing search/indexing on their own (as set by the user) and the "distributed search" page only redirects the user to one of the available peers with external port open, not much p2p going on.
It requires a lot of resources with little to show for it.
From the readme it sounds like there's a community of people working on this problem. How much of this is stuff you built from scratch versus libraries you used to create the search engine?
I'm wondering what's the minimum set of items I'd need to copy over to reproduce this technology :)
Actually I'm not sure about the "community". The credit for this idea goes mostly to the person behind https://github.com/bittorrent/sqltorrent - but I don't think there are projects using this technique just yet.
I'm hoping TorrentPeek can become something like ZeroNet eventually - or maybe the ZeroNet guys can incorporate this technique in their project.
Neat, both this and sqltorrent. I'll have to read up on this in more detail...
Does this mean you (essentially) need to download the index before doing any searches? Or is there trickery to minimize even the FTS index pieces you need?
No you don't need to download the indexes beforehand. The download always starts from the first piece/page (0). From there it essentially performs binary search and only downloads the least amount of pieces possible in order to satisfy the query.
SQLite is already optimized to minimize disk-seeks, so it already knows to do the bear minimum amount of reads - which translates great for even slower I/O such as this (network access is much slower than disk access).
Right, and it's also HTML/CSS/JS for site owners to program interactions of their site using SQL queries that are prioritized against the underlying torrent.
Imagine sharing wikipedia dumps, along with an index.html page. TorrentPeek will render the index.html and expose a `sqlTorrentQuery(yourQuery)` method so that site owners can program the site around that.
You'd be able to do stuff like:
<form onSubmit=() =>
sqlTorrentQuery('SELECT * FROM wikipedia MATCH ' + this.value + ' LIMIT 50;'),
result => console.log(result))> ... </form>
Yes sorry about that. The project is still an experiment, mostly used to get real world numbers on the feasibility of the idea. I intend to maintain it and develop it much further so I will definitely clean it up.
You mean when the database is well indexed? Yes. SQLite supports Full Text Search (https://sqlite.org/fts5.html) which builds all the indexes for you, allowing you to perform search-engine-like functionality.
But you're right, it does require more attention towards index construction for site owners.
Without indexes, SQLite will do a table scan of sorts, which would still require all pages (=torrent pieces) that have data for a particular table. With indexes however this can be a lot smarter (client likely only needs a few pages from the index for a single query).
You still need to walk the tree, so you don't know which pages you need immediately- but since the root pages for the indexes should be well-distributed in the swarm, it shouldn't take very long to figure out which leaves you need.
Yes. I ran real world numbers. Before getting a result you still have to download on average ~20 pieces (32KB * 20 = 640KB). With healthy swarms and a 500kb/s connection you should get results as soon as ~1 second.
Is 640KiB not the whole index? Because partial download would mean having to request new pieces based on the previously-downloaded ones, adding latency to the mix. Assuming just 50ms latency, for 20 packets that would account for a whole second before seeing a result. And 20 decisions seems a lot, though I don't know how the index works.
What bothers me is that you've rounded the time down in your calculation despite ignoring throughput factors like packet overhead, piece message metadata, other protocol messages and processing time. I'd be interested to see some real time measurements. If you decide to do it (maybe for a research paper?), consider using kibis explicitly (and mind capitalization - b vs B).
Indeed. Without indexes it would be quite silly. SQLite supports Full Text Search (and so does TorrentPeek), so results get back after about ~20 pieces are read.
Either way, nice work. I'm sure some future Tim Berners Lee will do something crazy cool on top of this.