Writing a full-text search engine using Bloom filters

zxcvgm · on Dec 24, 2013

I have not yet had the chance to try it out but the acrylamid static blog generator written in Python ships with its own JS search [0] using compressed suffix trees. It's probably inspired by Sphinx, the Python doc generator, which has a similar feature [1].

For static blogs, I think this could be used as a first level of search, instead of directly going to Google first.

[0] http://posativ.org/acrylamid/static-search.html

[1] http://stackoverflow.com/questions/605888/whats-the-search-e...

stormbrew · on Dec 24, 2013

An alternative, but related, approach that would eliminate false positives could be to do an N-ary tree instead, where the bloom filter in this solution is just the first level of the tree. You could then download a json document for the child of the matching top level element to narrow it down to specific documents. You could also calculate relevancies this way.

ris · on Dec 24, 2013

I'm wondering why not build a tree of bloom filters with a bit array at each level being the bitwise OR of its child nodes. Hashing only needs be done once, and a comparison happens at each level of the tree, which can be traversed to easily find matches whilst discarding branches that are certain to be fruitless.

A bit of googling shows me this already exists and is called a bloofi tree.

yxhuvud · on Dec 25, 2013

Then you are better at googling than I am. Can I enquire for a link?

Edit: nm, it appeared when I searched for bloom trees instead of bloofi

code_duck · on Dec 24, 2013

This is probably an improvement over the simplest idea I can think of:

    <?php exec("grep $terms *html", $results);

alcari · on Dec 24, 2013

Any method that doesn't open you up to remote code execution exploits is an improvement over that.

code_duck · on Dec 24, 2013

Leveraging grep was my point, not that this would actually be code I'd use. Let's assume the line before that was getting $terms out of the query and sanitizing it. I suppose that doubles the length of my solution, though!

jrockway · on Dec 25, 2013

Ah yes, php_real_escape_grep_parameters(3).

code_duck · on Dec 26, 2013

Filtering out all characters other than ASCII alphabetical with preg_replace would probably do it.

And yeah, I'm not a huge fan of php - or the ridiculous function names it is known for. However, it can be dead simple to set up for trivial tasks (which is about all od want to use it for).

stavros · on Dec 24, 2013

Static site means never having to say "PHP"!

deckiedan · on Dec 24, 2013

True, but given how easy and prevenant PHP is on hosting sites, why not use a very small amount, just one or two php pages, each 200 lines or less, and everything else static?

Probably CGI would be better, mind you, keep everything static except what is obviously not.

stavros · on Dec 24, 2013

Because the line one draws is "no server-side code at all". If you use a bit of code for that, you might as well use a bit for this, and for the other thing, and soon it's a CMS.

Besides, the hosts I want to deploy on don't have PHP (S3, stock nginx, Github pages).

deckiedan · on Dec 24, 2013

Fair enough.

I switched our team's website to a completely static site a few years ago, they were using joomla (vomit) for essentially a broucher site that changed once a quarter.

I wonder if there would be a place for a 'json search as a service' thing - where you could upload your static content (as markdown, rst, txt, whatever) and then use their servers for full text queries which return a json list of matching page URLs and titles...

rakoo · on Dec 24, 2013

Because "no server-side code" means "I can host it completely on a CDN", which pretty much solves the scalability needs. If you need to scale those few lines of php or whatever language, this is gonna be much more difficult.

Plus, you can host it on different platforms such as freenet, gnunet, and have the simplest backup ever (`cp` is enough for having a functioning version on any OS of any kind).

stormbrew · on Dec 24, 2013

CGI is pretty much never better.

al2o3cr · on Dec 24, 2013

The motto of PHP devs is apparently "FKIN ESCAPING, HOW DOES IT WORK?". Use escapeshellarg, kids.

code_duck · on Dec 24, 2013

I didn't intend for this to represent the entire code for the search function. I've never written a php page that did something unsafe with user input. After all, '$terms' is not not $_GET['terms'] in my silly example. It's not defined.

mcguire · on Dec 24, 2013

escapeshellarg misses some cases. You need to use escapeshellarg_fixed.

jrockway · on Dec 25, 2013

real_escapeshellarg_fixed, you mean.

deckiedan · on Dec 24, 2013

Nice idea :-)

I once wrote a python library for wrapping static content into a sqlite database for full text searching, for roughly the same use-case (static content, but I would like a search system). I just threw it online, if anyone is interested.

https://github.com/danthedeckie/pagestore

posativ · on Dec 24, 2013

You are not doing full-text search with bloom filters (unless you add all partial words to the filter), only word-search, which is imho the biggest disadvantage (e.g. typos in the index or on the client's side).

vdm · on Dec 24, 2013

This reminds me of how Datomic clients download read-only segments and process queries locally. http://docs.datomic.com/architecture.html