Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Writing a full-text search engine using Bloom filters (stavros.io)
102 points by stavros on Dec 24, 2013 | hide | past | favorite | 22 comments


I have not yet had the chance to try it out but the acrylamid static blog generator written in Python ships with its own JS search [0] using compressed suffix trees. It's probably inspired by Sphinx, the Python doc generator, which has a similar feature [1].

For static blogs, I think this could be used as a first level of search, instead of directly going to Google first.

[0] http://posativ.org/acrylamid/static-search.html

[1] http://stackoverflow.com/questions/605888/whats-the-search-e...


An alternative, but related, approach that would eliminate false positives could be to do an N-ary tree instead, where the bloom filter in this solution is just the first level of the tree. You could then download a json document for the child of the matching top level element to narrow it down to specific documents. You could also calculate relevancies this way.


I'm wondering why not build a tree of bloom filters with a bit array at each level being the bitwise OR of its child nodes. Hashing only needs be done once, and a comparison happens at each level of the tree, which can be traversed to easily find matches whilst discarding branches that are certain to be fruitless.

A bit of googling shows me this already exists and is called a bloofi tree.


Then you are better at googling than I am. Can I enquire for a link?

Edit: nm, it appeared when I searched for bloom trees instead of bloofi


This is probably an improvement over the simplest idea I can think of:

    <?php exec("grep $terms *html", $results);


Any method that doesn't open you up to remote code execution exploits is an improvement over that.


Leveraging grep was my point, not that this would actually be code I'd use. Let's assume the line before that was getting $terms out of the query and sanitizing it. I suppose that doubles the length of my solution, though!


Ah yes, php_real_escape_grep_parameters(3).


Filtering out all characters other than ASCII alphabetical with preg_replace would probably do it.

And yeah, I'm not a huge fan of php - or the ridiculous function names it is known for. However, it can be dead simple to set up for trivial tasks (which is about all od want to use it for).


Static site means never having to say "PHP"!


True, but given how easy and prevenant PHP is on hosting sites, why not use a very small amount, just one or two php pages, each 200 lines or less, and everything else static?

Probably CGI would be better, mind you, keep everything static except what is obviously not.


Because the line one draws is "no server-side code at all". If you use a bit of code for that, you might as well use a bit for this, and for the other thing, and soon it's a CMS.

Besides, the hosts I want to deploy on don't have PHP (S3, stock nginx, Github pages).


Fair enough.

I switched our team's website to a completely static site a few years ago, they were using joomla (vomit) for essentially a broucher site that changed once a quarter.

I wonder if there would be a place for a 'json search as a service' thing - where you could upload your static content (as markdown, rst, txt, whatever) and then use their servers for full text queries which return a json list of matching page URLs and titles...


Because "no server-side code" means "I can host it completely on a CDN", which pretty much solves the scalability needs. If you need to scale those few lines of php or whatever language, this is gonna be much more difficult.

Plus, you can host it on different platforms such as freenet, gnunet, and have the simplest backup ever (`cp` is enough for having a functioning version on any OS of any kind).


CGI is pretty much never better.


The motto of PHP devs is apparently "FKIN ESCAPING, HOW DOES IT WORK?". Use escapeshellarg, kids.


I didn't intend for this to represent the entire code for the search function. I've never written a php page that did something unsafe with user input. After all, '$terms' is not not $_GET['terms'] in my silly example. It's not defined.


escapeshellarg misses some cases. You need to use escapeshellarg_fixed.


real_escapeshellarg_fixed, you mean.


Nice idea :-)

I once wrote a python library for wrapping static content into a sqlite database for full text searching, for roughly the same use-case (static content, but I would like a search system). I just threw it online, if anyone is interested.

https://github.com/danthedeckie/pagestore


You are not doing full-text search with bloom filters (unless you add all partial words to the filter), only word-search, which is imho the biggest disadvantage (e.g. typos in the index or on the client's side).


This reminds me of how Datomic clients download read-only segments and process queries locally. http://docs.datomic.com/architecture.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: