Evolution of Search Engines Architecture – Algolia Search Architecture Part 1 (highscalability.com)
217 points by PretzelFisch on Aug 8, 2021 | 67 comments



The issue with Algolia is that they have insane technology, but it is mostly used only to search documentation.

They are struggling to sell their technology to the people who need it most, for a lot of reasons. One of them is that it's a tricky choice: it isn't a database technology, so it isn't a developer's decision to make, yet the technology is only useful to developers.

As a result they end up selling their product in situations where a search is needed but no developers are working on it. That's how you end up powering external and internal documentation portals. That's really a waste of resources.


I disagree. I'm an early Algolia adopter (and an engineer), and as an engineer they are a no-brainer for me.

Search is really hard even with the best Elasticsearch libraries. IMO the biggest blocker with Algolia is the price. It's really hard to get company buy-in because leadership doesn't get it: "Just build it"


Agreed. The pricing gets crazy fast when you're not really in control of your record / query count, like the typical SaaS dealing with customer data.

It feels tailored to first party use.


> But one of them is that they are a tricky choice.

How so? (What you follow that statement with doesn't seem to explain it.)

When I took a close look at Algolia for a project it seemed straightforward as a choice, but the cost would've been completely out of whack in comparison to what I was spending for the rest of the tech stack. This aligns with what a contributor to Typesense mentions in the comments, which is that what they hear from many potential Algolia customers is that it's "a great product but can get quite expensive at even moderate scale".


> The issue with Algolia is that they have insane technology but it is mostly used only to search documentation.

This is a really interesting side-effect of what Algolia was probably trying to do: use documentation search to spread brand awareness. Ironically, because of their (successful, IMO) strategy, the product is now perceived as being used primarily for documentation search.


I am about to roll out search on Shepherd.com and am looking at using Algolia. I've been impressed with Algolia on Hacker News...

Is anyone else using them? What are your impressions so far?

Much appreciated


I work on an open source alternative to Algolia called Typesense.

Algolia is a great product but can get quite expensive at even moderate scale. If I had a dollar for every time I’ve heard this from Algolia users switching over…

I recently put together this comparison page comparing a few search engines, including Algolia, which you might find interesting: https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...


It's missing the most important thing: speed. We moved to Algolia mainly because of this. Elasticsearch and Solr could not compete.


Oh yes. Speed is an important point. ElasticSearch & Solr use disk-first indexing (with RAM as just a cache), whereas Algolia and Typesense use a RAM-first approach where the entire index is stored in memory. This is what makes Algolia/Typesense return results much much faster than ES/Solr, and lets you build search-as-you-type experiences for each keystroke.
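To make the difference concrete, here's a toy sketch (not Algolia's or Typesense's actual data structures; real engines use tries/radix trees and compressed postings) of why an index that lives entirely in process memory can answer a prefix query on every keystroke:

    # Toy illustration only: every lookup is a plain memory access, no disk I/O.
    from collections import defaultdict

    index = defaultdict(set)          # token -> set of document ids

    def add_document(doc_id, text):
        for token in text.lower().split():
            index[token].add(doc_id)

    def prefix_search(prefix):
        prefix = prefix.lower()
        hits = set()
        for token, doc_ids in index.items():
            if token.startswith(prefix):
                hits |= doc_ids
        return hits

    add_document(1, "fast typo tolerant search")
    add_document(2, "search as you type")
    print(prefix_search("sea"))       # {1, 2} - cheap enough to run per keystroke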

I was thinking about adding a row about speed to the comparison matrix, but couldn't find a way to express the comparison clearly... Imagine a row that said:

Search Speed | Super-fast | Super-fast | Slow? ...

That felt a little off. So I resorted to just mentioning primary index location as a proxy.

Open to suggestions on how to express this succinctly.


What index sizes are we talking about? If it's a few hundred gigs there's always the possibility of putting the entire ElasticSearch index into a ramdisk, or even just leaving lots of "free" RAM meaning the underlying OS will use it to speed up I/O transparently. Bare-metal machines with insane RAM sizes are a thing, and at massive scale could make sense.

I've had great success at a client where simply upgrading a DB to an instance with enough RAM to fit 80% of the entire data set fixed all performance problems and significantly reduced I/O "pressure" at least for reads (writes were never a problem).


I haven’t tried to do this myself so I can’t speak to it.

But one thing I would add is ElasticSearch is quite versatile and flexible, so I wouldn’t be surprised if you can contort it to get it to work for a wide variety of use cases. This is a blessing and a curse - blessing because it’s so flexible, curse because the flexibility breeds complexity and brings with it a steep learning curve and operational complexity.

Where I think Algolia / Typesense help is that things work out of the box without the learning curve or operational overhead.


That table seems... fine? Creating multiple data sets and comparing the various aspects of the products' speed is a lot of additional work that you may not have signed up for, and is far from succinct (or easy). It might feel more empirical, but "feels faster" is fine. You're providing a free service - a review of available products - and can use whatever metrics you choose.


What happens when the index can’t fit into memory with Typesense?

Does it OOM?


Correct, the OS's OOM reaper will kick in to try and protect core OS operations. So you don't want to let it get to that stage - you'd typically want to keep at least 15% of memory free for the OS to do its thing.

Commercially available servers these days go up to 24TB of RAM, which should be sufficient for a good number of search use cases. Beyond that you'd have to shard data across multiple clusters.

Similarly with Algolia, they use 128GB RAM clusters, and recommend you keep your Algolia index size below 100GB: https://www.algolia.com/doc/guides/sending-and-managing-data...
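As a rough sketch of that headroom rule of thumb (purely illustrative; psutil and the example sizes are my own assumptions, the 15% figure is from above):

    # Rough capacity check for a RAM-first index: leave ~15% of RAM for the OS.
    import psutil

    def fits_in_ram(index_size_bytes, headroom=0.15):
        total = psutil.virtual_memory().total
        budget = total * (1 - headroom)   # memory the index is allowed to occupy
        return index_size_bytes <= budget

    # e.g. checking a 100 GB index against the machine it would run on:
    print(fits_in_ram(100 * 1024**3))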


Why not list the main algorithm each engine uses for search speed, so users can look up the difference on another page?


Looks pretty interesting. For a long time there never really seemed to be any good alternatives to ES. Apart from building out the feature set, how do you target quality of search results? Do you have a test bed for measuring this, and do you benchmark against other solutions to try and understand how everyone fares?


We have search relevancy tests baked into the automated test suite that runs on every commit. We keep adding to it as we get feedback about edge cases and new cases.

We don't have comparative relevancy benchmarks. But we do have performance benchmarks here: https://typesense.org/docs/overview/benchmarks.html
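For a flavour of what those tests look like, here's a self-contained sketch (not our actual suite; in the real tests the search call goes to the engine itself rather than a stub):

    # Relevancy regression test sketch, pytest style.
    import pytest

    CORPUS = {
        "doc_iphone_charger": "iphone charger cable",
        "doc_running_shoes": "running shoes for men",
    }

    def search(query):
        # Stand-in for a call to the search engine: naive token-overlap ranking.
        tokens = set(query.lower().split())
        scored = [(len(tokens & set(text.split())), doc_id)
                  for doc_id, text in CORPUS.items()]
        return [doc_id for score, doc_id in sorted(scored, reverse=True) if score]

    CASES = [
        ("running shoes", "doc_running_shoes"),
        ("iphone charger", "doc_iphone_charger"),
    ]

    @pytest.mark.parametrize("query,expected_id", CASES)
    def test_expected_doc_in_top_3(query, expected_id):
        assert expected_id in search(query)[:3]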


Any particular reason why Typesense can't handle:

> Exact Keyword Search ("query")

Any plans on adding it in future?


Just a question of priority, based on the number of asks for it. We do plan to support it.

In the meantime, we introduced a way to turn off typo tolerance and prefix search on a per-field basis. This has helped some users search fields containing model numbers, for example.
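A sketch of what that looks like with the Typesense Python client (parameter names are from my reading of the docs, so double-check the exact syntax; host, key and fields are placeholders):

    import typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",
    })

    results = client.collections["products"].documents.search({
        "q": "SM-G991B",
        "query_by": "model_number,description",
        # Per-field, comma-separated: exact matching on model_number,
        # typo tolerance and prefix search kept for description.
        "num_typos": "0,2",
        "prefix": "false,true",
    })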


Price is 100% why we're looking at typesense.


Does Typesense support searching in non-latin languages?


Yes it does - all languages except logographic ones (Chinese, Japanese and Korean) which we are actively working on: https://github.com/typesense/typesense/issues/228


Okay, those are the ones I meant ;) – thanks! This is the main limitation with our postgres implementation rn.


I heard a joke about FTS engines, but Whoosh !


I'm using MeiliSearch, which is an open source alternative

worth giving it a look

https://github.com/meilisearch/MeiliSearch


Did not know about MeiliSearch. Looks really great! Thanks for sharing.


It's easy to use and set up. If pricing and closed source are OK with you then it's worth it. We used them a few years ago and then switched to ES. Think of it like pre-Docker Heroku.


Out of curiosity, what made you choose ES over Algolia?


As @kirubakaran said it was the price and the closed source license. If search becomes a very important part of your business you better own it rather than outsource it.

Algolia is great to get started but it doesn't make sense at scale. If you have large indexes it's just too expensive.


What did the migration effort look like when moving from Algolia to ElasticSearch? Also, were you able to replicate the same user experience?


From the comment, I guess "pricing and closed source" became not OK


Using them for a side project. Very impressed with their developer experience. Their React instant-search plugin is great, and very easy to add. Their documentation is great. Their admin UI is great.


We have it configured for https://docs.lowdefy.com

Really happy with the service it provides and the ease of implementation. Note that because the docs can take a few seconds to load, their crawler times out and misses some content some of the time. With better page performance this should not be an issue.

(We are actively working on some cool ideas to make Lowdefy apps super fast)


Big up-front disclaimer: my job is making software at Loop54 and my salary comes from happy customers of our service.

One of our goals is similar to yours: browsing an online store should be like walking around in a physical store. The navigation system on the site should be as adept as a knowledgeable store employee in helping you find exactly what you're looking for.

At Loop54 many of our customers come from Algolia. It's very popular, and nobody ever gets fired for buying Algolia. In that sense, it's a safe option.

On the other hand, customers come to us from Algolia because Algolia requires a bit of hand-holding and it still doesn't quite seem to get what users are really looking for. When our prospects run randomised controlled trials, our search consistently seems to give users what they want better than Algolia does, with less effort. I can ask about specific numbers if you want.

However, another of Algolia's strengths, one where Loop54 is currently behind, is the surrounding tooling. For better or worse, with Algolia you'll have more knobs and levers to play with (and you'll need them much more often!)

We do have one or two customers that have a majority of books in their product catalogues, and we know there are some unique challenges that come with that domain.

Loop54 is a very competent, but smaller player. If you think it's interesting, it's worth talking to us. I can't evaluate how good a fit your site would be for us, but that's why we have people who do that for a living!

Edit: I should also say that yes, Loop54 is even more expensive. You shouldn't blindly trust us (or any other provider.) I would strongly suggest running a randomised controlled trial to see whether any expense at all is worth it in your case.

I say this in part because I'm a man of science and believe in experiments to measure things, but also out of self-interest; anyone can throw out impressive marketing, but our search truly shines when put to the test against the alternatives.


I am sorry, but https://www.loop54.com/pricing is just totally snide. No monetary information whatsoever. Why even lure me to a pricing page with less than zero pricing honesty?


You're absolutely right, of course. We're in the process of fixing that.

The TL;DR is that we're offering customers more flexibility than the tiered model suggests. Consequently there's a large variability in what customers pay. A number on the pricing page turned out to be more misleading than helpful for our target audience.

The long-term solution is creating a more flexible pricing model that can be very transparent. We are working on that. (Though as you know, generic, modular solutions take a while to get just right, unfortunately.)

We didn't think a short-term solution was needed, but based on your comment, that might be worth reconsidering. The best I can think of is a 90% range of what customers actually end up paying for each tier. What suggestions do you have?


Ya just to mention, the last company I worked for had a policy of not working with any company that doesn't clearly disclose pricing publicly. I've adopted that one too.


Some idea is better than nothing, but having a pricing link with zero substance really, really irks, especially when you're trying to market yourself here as a viable Algolia alternative. They at least give some price points.


I know a health tech company that uses it to power typeahead for prescribing drugs. It worked super well the last time I saw it in action.


It's strange, I don't really like using the HN Algolia search. I think it's because the responsiveness doesn't fit HN and the results are okay but not great? What are some other big sites that use Algolia as their search backend? It would be interesting to compare.


The problem with HN Algolia search is that they never updated it after comment scores were hidden. The default is to sort comments by score, but all recent comments are ranked last because their score is just assumed to be 1 or something like that.

I also wish it searched both stories and comments by default. I guess you can set your own defaults but meh, I use many different computers and don't change defaults as a general rule because it's a hassle to keep all the computers in sync.
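For what it's worth, you can also hit the HN Algolia API directly and ask for both in one query. A quick sketch (tags syntax as I understand the API docs):

    # Search stories and comments together; there's also a search_by_date
    # endpoint if you want recency rather than the default ranking.
    import requests

    resp = requests.get(
        "https://hn.algolia.com/api/v1/search",
        params={"query": "typesense", "tags": "(story,comment)"},
    )
    for hit in resp.json()["hits"][:5]:
        print(hit.get("title") or hit.get("comment_text", "")[:80])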


Ah, that makes sense! I always wondered why any search term would just return 8 year old stuff at the top.


> the results are okay but not great

What more do you expect than keyword search ranked by upvotes on HN? I find it great, honestly; it's fast and doesn't do any magic.


We use it for general search and similar items results on www.liveauctioneers.com


The comfort they provide is a trap sometimes! Algolia suggests that the frontend send queries directly to its service instead of going through your backend, which is good if you want a good search engine fast. But don't go for it without considering the consequences: it will take over part of your frontend, and your product will depend on Algolia to the point that implementing even a simple favourites feature for your users may need to be integrated with their service if you're not careful.
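One way to keep that dependency contained is to proxy queries through your own backend so the frontend only ever talks to you. A minimal sketch (endpoint shape based on my reading of Algolia's REST docs; app id, key and index name are placeholders):

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    APP_ID, SEARCH_KEY, INDEX = "YourAppID", "search-only-api-key", "products"

    @app.route("/search")
    def search():
        # The frontend only knows about /search; swapping Algolia out later
        # means changing this function, not the client code.
        resp = requests.post(
            f"https://{APP_ID}-dsn.algolia.net/1/indexes/{INDEX}/query",
            headers={
                "X-Algolia-Application-Id": APP_ID,
                "X-Algolia-API-Key": SEARCH_KEY,
            },
            json={"params": f"query={request.args.get('q', '')}&hitsPerPage=10"},
        )
        return jsonify(resp.json())

The trade-off is an extra network hop, which is exactly the latency the frontend-direct setup is trying to avoid.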


No https in 2021?


No. In fact most websites don't need HTTPS and the pointless data transfer it adds. Wish we could go back a few years on this zeitgeist.


This is false. Just because the page content isn't sensitive, that doesn't mean that TLS is worthless.

TLS prevents your run-of-the-mill MITM scenarios, like ISPs inserting ads (something Comcast actually did), or public wifi doing the same. Or worse, more malicious scripts.

You could argue that all I'm really looking for in most cases is message integrity (signing), but if you're going to do that, you might as well just encrypt it too and avoid accidents where sensitive information is sent over unencrypted channels.


Every visited HTTP website is a network vulnerability.

It doesn't matter what is supposed to be on these sites. From a security perspective they contain an MITM attacker's content. They are effectively an API for issuing arbitrary commands to the browser. To shut down this attack API, all sites have to stop using HTTP, no exceptions.


Even if you disagree (and I do as well), downvotes are the wrong way to signal disagreement. Either invest in posting a response or move past it- there's no need to hide the comment from others. We're better than that, or at least we should be.


Are suffix trees/arrays used at all? How about n-grams with a Bloom filter for filtering documents?


idk, this seems more like an evolution of clustering; when I think about search engines I think more of the progression toward stemming, lemmatization, synonym matching and context matching.


It’s a highscalability blog post, though, which usually focuses on precisely the clustering, sharding, etc aspects.

Not saying you’re wrong, but it’s just a different audience that would be interested in the actual search algorithms.


Also doing it in memory (which is what all the regular search engines do right?)


No, ElasticSearch for example uses a disk-first approach.


ES uses a disk-first approach, but only on first load, and it is smart enough to also load similar results for frequently searched items into cache. That's why search return times differ significantly between hot and cold queries. This is actually such a problem that, with older ES versions, you often wind up pre-warming the ES cache before you can let it serve production traffic. Most alternative search engine implementations, and especially vector-based search engines, load into memory when brought up rather than at query time. ES still isn't great at this, which is one reason they're falling behind in modern search; that, and their vector search support was kind of abysmal as of even 6 months ago.
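The pre-warming itself was nothing fancy, roughly this (index name and query list are placeholders; it just replays frequent queries against the standard _search endpoint before the cluster takes real traffic):

    import requests

    COMMON_QUERIES = ["red dress", "iphone case", "bluetooth headphones"]

    for q in COMMON_QUERIES:
        requests.post(
            "http://localhost:9200/products/_search",
            json={"query": {"match": {"title": q}}},
        )
        # Responses are discarded; the point is to pull the relevant index
        # segments into the page cache / query caches before real traffic hits.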


In Elasticsearch 7.7 they got rid of a lot of that caching, and they now rely on OS-level disk caching for most index accesses: https://www.elastic.co/blog/significantly-decrease-your-elas....


They're falling behind compared to what exactly?

Elastic is great on disk, especially on SSDs, and avoids issues like write amplification.

Loading large indexes into memory isn't simple or cheap, and when it comes to vectors we're talking apples and oranges, I feel. Modern search architectures need to embrace ensemble approaches, but boolean-based content search is often the primary utility in the enterprise (with search supplemented by a customizable tf-idf). Vector-based retrieval & similarity is still useful, but it isn't something you necessarily need Elastic to do for you, or that couldn't co-exist alongside it.


I've scaled a cluster that was in the hundreds of millions of results range. The experience was not great, and tuning for our use case, which was decidedly not a typical enterprise search problem, made it a complete pain. So it's great that it works for that particular case, and we ultimately made it work ourselves much like you're suggesting: we used vector search with something like FAISS as a pre-filtering step and then ran a final search through a much-reduced set of ids in ES. But it's pretty clear a new player could come in and make a much better experience. Basically ES is, if not unsuitable, a big pain for large non-enterprise search such as web search, where things like vector search are one major signal and provide a better search experience. And that's the exact problem: there aren't off-the-shelf open source solutions if you're not doing a fairly standard e-commerce or internal business search problem like log aggregation or internal documents.

I'm also suggesting that use cases that aren't enterprise-type search problems are more common than you'd think these days.

Edit: Additionally, the thing here is you have classical boolean search systems like ES and vector search solutions like Milvus, but no one's gotten around to making something that does both well. From what I can see, a lot of the players in the space are trying to go in that direction, but it's a slow, painful crawl. That leaves you in the kind of situation we were in, doing a lot of custom gluing of these systems together and keeping them in parity, which was super annoying, expensive, and time consuming, but not necessarily performance inhibiting.
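Roughly, the glue looked like this (a simplified sketch, not our production code; embeddings, ids and the query are placeholders):

    # Stage 1: ANN candidate generation with FAISS.
    # Stage 2: a normal ES query restricted to those candidate ids.
    import faiss
    import numpy as np
    import requests

    dim = 128
    doc_ids = ["doc-1", "doc-2", "doc-3"]             # FAISS position -> ES _id
    embeddings = np.random.rand(len(doc_ids), dim).astype("float32")

    ann = faiss.IndexFlatL2(dim)      # brute-force here; use IVF/HNSW at scale
    ann.add(embeddings)

    query_vec = np.random.rand(1, dim).astype("float32")
    _, positions = ann.search(query_vec, 2)           # top-k candidate positions
    candidates = [doc_ids[i] for i in positions[0]]

    resp = requests.post(
        "http://localhost:9200/products/_search",
        json={"query": {"bool": {
            "must": [{"match": {"title": "wireless headphones"}}],
            "filter": [{"ids": {"values": candidates}}],
        }}},
    )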


Have you seen vespa.ai?


I suppose I meant web search engines like Google


In-memory search works well as long as you don't care about persisting your data... which for most companies would be a big chunk of their strategic assets.


no BERT?


Neural search, in combination with learning algorithms and traditional keyword search, is clearly the future. It vastly outperforms traditional search engines.

At sajari.com we have been working on an experiment that uses a 1-CPU machine on Cloud Run to serve a neural-network-generated, hash-based index of an old Best Buy catalog (25k products). Retrieval uses an approximate nearest neighbour (ANN) lookup which typically takes ~1 ms. Speed and relevancy are already pretty good.

But we have also learned that there is no one silver bullet and we have seen the best results when combining neural search with traditional keyword search and reinforcement learning.
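As a toy illustration of that blending (the weights, scores and vectors here are made up, not how our production ranking works):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def hybrid_score(keyword_score, query_vec, doc_vec, alpha=0.6):
        # alpha weights lexical relevance vs. semantic similarity.
        return alpha * keyword_score + (1 - alpha) * cosine(query_vec, doc_vec)

    q = np.array([0.1, 0.9, 0.2])
    doc = np.array([0.2, 0.8, 0.1])
    print(hybrid_score(keyword_score=0.7, query_vec=q, doc_vec=doc))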

You can take a look at the demo here: http://neural-hashes.sajari.com

Be gentle, this is an experiment and not a production scale implementation.


This is easily the most fun thing I’ve been involved with for years. Can’t wait to see it ship.


I've scaled large transformer based models that supplement a lucene-based search engine. The architecture supports an ensemble approach where Lucene results are first-class and then we tailor similarity rankings with the models.

It looks a lot like this: https://huggingface.co/blog/bert-cpu-scaling-part-1

We have to store large "index" embeddings on SSDs and use LevelDB for value retrieval of the Lucene results.
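A stripped-down sketch of that kind of re-ranking step (not our actual pipeline; the model name and hits are placeholders, using sentence-transformers):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "laptop bag that fits a 16 inch macbook"
    lucene_hits = [                            # placeholder top-N from Lucene
        "16 inch laptop sleeve, padded",
        "leather messenger bag",
        "macbook pro 16 charger",
    ]

    q_emb = model.encode(query, convert_to_tensor=True)
    hit_embs = model.encode(lucene_hits, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, hit_embs)[0]

    # Re-order the keyword engine's top-N by semantic similarity to the query.
    reranked = sorted(zip(lucene_hits, scores.tolist()), key=lambda x: -x[1])
    for text, score in reranked:
        print(f"{score:.3f}  {text}")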


Yep, I was surprised -- Google and others have long since moved to neural search, afaict, where we are seeing things like FAISS for embedding-based indexes and all sorts of deployment pain around training + inference. I knew that was still true for Elastic, but hadn't realized it also held for their replacements. So this article is about clustering for pre-neural search, and I guess enterprise search is still getting there.



