There have been a few search engines out recently. I'm curious how people evaluate them quickly.
I've realized my searching is basically optimized for google and the web that has grown up around it. Also, in 1998 I wasn't as aware of what was out there as I am now. It's pretty rare (even if it's possible) that I do a search and come across a completely new site that I haven't heard of before, for anything nontrivial. That was different when search began.
Google is now almost a convenience. If I have a coding question, I search for "turn list of tensors into tensor" or whatever but I'm really looking for SO or the pytorch documentation, and I'll ignore the geeksforgeeks and other seo spam that finds its way in. It's almost like google is a statistical "portal" page, like Yahoo or one of those old cluttered sites was, that lets me quickly get through the menus by just searching. That's different from a blank slate search like we might have done 25 years ago.
I think what's really lacking now is uncorrupted search for anything that can be monetized. Like I tried to search for a frying pan once on google and it was unusable. I'm not sure any better search engine can fix that, that's why everyone appends "reddit" to queries they are looking for a real opinion on, again, because they are optimizing for the current state of the web.
Anyway, all that to say I think there are a lot of problems with (google dominated) search, but they are basically reflected in the current web overall, so just a better search engine, outside of stripping out the ads, can only do so much. Real improved search efforts need to somehow change the content that's out there at the same time as they improve the experience, and show us, in a simple way, how to get the most out of it. I think google has a much deeper moat than most people realize.
> I've realized my searching is basically optimized for google
Is it just me, or does Google no longer provide good results for me?
Like every time I search for something completely outside my knowledge, like "How to purchase a property in Mexico", it gives me 100+ results of autogenerated content like "10 best places to buy property in Mexico". And the only way to fix that would be to add something like `site:reddit.com`.
> Is it just me, or does Google no longer provide good results for me?
I am starting to suspect that there might be nothing to find.
I just don't think people (other than the tech-oriented) are creating websites and running forums - and why would they? Reddit might be the only place you _can_ find that type of content. What should search engines do then?
With a tiny number of exceptions, it might be that people chat on reddit, read Wikipedia, ask questions on the stackexchange network/Quora, local communities use facebook groups, and businesses have a wordpress site with nothing more than a bit of fluff, a phone number and an email address.
Kagi's "noncommercial" lens, search.marginalia.nu, and engines that don't parse JS (e.g. Mojeek) can add some variety. Another thing I like doing is adding phrases like "creative commons" to already-long queries to filter out some corporate results, or adding `-gdpr -ccpa -"sell my info"` to limit results to sites made by small orgs and individuals who don't collect enough data or make enough money to warrant compliance measures.
If all websites try to optimise for SEO, they undermine the assumption that a search engine's ranking purely reflects how well a site satisfies a query.
I really think that, one, we are going to end up with search engines managing a curated list of 'roots', and, two, those roots are going to end up consisting substantially of a mix of more 'human' sites and, let's be honest with ourselves, a certain amount of content that pays for favoritism.
I think it's very possible that we have effectively raised the noise floor so high that there is no signal, but also likely that perverse incentives from trying to profit off of search engines have made them our enemies instead of our friends.
For instance, does Google favor sites that run Google's own tools? I've stopped paying attention but recall hearing mutterings to that effect. If so, then running the tools is a protection racket.
For other perverse incentives: if you try to rank sites by how long someone stays on them before backing out, or searching again, then you end up favoring rabbit-hole sites, that either string you along or suck you into a tangent. "Oh, this must have answered their question about keeping bees," no, they're reading gossip about the Queen of England and have forgotten all about beekeeping.
Oh, look! It's the same "flaw" people have been explaining to each other for a decade. We've all seen it dozens of times, yet Google apparently remains in the dark.
> because they are optimizing for the current state of the web.
I believe people will at least start looking for alternatives. For example, I have been collecting search engines, and whenever I encounter a page with too many commercial-laden SEO-porked results, I use a different search engine in Firefox.
I have enabled the Search Bar, so I can do Alt+D, Tab, Tab, enter my query, then click a different search engine, which searches instantly, unlike the main bar, where you have to press Enter once more after clicking.
Pro tip: Alt+E takes you directly to the search bar, then you can press Tab to select the search engine. The best part is that you never use the mouse this way. You can also use DDG bangs from the address bar (Alt+D) if you remember the bang for the site; they cover just about every search engine/site.
And it did not work. I hit a brick wall. I completely lost trust in Firefox. I want a browser created by a non-profit. Thank you Google for corrupting everything you touch.
Some of these (Andi, nee Lazyweb; You; SwissCows) are Bing proxies. Gnod is a search launcher, not an engine unto itself.
Many more installable engines are available at https://mycroftproject.com as OpenSearch XML plugins, compatible with Firefox and discoverable by Chromium.
I've been wondering for a while now about building a search engine for the ad free web. That is, penalize or outright refuse to index any recognized advertising network, letting through only those sites which don't perform invasive tracking with third party services. Mostly as a curiosity: what would be left? What would rise to the top when you filter all of that out?
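One crude way to sketch that filter: scan each crawled page's markup against a blocklist of known ad/tracker domains and only index pages that reference none of them. A tiny, purely illustrative Python version (a real filter would draw on maintained lists like EasyList rather than this hand-picked sample):

    import re

    # Illustrative blocklist only; real filters use maintained lists (e.g. EasyList).
    AD_TRACKER_DOMAINS = [
        "doubleclick.net",
        "googlesyndication.com",
        "google-analytics.com",
        "adnxs.com",
        "facebook.net",
    ]

    def looks_ad_free(html):
        """True if the page references none of the known ad/tracker domains."""
        urls = re.findall(r'(?:src|href)\s*=\s*["\']([^"\']+)', html, flags=re.IGNORECASE)
        return not any(domain in url for url in urls for domain in AD_TRACKER_DOMAINS)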
I've thought about something similar, basically "the good internet" that would be a hand curated list of sites that are not there just as a pretext for ads. I think a lot of software project documentation qualifies, as does lots of stuff on university sites, like lecture notes for example. I assume that across different niches there is other stuff like this. I think the key would be something that can't be gamed, like it has to be legitimate content that is online for an existing purpose and not as a pretext.
exploring adding that search / something similar as a Breeze filter at https://breezethat.com/ - not sure we can do that with curation + Bing or Google; may have to wait until we bring our scraper out of testing
Evidently someone disagrees, but it has no ads or trackers except on the home page, and its pages rank highly on current search engines, so if you exclude trackers and ads, that's what you're going to get.
DDG's organic link results are from Bing, sans personalization. DuckDuckGo advertises using "over 400 sources", which means that at least 399 sources only power infoboxes ("instant answers") and non-generalist search, such as the Video search.
The "Web track" task at the annual US NIST TREC conference ( https://trec.nist.gov/
) is an open innovation benchmark that everyone can contribute; participants get a set of queries that they have to run on exactly the same corpus. Then they return the top-k results to a team that evaluates them.
(Founder of Neeva) Eval for search engines is about as hard as ranking for search engines. You need rating templates, raters, querysets, tooling and lots of time and patience. There are a number of vendors who can help you with the raters part, but the rest is still painstaking work you have to do yourself. Email me if you need help; we are happy to share findings.
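To give a flavor of the eval side once you have raters: offline quality is usually summarized with metrics like nDCG computed over graded relevance labels per query. A minimal, generic sketch (standard IR formula, nothing Neeva-specific):

    import math

    def dcg(relevances):
        """Discounted cumulative gain for a ranked list of graded relevance labels."""
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        """DCG normalized by the DCG of the ideal (best possible) ordering."""
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Example: raters graded the top five results of one query on a 0-3 scale.
    print(ndcg([3, 2, 3, 0, 1]))  # ≈ 0.97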
There are benchmarks within the adjacent field of information retrieval, but in general it's hard to properly validate a search engine because real data is so noisy and misbehaved, and sample data is so different from real data.
Sure, the problem of information retrieval is not exactly that of web search, but they're pretty close. So, as someone knowledgeable on this topic, could you remind us: what are some of those benchmarks?
I'd love to be able to add a tag to a search to have it exclude sites with any kind of monetization, I know that's not realistic cuz that's where Google makes most of its money (or do they make most of their money somewhere other than advertising these days?). Anyways, yeah, I'm sick of SEO optimized, click optimized, advertising optimized, affiliate link optimized crap.
(Founder of Neeva here) -- Chris -- I love the idea. Neeva does this in a contextually relevant manner to the intent of the query. For example, on health queries, we label all health sites as "trusted", "ad-supported" etc. and allow you to filter down to the appropriate subset of results. For programming queries, we label sites as "official sites", "forums", "blogs", "code repos", "programming websites" (the SEO-ed ones). We work with human raters to do this. Would love to hear if you find it useful and what other labels would be of use.
I'm surprised you are not a fan of geeksforgeeks. While each of their webpages has substantially less content than the pytorch docs or SO result, I find that they get to the point instantly. My mean time to solution from G4G is definitely lower than from SO.
I guess everyone has their go-to sites and their pet peeves. Geeksforgeeks may be less spammy than some, but I still think of it as that annoying site that got in the way of either the SO or documentation answer that I was looking for.
Just to expand, if I want the api reference, say I search for defaultdict (for some reason I like using them but always have to look at the reference), I want the python documentation. I definitely don't want a third party telling me about it.
And if I search a "make list of tensor into tensor" type question, I want SO where someone had asked the same question and got "tensor.stack" as the reply, so I can understand the answer and follow up by looking at the tensor.stack pytorch reference it I want.
Anything else is wasting my time, I think most users with similarly specific queries are not looking for tutorials, they are looking for the names of functions they hypothesize exist, or references. That's why intermediary sites that try to give an explanation are annoying, at least for me.
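For concreteness, the SO-style answer being described really is a one-liner; torch.stack is the actual PyTorch function (assuming the tensors share a shape):

    import torch

    # A list of same-shaped tensors...
    tensors = [torch.zeros(3), torch.ones(3), torch.full((3,), 2.0)]

    # ...stacked along a new first dimension into a single (3, 3) tensor.
    stacked = torch.stack(tensors)
    print(stacked.shape)  # torch.Size([3, 3])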
I generally find that sites like SO, GFG, etc. often play the role of "Reading the Docs as a Service". I prefer using them only after official documentation or specifications fail me. When I want an opinionated answer, I just ping some people I already know or check the personal websites of the developers of the language/tool I'm using. If I have further questions, I check the relevant IRC channel. Sites like SO are a "last resort" for me.
In other words, I'd rather see these at the bottom of the SERP than the top, but I wouldn't want to completely eliminate them.
I've found that their content is often inaccurate or written by people who come across as novices. I actually emailed them to correct an inaccuracy in one of their articles once, which they did, so kudos to them for that.
Around the same time, I also discovered sengine.info, Artado, Entfer, and Siik. By sheer coincidence they all were mentioned to me or decided to crawl my site within the same couple weeks. So yes, from my perspective there have been more than a few new smaller engines getting active on the heels of bigger names like Neeva, Kagi, Brave Search, etc.
I've been using the Kagi beta for a few months now, and it's awesome: https://kagi.com/
The biggest thing I've found is that when doing technical searches it always turns up the sources I'm actually looking for, actively filtering all the GitHub/StackOverflow copycat sites.
It also seems to up-weight official docs compared to Google. For example, "how to read a json file in python" turns up the Python docs as the second result, where in Google they're nowhere to be found.
1. free version is curated topic filters that sit on top of Google -- best on laptop or desktop at the moment / iterating on mobile due to ad splash; it's unclear, due to the TOU, whether we can ever make that fully server-side legally, though we do have some things in testing to see if the free version can be ad-free or more tracking-free
2. premium version will be mix of our scraping and Bing, depending on topic - standard web search via Bing + same curation as (1), closer to real-time for us
3. have tested out most other indie indexes or sites; for full web scale, money is on ahrefs or Brave giving Bing / Google a run for it
4. our primary emphasis is on bringing back some of the Yahoo! directory or Alta Vista look & feel of drilling into topics, so balancing 1-3 (^) as best we can atm, small team, fully bootstrapped modulo tiny F&F round
5. also @DotDotJames on twitter, still iterating when add team info to site
I've been keeping my eye on You.com, tracking a few SERPs over time compared to other Bing- and Google-based engines. So far, the results don't seem independent.
Try comparing results with a Bing-based engine (e.g. DuckDuckGo) or a Google-based one (e.g. StartPage, GMX) to see if they differ. (Don't use Google or Bing directly, since results will be personalized based on factors like location, device, your fingerprint, etc.).
maybe the solution is to make google itself reddit style. let users downvote the seo spam websites and allow them to be downranked.
sure it opens the door for a different kind of abuse... but maybe that problem is more fixable?
To paint with a broad brush, I look at three criteria:
1. Infoboxes ("instant answers") should focus on site previews rather than trying to intelligently answer my question. Most DuckDuckGo infoboxes are good examples of this; Bing and Google ones are too "clever".
2. Organic results should be unique; most engines are powered by a commercial Bing API or use Google Custom Search. Compare results with a Bing or Google proxy (duckduckgo, startpage, etc) to avoid personalized results. Monitor queries over time to see if SERPs change in ways that diverge from Google/Bing/Yandex.
3. "other" stuff. Common features I find appealing include area-specific search (Kagi has a "non-commercial lens" mostly powered by its Teclis index; Brave is rolling out "goggles"), displaying additional info about each result (Marginalia and Kagi highlight results with heavy JS or tracking), user-driven SERP personalization (Neeva and Kagi allow promoting/demoting domains), etc.
And always check privacy policies, TOS, GDPR/CCPA compliance, etc.
> Google is now almost a convenience. If I have a coding question, I search for "turn list of tensors into tensor" or whatever but I'm really looking for SO or the pytorch documentation, and I'll ignore the geeksforgeeks and other seo spam that finds its way in. It's almost like google is a statistical "portal" page,
I like engines like Neeva and Kagi that allow customizing SERPs by demoting irrelevant results; I demote crap like GFG, w3schools, tutorialspoint, dev(.)to, etc. and promote official documentation. Alternatively, you can use an adblocker to block results matching a pattern: https://reddit.com/hgqi5o
My name is Josef Cullhed. I am the programmer of alexandria.org and one of its two founders. We want to build an open-source, nonprofit search engine; right now we are developing it in our spare time and funding the servers ourselves. We are indexing Common Crawl and the search engine is in a really early stage.
We would be super happy to find more developers who want to help us.
I was trying to learn more about the ranking algorithm that Alexandria uses, and I was a bit confused by the documentation on Github for it. Would I be correct in that it uses "Harmonic Centrality" (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf) at least for part of the algorithm?
Yes, our documentation is probably pretty confusing. It works like this: the base score for all URLs on a specific domain is the harmonic centrality (HC).
Then we have two indexes, one with URLs and one with links (we index the link text).
Then we first make a search on the links, then on the URLs. We then update the score of the urls based on the links with this formula:
domain_score = expm1(5 * link.m_score) + 0.1;
url_score = expm1(10 * link.m_score) + 0.1;
then we add the domain and url score to url.m_score
where link.m_score is the HC of the source domain.
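Putting those pieces together, here's a rough Python sketch of that update step as described above (variable names are illustrative, not Alexandria's actual C++ code):

    import math

    def update_url_score(base_hc_score, incoming_links):
        """Combine a URL's base score (its domain's harmonic centrality)
        with boosts from its incoming links, per the formulas above."""
        score = base_hc_score
        for link in incoming_links:
            # link["m_score"] is the harmonic centrality of the link's source domain.
            domain_score = math.expm1(5 * link["m_score"]) + 0.1
            url_score = math.expm1(10 * link["m_score"]) + 0.1
            score += domain_score + url_score
        return score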
The main scoring function seems to be index_builder<data_record>::calculate_score_for_record() in line 296 of https://github.com/alexandria-org/alexandria/blob/main/src/i..., and it mentions support for BM25 (Spärck Jones, Walker and Robertson, 1976) and TFIDF (Spärck Jones, 1972) term weighting, pointing to the respective Wikipedia pages.
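For readers unfamiliar with those weighting schemes, a minimal BM25 scorer looks roughly like this (generic textbook form, not Alexandria's implementation; k1 and b are the usual free parameters):

    import math

    def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
        """Textbook BM25: IDF-weighted, length-normalized term frequencies."""
        doc_len = len(doc_terms)
        score = 0.0
        for term in query_terms:
            tf = doc_terms.count(term)
            if tf == 0:
                continue
            n = doc_freq.get(term, 0)  # number of documents containing the term
            idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return score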
Thanks for sharing this with the world. Did you manage to include all of a common crawl in an index? How long did that take you to produce such an index? Is your index in-memory or on disk?
I'd consider contributing. Seems you have something here.
The index we are running right now contains all URLs in Common Crawl from 2021, but only URLs with direct links to them. This is mostly because we would need more servers to index more URLs and that would increase the cost.
It takes us a couple of days to build the index but we have been coding this for about 1 year.
Love it. Makes for a cheaper infrastructure, since SSD is cheaper than RAM.
>> It takes us a couple of days to build the index
It's hard for me to see how that could be done much faster unless you find a way to parallelize the process, which in itself is a terrifyingly hard problem.
I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing? According to you, what kind of data structure allows for the fastest indexing and how do you represent it on disk so that you can read your on-disk index in a forward-only mode or "as fast as possible"?
Yes it would be impossible to keep the index in RAM.
>> It's hard for me to see how that could be done much faster unless you find a way to parallelize the process
We actually parallelize the process. We do it by splitting the URLs across three different servers and indexing them separately. Then we just run the searches on all three servers and merge the resulting URLs.
>> I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing?
It is not very complicated, we use hashes a lot to simplify things. The index is basically a really large hash table mapping word_hash -> [list of url hashes]
Then if you search for "The lazy fox" we just take the intersection between the three lists of url hashes to get all the urls which have all words in them. This is the basic idea that is implemented right now but we will of course try to improve.
I realize I'm asking for a free ride here, but could you explain what happens after the index scan? In a phrase search you'd need to intersect, union or remove from the results. Are you using roaring bitmaps or something similar?
If you are solely supporting union or solely supporting intersection then roaring bitmaps is probably not a perfect solution to any of your problems.
There are some algorithms that have been optimized for intersect, union, remove (OR, AND, NOT) that work extremely well for sorted lists but the problem is usually: how to efficiently sort the lists that you wish to perform boolean operations on, so that you can then apply the roaring bitmap algorithms on them.
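For example, once the posting lists are kept sorted, AND boils down to a linear two-pointer merge (generic sketch, independent of any particular bitmap library):

    def intersect_sorted(a, b):
        """Intersect two sorted lists of document IDs in O(len(a) + len(b))."""
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result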
Roaring Bitmaps are awesome. I use them when merging indices. I need to know which items to keep from the old index, so I'm calculating the intersection between two sets of a cardinality around 500,000,000. Without breaking a sweat.
Oh boy, I have too many questions. I'd appreciate any answers you're able/willing to give:
1. Do you have any plans to support the parsing of any additional metadata (e.g. semantic HTML, microformats, schema.org structured data, open graph, dublin core, etc)?
2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.
3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.
4. What are some of your favorite search engines? Have you considered collaboration with any?
1. Yes, any structured data could definitely help improve the results, I personally like the Wikidata dataset. It's just a matter of time and resources :)
2. The first step will probably be to handle this in our "post processing". We query several servers when doing a search and often get many more results than we need and in this step we could quite easily remove identical results.
3. The ranking is currently heavily based on links (same as Google) so we will have similar issues. But hopefully we will find some ways to better determine what sites are actually trustworthy, perhaps with more manually verified sites if enough people would want to contribute.
4. I think that Gigablast and Marginalia Search are really cool; it's interesting to see how much can be done with a very small team.
> Yes, any structured data could definitely help improve the results
Which syntaxes and vocabularies do you prefer? microformats, as well as schema.org vocabs represented as Microdata or JSON-LD, seem to be the most common according to the latest Web Data Commons Extraction Report[0]. The report is also powered by the Common Crawl.
Apologies if I missed it (and solely out of curiosity), but roughly how much does hosting Alexandria Search cost (per month)? (I'm assuming you've optimized for cost to avoid spending your own money!)
I have some other questions (around crawlers, parsing, and dependencies), but I need to read the other comments first (to see if my questions have already been answered).
The active index is running on 4 servers and we have one server for hosting the frontend and the api (the API is what is used by the frontend, ex: https://api.alexandria.org/?q=hacker%20news)
Then we have one fileserver storing raw data to be indexed. The cost for those 6 servers is around 520 USD per month.
I searched for a competitive keyword my SaaS business recently reached #1 on Google for. All of our competitors came up, but we were nowhere to be seen (I gave up after page 5).
Does this mean we’re not in Commoncrawl? Or are there any factors you weight much more heavily than Google might?
I just think that the timing is right.
I think we are at a point in time where it does not cost billions of dollars to build a search engine like it did 20 years ago. The relevant parts of the internet are probably shrinking and Moore's Law is making computing exponentially cheaper, so there has to be an inflection point somewhere.
We hope we can become a useful search engine powered by open source and donations instead of ads.
hello, so i was studying B+ trees today. you see, in the morning i browsed hackernews and saw alexandria.org, opened the tab, kept it open, went about my day, got frustrated with my search results, noticed the alexandria tab and tried it. every result was meaningful. well done.
1. We would prefer to be funded with donations like Wikipedia.
2. I don't think we can avoid it completely, perhaps with volunteers helping us determine the trustworthiness of websites. Do you have any suggestions?
3. I think programmers and people with experience raising money for nonprofits could help the most right now. But if you see some other way you would want to contribute, please let us know!
Regarding raising money, I wouldn't be surprised if, given the current state of things in the EU, you could manage to get some funding. I have no experience with it, but there are companies specialized in helping with writing grant proposals.
I think the fact that after a long while there are new search engines (Kagi was introduced very recently on HN, now this) should be a wake up call for Google - their search has lost some shine for quite a while. Hopefully something will come out of this - competition is good.
My first search on Alexandria was "UTC time". Google gives me the current time in UTC, which is all I needed. Alexandria gave me...a lot of links to click to find what I'm looking for.
Google search is a lot better than people give it credit for.
I agree, Google's instant answers are quite good and have improved, but actually searching to find a site seems to be getting worse and is riddled with paid sites at the top.
we're exploring adding instant answers in a clean way at Breeze; leaning towards using an open-source library &/or external API to compute them vs. building in-house
also adding a premium tier that's alerts + ad-free + a "feeling lucky" mode that would take the user to the top result, which is a UTC page, re: https://breezethat.com/?q=UTC+time
I just tried breezethat and had to scroll past 6 ads (two screenfuls on my iPhone 10) to see a single result. I know ads are necessary but this is punitive.
I have been using Kagi since it was introduced here on HN.
I find it absolutely great to use.
I find its results to be of much better quality than Google.
Google search shows a lot of SEOd results which are absolutely horrible in quality, filled with Adsense ads, and have Amazon affiliate links in them.
Kagi is a breath of fresh air for me.
I also use You.com sometimes. When I am exploring something for the first time, you.com is my place to go. It gives one a good lay of the land which is missing from others.
I still find Google to be the best for looking up code syntax, simple solutions and so on.
But when I am looking for something that depends on opinion, I find Kagi to show much better results and not SEO vomit. (When one wants direct facts even Bing is sufficient.)
On some days my Kagi usage surpasses my Google usage.
Competition is definitely good. But this thing is a toy compared to not just Google, but all the other major search players out there. Hopefully it will continue to advance.
At some point, AI and NLP and raw processing power will have progressed so much that "search" is not a problem anymore, and I think we're getting there. Google can up their game but it won't matter much. The only thing they have left is brand recognition.
IMO search has had its goalpost moved. It used to be about scale, technical challenges, bandwidth, storage, etc. It is still about that, but a significantly harder challenge to solve has come up: searching in a malicious environment. SEO crap nowadays completely dominates search, Google has lost the war.
Simply put, I believe that Google sucks at search, in the modern context. It is great at indexing, it has solved phenomenal technical challenges, but search it has not solved. Why do I have to write site:stackoverflow.com or site:reddit.com to skip the crap and go to actual content? Why can my brain detect blogspam garbage in 0.5 seconds of looking but billion dollar company Google will happily recommend it as the most relevant result above a legitimate website?
Google has necessarily arbitrary criteria by which pages are ranked. Because Google is the only game in town, anyone with a primary goal of driving traffic will pursue those metrics (i.e. SEO). To the extent that those criteria deviate even slightly from actual good results, large parts of the internet will dilute their content to pursue them, which both lowers their quality and further drives down the gems of the internet.
The ranking would have to vary over an infinite spread of purposes for webpages, and it would have to converge almost perfectly to what is actually most helpful. Among all the technical problems, Google will not optimize correctly against ads for the same reason that websites trying to drum up affiliate purchases and ad revenue won't put content quality above SEO.
When recipes return to having the recipe and ingredients first, followed by an optional life story, I'll revisit my assessment.
Google Research is also one of the top (NLP|IR) R&D gigs in town - they developed BERT, a model that has redefined how NLP is done, and the paper describing it had already collected 800 citations by the time it was published, thanks to a pre-print spreading like wildfire.
Thank you for building and sharing this. While many people rightly point out that this isn't a replacement for Google yet, the value of a shared working open source code base has been underestimated many times in the past.
I hope this is a project that grows to solve real needs in this space. However, even if it never makes it past this point, there is a chance someone will be inspired by this to construct their own version. Maybe in a different language with a different storage format or a different way of ranking results.
This is really fast and cool. Looking for music-related pages, I've already found some interesting websites, like Wall of Ambient, which caters to ambient labels (https://wallofambient.com/#)
I noticed that if a term can't be found, it reports a random number of results found, but nothing is actually displayed. E.g.: https://www.alexandria.org/?c=&r=&q=moonmusiq
I'll keep trying this out. It seems really promising
I was amused by the name (beyond the obvious reference to the ancient library, it also gender-switches on a heteronym of Fernando Pessoa, Alexander Search)
I discovered Alexandria a month ago and was shocked by the quality of its results; while it's no Google/Bing replacement, it was leagues ahead of most other independent engines I'd collected and monitored for over a year[0].
Unfortunately it seems it doesn't support Cyrillic or Bulgarian well. I searched for the mayor of the city I live in and got 5 results, all irrelevant. Unfortunately the experience in 'minor' languages is consistently bad in all alternative search engines.
I just tried out this search engine and was very favorably impressed. It was quite responsive (though that could be affected by demand) and gave good results. I really like the lack of goo (e.g., ads) and the spare, clean presentation. I think it might be a great search engine for visually disabled users who rely on screen readers.
Why don't search engines have filters? Every single consumer retail website's search uses filters to help shoppers find something to buy. It is way more convenient than hoping the user can guess the magic search phrase to find the thing they're looking for (if they even know what that thing is).
I love the shortcut Alexandria takes by indexing Common Crawl instead of crawling the web themselves. It's how I would have bootstrapped a new search engine. In a future iteration they can start crawling themselves, if there is sufficient interest from the public.
Searching is screamingly fast.
The index seems stale, though. Alexandria, how old is your index?
How long did it take you to create your current index? Is that your bottleneck, perhaps, that it takes you a long time (and lots of money?) to create a Common Crawl index?
So, interesting thing: when I visit this site for the first time (in Firefox), why is the search box showing a dropdown with a bunch of my previous searches? I can't tell where they are from, but it is all stuff I have searched for in the past. I thought it might be the browser populating a list, but that should be based on the same domain. So where is it pulling this from? Some of the search terms are months, perhaps more than a year, old.
I can only assume that Firefox associates filled-in data with the name of the input control; in this case, “q”, which is probably typical for a search inputbox.
But what is different in terms of its indexing algorithm? The original secret sauce for google was the PageRank algorithm, which was mathematically genius. Are you using a similar algorithm?
> Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month. ...
Slightly tangential, but does anyone know if there is a way to submit links to the Common Crawl (which Alexandria Search relies on)? I haven't seen any traffic from CCBot and my site doesn't seem to show up in Alexandria's results (compared to 2nd/3rd on Google for a bunch of queries).
Excuse me while I get on a hobbyhorse - would love to use web search that lets me boost PageRank for certain sites (which then would carry over to sites they link to.) Could automatically boost PageRank for sites I subscribe to, for example. Expensive in terms of computation or storage? Charge me!
That's easy to do (it's Personalized PageRank), but VERY expensive. Like just tossing them a few dollars doesn't cut it. You basically need your own custom index for that, as the way you achieve fast ranking is by sorting the documents in order of ranking within the index itself. That way you only need to consider a very small portion of the index to retrieve the highest ranking results.
You might get away with having like a custom micro-index where your search basically does a hidden site:-search for your favorite domain and related domains, but that's not quite going to do what you want it to do.
Realistically you could probably get away with something like a couple of terabytes, and then default to the regular index if it isn't found in the neighborhood close to your favored sites, but that's still anything but cheap, especially since this can't be some slow-ass S3 storage; it should ideally be SSDs or a RAID/JBOD configuration of mechanical drives. That means you're also paying for a lot of I/O bandwidth and overall data logistics.
If you try to rent that sort of compute, you're probably looking at like $100-200/month.
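For anyone curious what "Personalized PageRank" means concretely: it's ordinary PageRank where the random jump returns to your preferred pages instead of being spread uniformly, so the boost naturally carries over to the sites they link to. A toy power-iteration sketch on a hypothetical link graph (illustrative only; ignores dangling-node corrections):

    def personalized_pagerank(links, preferred, damping=0.85, iters=50):
        """links: {page: [pages it links to]}; every link target must also be a key.
        preferred: the pages (e.g. your subscriptions) to bias the ranking toward."""
        pages = list(links)
        teleport = {p: (1.0 / len(preferred) if p in preferred else 0.0) for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new_rank = {p: (1 - damping) * teleport[p] for p in pages}
            for page, outgoing in links.items():
                for target in outgoing:
                    # Each page passes a damped share of its rank to its outlinks.
                    new_rank[target] += damping * rank[page] / len(outgoing)
            rank = new_rank
        return rank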
The initial commit was 11 months ago and written in C++. I haven’t done C++ since college ~7 years ago. Is it a good language for greenfield projects these days, or would something like Go or Rust (or Crystal, Nim, Zig) be better for maintainability and acquiring contributors?
I don't think the problem is that there's a shortage of developers, much less the language. There's a shortage of people with experience working with search engines and the algorithms needed to make them work reliably as intended.
I would prefer D, but I ended up using C++ for one of my projects because it has all the libraries, and these smaller languages only have bindings for some of them.
So my answer is: if you can do all you need in the base language, something more modern like D-lang is preferred, but if you need some particular library, you either have to add all the bindings yourself, or use C++.
Main thing I'd worry about with C++ is heap fragmentation over time. Something like TCMalloc or JEMalloc might help a bit but it's hard to get around doing this type of thing in C++.
The status line below the search box says that it found many results, but the results are empty. Also, when hitting F5 a couple of times, the number jumps around.
Keep up the great work. I think there is a lot of potential in Common Crawl and things built on top of it.
Agreed. When people say Google is returning poor results, I can never get an answer about what specific URL they actually wanted. Just general unhappiness and some mythical, vague, ideal result.
google search often returns wrong, unrelated, or no result. i have been in that situation many times before and moved on to change my query.
i am (and most of us are) trying to solve my own issue. not google's.
that's why you get vague answers to your questions. it's not because it doesn't happen. it's because at that moment we care much more about solving our problem. that's what brought us to google search in the first place.
thanks for engaging, I'm trying to understand. can you give a specific example of a search query, and the ideal URL you think you should get as a result (that google is missing)? what is one query<->result example of a gap these independent engines are filling?
I'll use it for general searches over the next few days, because that's the only way to do a fair evaluation.
I just searched for python3 join string and I didn't get the Python docs on the first page. Both DDG and Google have them at position 9, which is way too low. At least I got a different set of random websites and not the usual tutorialspoint, w3schools, geeksforgeeks etc that I usually see in these cases.
Really like the minimal UI and the speed! Great work.
A few of my test searches came up with very useful results. However, one disappointment was searching for a javascript function, for example "javascript array splice", and the MDN site was not in the results. Adding "MDN" or "Mozilla" to the search did not help either.
The privacy settings are defaulting to unchecked, but the description above them suggests that they default to checked. This makes me wonder how the settings are actually being interpreted (i.e., what the actual initial state is).
Interesting, searching for "Debian", the third result is the rustc package and the eighth is the GitHub repo for the Debian packaging of bino. I wonder how this search engine does its ranking.
I suggest you start by not implementing a crawler but using commoncrawl.org instead. The problem with starting a web crawler is that you will need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big issue and most of the issues are non-technical.
I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.
Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?
There seems to be a threshold where you get greylisted by cloudflare. Not sure if it's requests per day or what they're doing. But I've been able to mostly circumvent it by crawling at a modest rate.
This seems like a reasonable fallback option but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like cloudflare to get "crawl rights"?
SearX and Searxng are the most common options, but instances often get blocked by the engines they use. Users need to switch between instances quite often.
eTools.ch uses commercial APIs so it doesn't get blocked, but it might block you instead (very sensitive bot detection).
Dogpile is one of the older metasearch engines, but I think it only uses Bing- and Google-powered engines.
per other folks on the thread, nearly impossible legally due to TOU/TOS without getting around those restrictions with rotating proxies, remote browsing + proxies, etc.
also part of why we're going with the topic-filter approach at Breeze -- so if you search all the web, we'll give you the option to open others with indie indexes in a new tab -- say Mojeek or Yandex -- similar to what airline search engines do
if you switch to say code search, you can use ours or redirect to any one of say PublicWWW, Nerdy Data, or Builtwith
pretty much no other legal way to do it, in the main
i've found several search engines/services besides G and DDG, and the only thing i can't figure out is: do these search services have some SEO-style ranking, or is it just a random list of all resources? i mean, how do they order search results?
How does it work? The GitHub page is not very descriptive. I tried to search "Putin" and the first link is the NYTimes homepage. Does that mean NYTimes covers the war more than the other publications, or is it backlink-driven?