There have been a few search engines out recently. I'm curious how people evaluate them quickly.
I've realized my searching is basically optimized for google and the web that has grown up around it. Also, in 1998 I wasn't as aware of what was out there as I am now. It's pretty rare (even if it's possible) that I do a search and come across a completely new site that I haven't heard of before, for anything nontrivial. That was different when search began.
Google is now almost a convenience. If I have a coding question, I search for "turn list of tensors into tensor" or whatever but I'm really looking for SO or the pytorch documentation, and I'll ignore the geeksforgeeks and other seo spam that finds its way in. It's almost like google is a statistical "portal" page, like Yahoo or one of those old cluttered sites was, that lets me quickly get through the menus by just searching. That's different from a blank slate search like we might have done 25 years ago.
I think what's really lacking now is uncorrupted search for anything that can be monetized. Like I tried to search for a frying pan once on google and it was unusable. I'm not sure any better search engine can fix that, that's why everyone appends "reddit" to queries they are looking for a real opinion on, again, because they are optimizing for the current state of the web.
Anyway, all that to say I think there are a lot of problems with (google dominated) search, but they are basically reflected in the current web overall, so just a better search engine, outside of stripping out the ads, can only do so much. Real improved search efforts need to somehow change the content that's out there at the same time as they improve the experience, and show us, in a simple way, how to get the most out of it. I think google has a much deeper moat than most people realize.
> I've realized my searching is basically optimized for google
Is it just me, or does Google no longer provide good results for me?
Like every time I search for something completely outside my knowledge, like "How to purchase a property in Mexico", it gives me 100+ results of autogenerated content like "10 best places to buy property in Mexico". And the only way to fix that would be to add something like `site:reddit.com`.
> Is it just me, or does Google no longer provide good results for me?
I am starting to suspect that there might be nothing to find.
I just don't think people (other than the tech-oriented) are creating websites and running forums - and why would they? Reddit might be the only place you _can_ find that type of content. What should search engines do then?
With a tiny number of exceptions, it might be that people chat on reddit, read Wikipedia, ask questions on the stackexchange network/Quora, local communities use facebook groups, and businesses have a wordpress site with nothing more than a bit of fluff, a phone number and an email address.
Kagi's "noncommercial" lens, search.marginalia.nu, and engines that don't parse JS (e.g. Mojeek) can add some variety. Another thing I like doing is adding phrases like "creative commons" to already-long queries to filter out some corporate results, or adding `-gdpr -ccpa -"sell my info"` to limit results to sites made by small orgs and individuals who don't collect enough data or make enough money to warrant compliance measures.
If all websites try to optimise for SEO, they undermine the assumption that a search engine's ranking purely reflects how well a site satisfies a query.
I really think that, one, we are going to end up with search engines managing a curated list of 'roots', and, two, those roots are going to end up consisting substantially of a mix of more 'human' sites and, let's be honest with ourselves, a certain amount of content that pays for favoritism.
I think it's very possible that we have effectively raised the noise floor so high that there is no signal, but also likely that perverse incentives from trying to profit off of search engines have made them our enemies instead of our friends.
For instance, does Google favor sites that run Google's own tools? I've stopped paying attention but recall hearing mutterings to that effect. If so, then running the tools is a protection racket.
For other perverse incentives: if you try to rank sites by how long someone stays on them before backing out, or searching again, then you end up favoring rabbit-hole sites, that either string you along or suck you into a tangent. "Oh, this must have answered their question about keeping bees," no, they're reading gossip about the Queen of England and have forgotten all about beekeeping.
Oh, look! It's the same "flaw" people have been explaining to each other for a decade. We've all seen it dozens of times, yet Google apparently remains in the dark.
> because they are optimizing for the current state of the web.
I believe people will at least start looking for alternatives. For example, I have been collecting search engines, and whenever I encounter a page with too many commercial-laden SEO-porked results, I use a different search engine in Firefox.
I have enabled the Search Bar, so I can do Alt+D, Tab, Tab, enter my query, then click a different search engine, which searches instantly, unlike the main bar, where you have to press Enter once more after clicking.
Pro tip: Alt+E takes you directly to the search bar, then you can press Tab to select the search engine. The best part is that you never use the mouse this way. You can also use DDG bangs from the address bar (Alt+D) if you remember the bang for the site; they cover just about every search engine/site.
And it did not work. I hit a brick wall. I completely lost trust in Firefox. I want a browser created by a non-profit. Thank you Google for corrupting everything you touch.
Some of these (Andi, nee Lazyweb; You; SwissCows) are Bing proxies. Gnod is a search launcher, not an engine unto itself.
Many more installable engines are available at https://mycroftproject.com as OpenSearch XML plugins, compatible with Firefox and discoverable by Chromium.
I've been wondering for a while now about building a search engine for the ad free web. That is, penalize or outright refuse to index any recognized advertising network, letting through only those sites which don't perform invasive tracking with third party services. Mostly as a curiosity: what would be left? What would rise to the top when you filter all of that out?
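One crude way to sketch that filter: scan each crawled page's markup against a blocklist of known ad/tracker domains and only index pages that reference none of them. A tiny, purely illustrative Python version (a real filter would draw on maintained lists like EasyList rather than this hand-picked sample):

    import re

    # Illustrative blocklist only; real filters use maintained lists (e.g. EasyList).
    AD_TRACKER_DOMAINS = [
        "doubleclick.net",
        "googlesyndication.com",
        "google-analytics.com",
        "adnxs.com",
        "facebook.net",
    ]

    def looks_ad_free(html):
        """True if the page references none of the known ad/tracker domains."""
        urls = re.findall(r'(?:src|href)\s*=\s*["\']([^"\']+)', html, flags=re.IGNORECASE)
        return not any(domain in url for url in urls for domain in AD_TRACKER_DOMAINS)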
I've thought about something similar, basically "the good internet" that would be a hand curated list of sites that are not there just as a pretext for ads. I think a lot of software project documentation qualifies, as does lots of stuff on university sites, like lecture notes for example. I assume that across different niches there is other stuff like this. I think the key would be something that can't be gamed, like it has to be legitimate content that is online for an existing purpose and not as a pretext.
exploring adding that search / something similar as a Breeze filter at https://breezethat.com/ - not sure we can do that with curation + Bing or Google; may have to wait until we bring our scraper out of testing
Evidently someone disagrees, but it has no ads or trackers except on the home page, and its pages rank highly on current search engines, so if you exclude trackers and ads, that's what you're going to get.
DDG's organic link results are from Bing, sans personalization. DuckDuckGo advertises using "over 400 sources", which means that at least 399 sources only power infoboxes ("instant answers") and non-generalist search, such as the Video search.
The "Web track" task at the annual US NIST TREC conference ( https://trec.nist.gov/
) is an open innovation benchmark that everyone can contribute; participants get a set of queries that they have to run on exactly the same corpus. Then they return the top-k results to a team that evaluates them.
(Founder of Neeva) Eval for search engines is about as hard as ranking for search engines. You need rating templates, raters, querysets, tooling and lots of time and patience. There are a number of vendors who can help you with the raters part, but the rest is still painstaking work you have to do yourself. Email me if you need help; we are happy to share findings.
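To give a flavor of the eval side once you have raters: offline quality is usually summarized with metrics like nDCG computed over graded relevance labels per query. A minimal, generic sketch (standard IR formula, nothing Neeva-specific):

    import math

    def dcg(relevances):
        """Discounted cumulative gain for a ranked list of graded relevance labels."""
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        """DCG normalized by the DCG of the ideal (best possible) ordering."""
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Example: raters graded the top five results of one query on a 0-3 scale.
    print(ndcg([3, 2, 3, 0, 1]))  # ≈ 0.97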
There are benchmarks within the adjacent field of information retrieval, but in general it's hard to properly validate a search engine because real data is so noisy and misbehaved, and sample data is so different from real data.
Sure, the problem of information retrieval is not exactly that of web search, but they're pretty close. So, as someone knowledgeable on this topic, could you remind us: what are some of those benchmarks?
I'd love to be able to add a tag to a search to have it exclude sites with any kind of monetization, I know that's not realistic cuz that's where Google makes most of its money (or do they make most of their money somewhere other than advertising these days?). Anyways, yeah, I'm sick of SEO optimized, click optimized, advertising optimized, affiliate link optimized crap.
(Founder of Neeva here) -- Chris -- I love the idea. Neeva does this in a contextually relevant manner to the intent of the query. For example, on health queries, we label all health sites as "trusted", "ad-supported" etc. and allow you to filter down to the appropriate subset of results. For programming queries, we label sites as "official sites", "forums", "blogs", "code repos", "programming websites" (the SEO-ed ones). We work with human raters to do this. Would love to hear if you find it useful and what other labels would be of use.
I'm surprised you are not a fan of geeksforgeeks. While each of their webpages has substantially less content than the pytorch docs or SO result, I find that they get to the point instantly. My mean time to solution from G4G is definitely lower than from SO.
I guess everyone has their go-to sites and their pet peeves. Geeksforgeeks may be less spammy than some, but I still think of it as that annoying site that got in the way of either the SO or documentation answer that I was looking for.
Just to expand, if I want the api reference, say I search for defaultdict (for some reason I like using them but always have to look at the reference), I want the python documentation. I definitely don't want a third party telling me about it.
And if I search a "make list of tensor into tensor" type question, I want SO where someone had asked the same question and got "tensor.stack" as the reply, so I can understand the answer and follow up by looking at the tensor.stack pytorch reference it I want.
Anything else is wasting my time, I think most users with similarly specific queries are not looking for tutorials, they are looking for the names of functions they hypothesize exist, or references. That's why intermediary sites that try to give an explanation are annoying, at least for me.
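For concreteness, the SO-style answer being described really is a one-liner; torch.stack is the actual PyTorch function (assuming the tensors share a shape):

    import torch

    # A list of same-shaped tensors...
    tensors = [torch.zeros(3), torch.ones(3), torch.full((3,), 2.0)]

    # ...stacked along a new first dimension into a single (3, 3) tensor.
    stacked = torch.stack(tensors)
    print(stacked.shape)  # torch.Size([3, 3])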
I generally find that sites like SO, GFG, etc. often play the role of "Reading the Docs as a Service". I prefer using them only after official documentation or specifications fail me. When I want an opinionated answer, I just ping some people I already know or check the personal websites of the developers of the language/tool I'm using. If I have further questions, I check the relevant IRC channel. Sites like SO are a "last resort" for me.
In other words, I'd rather see these at the bottom of the SERP than the top, but I wouldn't want to completely eliminate them.
I've found that their content is often inaccurate or written by people who come across as novices. I actually emailed them to correct an inaccuracy in one of their articles once, which they did, so kudos to them for that.
Around the same time, I also discovered sengine.info, Artado, Entfer, and Siik. By sheer coincidence they all were mentioned to me or decided to crawl my site within the same couple weeks. So yes, from my perspective there have been more than a few new smaller engines getting active on the heels of bigger names like Neeva, Kagi, Brave Search, etc.
I've been using the Kagi beta for a few months now, and it's awesome: https://kagi.com/
The biggest thing I've found is that when doing technical searches it always turns up the sources I'm actually looking for, actively filtering all the GitHub/StackOverflow copycat sites.
It also seems to up-weight official docs compared to Google. For example, "how to read a json file in python" turns up the Python docs as the second result, where in Google they're nowhere to be found.
1. free version is curated topic filters that sit on top of Google -- best on laptop or desktop at the moment / iterating on mobile due to ad splash; it's unclear, due to the TOU, whether we can ever make that fully server-side legally, though we do have some things in testing to see if the free version can be ad-free or more tracking-free
2. premium version will be mix of our scraping and Bing, depending on topic - standard web search via Bing + same curation as (1), closer to real-time for us
3. have tested out most other indie indexes or sites; for full web scale, money is on ahrefs or Brave giving Bing / Google a run for it
4. our primary emphasis is on bringing back some of the Yahoo! directory or Alta Vista look & feel of drilling into topics, so balancing 1-3 (^) as best we can atm, small team, fully bootstrapped modulo tiny F&F round
5. also @DotDotJames on twitter, still iterating when add team info to site
I've been keeping my eye on You.com, tracking a few SERPs over time compared to other Bing- and Google-based engines. So far, the results don't seem independent.
Try comparing results with a Bing-based engine (e.g. DuckDuckGo) or a Google-based one (e.g. StartPage, GMX) to see if they differ. (Don't use Google or Bing directly, since results will be personalized based on factors like location, device, your fingerprint, etc.).
maybe the solution is to make google itself reddit style. let users downvote the seo spam websites and allow them to be downranked.
sure it opens the door for a different kind of abuse... but maybe that problem is more fixable?
To paint with a broad brush, I look at three criteria:
1. Infoboxes ("instant answers") should focus on site previews rather than trying to intelligently answer my question. Most DuckDuckGo infoboxes are good examples of this; Bing and Google ones are too "clever".
2. Organic results should be unique; most engines are powered by a commercial Bing API or use Google Custom Search. Compare results with a Bing or Google proxy (duckduckgo, startpage, etc) to avoid personalized results. Monitor queries over time to see if SERPs change in ways that diverge from Google/Bing/Yandex.
3. "other" stuff. Common features I find appealing include area-specific search (Kagi has a "non-commercial lens" mostly powered by its Teclis index; Brave is rolling out "goggles"), displaying additional info about each result (Marginalia and Kagi highlight results with heavy JS or tracking), user-driven SERP personalization (Neeva and Kagi allow promoting/demoting domains), etc.
And always check privacy policies, TOS, GDPR/CCPA compliance, etc.
> Google is now almost a convenience. If I have a coding question, I search for "turn list of tensors into tensor" or whatever but I'm really looking for SO or the pytorch documentation, and I'll ignore the geeksforgeeks and other seo spam that finds its way in. It's almost like google is a statistical "portal" page,
I like engines like Neeva and Kagi that allow customizing SERPs by demoting irrelevant results; I demote crap like GFG, w3schools, tutorialspoint, dev(.)to, etc. and promote official documentation. Alternatively, you can use an adblocker to block results matching a pattern: https://reddit.com/hgqi5o
My name is Josef Cullhed. I am the programmer of alexandria.org and one of its two founders. We want to build an open-source, nonprofit search engine; right now we are developing it in our spare time and funding the servers ourselves. We are indexing Common Crawl and the search engine is in a really early stage.
We would be super happy to find more developers who want to help us.
I was trying to learn more about the ranking algorithm that Alexandria uses, and I was a bit confused by the documentation on Github for it. Would I be correct in that it uses "Harmonic Centrality" (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf) at least for part of the algorithm?
Yes, our documentation is probably pretty confusing. It works like this: the base score for all URLs on a specific domain is the harmonic centrality (HC).
Then we have two indexes, one with URLs and one with links (we index the link text).
Then we first make a search on the links, then on the URLs. We then update the score of the urls based on the links with this formula:
domain_score = expm1(5 * link.m_score) + 0.1;
url_score = expm1(10 * link.m_score) + 0.1;
then we add the domain and url score to url.m_score
where link.m_score is the HC of the source domain.
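Putting those pieces together, here's a rough Python sketch of that update step as described above (variable names are illustrative, not Alexandria's actual C++ code):

    import math

    def update_url_score(base_hc_score, incoming_links):
        """Combine a URL's base score (its domain's harmonic centrality)
        with boosts from its incoming links, per the formulas above."""
        score = base_hc_score
        for link in incoming_links:
            # link["m_score"] is the harmonic centrality of the link's source domain.
            domain_score = math.expm1(5 * link["m_score"]) + 0.1
            url_score = math.expm1(10 * link["m_score"]) + 0.1
            score += domain_score + url_score
        return score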
The main scoring function seems to be index_builder<data_record>::calculate_score_for_record() in line 296 of https://github.com/alexandria-org/alexandria/blob/main/src/i..., and it mentions support for BM25 (Spärck Jones, Walker and Robertson, 1976) and TFIDF (Spärck Jones, 1972) term weighting, pointing to the respective Wikipedia pages.
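For readers unfamiliar with those weighting schemes, a minimal BM25 scorer looks roughly like this (generic textbook form, not Alexandria's implementation; k1 and b are the usual free parameters):

    import math

    def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
        """Textbook BM25: IDF-weighted, length-normalized term frequencies."""
        doc_len = len(doc_terms)
        score = 0.0
        for term in query_terms:
            tf = doc_terms.count(term)
            if tf == 0:
                continue
            n = doc_freq.get(term, 0)  # number of documents containing the term
            idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return score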
Thanks for sharing this with the world. Did you manage to include all of a common crawl in an index? How long did that take you to produce such an index? Is your index in-memory or on disk?
I'd consider contributing. Seems you have something here.
The index we are running right now contains all URLs in Common Crawl from 2021, but only URLs with direct links to them. This is mostly because we would need more servers to index more URLs and that would increase the cost.
It takes us a couple of days to build the index but we have been coding this for about 1 year.
Love it. Makes for a cheaper infrastructure, since SSD is cheaper than RAM.
>> It takes us a couple of days to build the index
It's hard for me to see how that could be done much faster unless you find a way to parallelize the process, which in itself is a terrifyingly hard problem.
I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing? According to you, what kind of data structure allows for the fastest indexing and how do you represent it on disk so that you can read your on-disk index in a forward-only mode or "as fast as possible"?
Yes it would be impossible to keep the index in RAM.
>> It's hard for me to see how that could be done much faster unless you find a way to parallelize the process
We actually parallelize the process. We do it by splitting the URLs across three different servers and indexing them separately. Then we just run the searches on all three servers and merge the resulting URLs.
>> I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing?
It is not very complicated, we use hashes a lot to simplify things. The index is basically a really large hash table mapping word_hash -> [list of url hashes]
Then if you search for "The lazy fox" we just take the intersection between the three lists of url hashes to get all the urls which have all words in them. This is the basic idea that is implemented right now but we will of course try to improve.
I realize I'm asking for a free ride here, but could you explain what happens after the index scan? In a phrase search you'd need to intersect, union or remove from the results. Are you using roaring bitmaps or something similar?
If you are solely supporting union or solely supporting intersection then roaring bitmaps is probably not a perfect solution to any of your problems.
There are some algorithms that have been optimized for intersect, union, remove (OR, AND, NOT) that work extremely well for sorted lists but the problem is usually: how to efficiently sort the lists that you wish to perform boolean operations on, so that you can then apply the roaring bitmap algorithms on them.
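For example, once the posting lists are kept sorted, AND boils down to a linear two-pointer merge (generic sketch, independent of any particular bitmap library):

    def intersect_sorted(a, b):
        """Intersect two sorted lists of document IDs in O(len(a) + len(b))."""
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result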
Roaring Bitmaps are awesome. I use them when merging indices. I need to know which items to keep from the old index, so I'm calculating the intersection between two sets of a cardinality around 500,000,000. Without breaking a sweat.
Oh boy, I have too many questions. I'd appreciate any answers you're able/willing to give:
1. Do you have any plans to support the parsing of any additional metadata (e.g. semantic HTML, microformats, schema.org structured data, open graph, dublin core, etc)?
2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.
3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.
4. What are some of your favorite search engines? Have you considered collaboration with any?
1. Yes, any structured data could definitely help improve the results, I personally like the Wikidata dataset. It's just a matter of time and resources :)
2. The first step will probably be to handle this in our "post processing". We query several servers when doing a search and often get many more results than we need and in this step we could quite easily remove identical results.
3. The ranking is currently heavily based on links (same as Google) so we will have similar issues. But hopefully we will find some ways to better determine what sites are actually trustworthy, perhaps with more manually verified sites if enough people would want to contribute.
4. I think that Gigablast and Marginalia Search are really cool; it's interesting to see how much can be done with a very small team.
> Yes, any structured data could definitely help improve the results
Which syntaxes and vocabularies do you prefer? microformats, as well as schema.org vocabs represented as Microdata or JSON-LD, seem to be the most common according to the latest Web Data Commons Extraction Report[0]. The report is also powered by the Common Crawl.
Apologies if I missed it (and solely out of curiosity), but roughly how much does hosting Alexandria Search cost (per month)? (I'm assuming you've optimized for cost to avoid spending your own money!)
I have some other questions (around crawlers, parsing, and dependencies), but I need to read the other comments first (to see if my questions have already been answered).
The active index is running on 4 servers and we have one server for hosting the frontend and the api (the API is what is used by the frontend, ex: https://api.alexandria.org/?q=hacker%20news)
Then we have one fileserver storing raw data to be indexed. The cost for those 6 servers is around 520 USD per month.
I searched for a competitive keyword my SaaS business recently reached #1 on Google for. All of our competitors came up, but we were nowhere to be seen (I gave up after page 5).
Does this mean we’re not in Commoncrawl? Or are there any factors you weight much more heavily than Google might?
I just think that the timing is right.
I think we are at a point in time where it does not cost billions of dollars to build a search engine like it did 20 years ago. The relevant parts of the internet are probably shrinking and Moore's Law is making computing exponentially cheaper, so there has to be an inflection point somewhere.
We hope we can become a useful search engine powered by open source and donations instead of ads.
hello, so i was studying B+ trees today. you see, in the morning i browsed hackernews and saw alexandria.org, opened the tab, kept it open, went about my day, got frustrated with my search results, noticed the alexandria tab and tried it. every result was meaningful. well done.
1. We would prefer to be funded with donations like Wikipedia.
2. I don't think we can avoid it completely, perhaps with volunteers helping us determine the trustworthiness of websites. Do you have any suggestions?
3. I think programmers and people with experience raising money for nonprofits could help the most right now. But if you see some other way you would want to contribute, please let us know!
Regarding raising money, I wouldn't be surprised if, given the current state of things in the EU, you could manage to get some funding. I have no experience with it, but there are companies specialized in helping with writing grant proposals.
I think the fact that after a long while there are new search engines (Kagi was introduced very recently on HN, now this) should be a wake up call for Google - their search has lost some shine for quite a while. Hopefully something will come out of this - competition is good.
My first search on Alexandria was "UTC time". Google gives me the current time in UTC, which is all I needed. Alexandria gave me...a lot of links to click to find what I'm looking for.
Google search is a lot better than people give it credit for.
I agree, Google's instant answers are quite good and have improved, but actually searching to find a site seems to be getting worse and is riddled with paid sites at the top.
we're exploring adding instant answers in a clean way at Breeze; leaning towards using an open-source library &/or external API to compute them vs. building in-house
also adding a premium tier that's alerts + ad-free + a "feeling lucky" mode that would take the user to the top result, which is a UTC page, re: https://breezethat.com/?q=UTC+time
I just tried breezethat and had to scroll past 6 ads (two screenfuls on my iPhone 10) to see a single result. I know ads are necessary but this is punitive.
I have been using Kagi since it was introduced here on HN.
I find it absolutely great to use.
I find its results to be of much better quality than Google.
Google search shows a lot of SEOd results which are absolutely horrible in quality, filled with Adsense ads, and have Amazon affiliate links in them.
Kagi is a breath of fresh air for me.
I also use You.com sometimes. When I am exploring something for the first time, you.com is my place to go. It gives one a good lay of the land which is missing from others.
I still find Google to be the best for looking up code syntax, simple solutions and so on.
But when I am looking for something that depends on opinion, I find Kagi to show much better results and not SEO vomit. (When one wants direct facts even Bing is sufficient.)
On some days my Kagi usage surpasses my Google usage.
Competition is definitely good. But this thing is a toy compared to not just Google, but all the other major search players out there. Hopefully it will continue to advance.
At some point, AI and NLP and raw processing power will have progressed so much that "search" is not a problem anymore, and I think we're getting there. Google can up their game but it won't matter much. The only thing they have left is brand recognition.
IMO search has had its goalpost moved. It used to be about scale, technical challenges, bandwidth, storage, etc. It is still about that, but a significantly harder challenge to solve has come up: searching in a malicious environment. SEO crap nowadays completely dominates search, Google has lost the war.
Simply put, I believe that Google sucks at search, in the modern context. It is great at indexing, it has solved phenomenal technical challenges, but search it has not solved. Why do I have to write site:stackoverflow.com or site:reddit.com to skip the crap and go to actual content? Why can my brain detect blogspam garbage in 0.5 seconds of looking but billion dollar company Google will happily recommend it as the most relevant result above a legitimate website?
Google has necessarily arbitrary criteria by which pages are ranked. Because Google is the only game in town, anyone with a primary goal of driving traffic will pursue those metrics (i.e. SEO). To the extent that those criteria deviate even slightly from actual good results, large parts of the internet will dilute their content to pursue them, which both lowers their quality and further drives down the gems of the internet.
The ranking would have to vary over an infinite spread of purposes for webpages, and it would have to converge almost perfectly to what is actually most helpful. Among all the technical problems, Google will not optimize correctly against ads for the same reason that websites trying to drum up affiliate purchases and ad revenue won't put content quality above SEO.
When recipes return to having the recipe and ingredients first, followed by an optional life story, I'll revisit my assessment.
Google Research is also one of the top (NLP|IR) R&D gigs in town - they developed BERT, a model that has redefined how NLP is done, and the paper describing it had already collected 800 citations by the time it was published, thanks to a pre-print spreading like wildfire.
Thank you for building and sharing this. While many people rightly point out that this isn't a replacement for Google yet, the value of a shared working open source code base has been underestimated many times in the past.
I hope this is a project that grows to solve real needs in this space. However, even if it never makes it past this point, there is a chance someone will be inspired by this to construct their own version. Maybe in a different language with a different storage format or a different way of ranking results.
This is really fast and cool. Looking for music-related pages, I've already found some interesting websites, like Wall of Ambient, which caters to ambient labels (https://wallofambient.com/#)
I noticed that if a term can't be found, it reports a random number of results found, but nothing is actually displayed. E.g.: https://www.alexandria.org/?c=&r=&q=moonmusiq
I'll keep trying this out. It seems really promising
I was amused by the name (beyond the obvious reference to the ancient library, it also gender-switches on a heteronym of Fernando Pessoa, Alexander Search)
I discovered Alexandria a month ago and was shocked by the quality of its results; while it's no Google/Bing replacement, it was leagues ahead of most other independent engines I'd collected and monitored for over a year[0].
Unfortunately it seems it doesn't support Cyrillic or Bulgarian well. I searched for the mayor of the city I live in and got 5 results, all irrelevant. Unfortunately the experience in 'minor' languages is consistently bad in all alternative search engines.
I just tried out this search engine and was very favorably impressed. It was quite responsive (though that could be affected by demand) and gave good results. I really like the lack of goo (e.g., ads) and the spare, clean presentation. I think it might be a great search engine for visually disabled users who rely on screen readers.
Why don't search engines have filters? Every single consumer retail website's search uses filters to help shoppers find something to buy. It is way more convenient than hoping the user can guess the magic search phrase to find the thing they're looking for (if they even know what that thing is).
I love the shortcut Alexandria takes by indexing Common Crawl instead of crawling the web themselves. It's how I would have bootstrapped a new search engine. In a future iteration they can start crawling themselves, if there is sufficient interest from the public.
Searching is screamingly fast.
The index seems stale, though. Alexandria, how old is your index?
How long did it take you to create your current index? Is that your bottleneck, perhaps, that it takes you a long time (and lots of money?) to create a Common Crawl index?
So, interesting thing: when I visit this site for the first time (in Firefox), why is the search box showing a dropdown with a bunch of my previous searches? I can't tell where they are from, but it is all stuff I have searched for in the past. I thought it might be the browser populating a list, but that should be based on the same domain. So where is it pulling this from? Some of the search terms are months, perhaps more than a year, old.
I can only assume that Firefox associates filled-in data with the name of the input control; in this case, “q”, which is probably typical for a search inputbox.
But what is different in terms of its indexing algorithm? The original secret sauce for google was the PageRank algorithm, which was mathematically genius. Are you using a similar algorithm?
> Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month. ...
Slightly tangential, but does anyone know if there is a way to submit links to the Common Crawl (which Alexandria Search relies on)? I haven't seen any traffic from CCBot and my site doesn't seem to show up in Alexandria's results (compared to 2nd/3rd on Google for a bunch of queries).
Excuse me while I get on a hobbyhorse - would love to use web search that lets me boost PageRank for certain sites (which then would carry over to sites they link to.) Could automatically boost PageRank for sites I subscribe to, for example. Expensive in terms of computation or storage? Charge me!
That's easy to do (it's Personalized PageRank), but VERY expensive. Like just tossing them a few dollars doesn't cut it. You basically need your own custom index for that, as the way you achieve fast ranking is by sorting the documents in order of ranking within the index itself. That way you only need to consider a very small portion of the index to retrieve the highest ranking results.
You might get away with having like a custom micro-index where your search basically does a hidden site:-search for your favorite domain and related domains, but that's not quite going to do what you want it to do.
Realistically you could probably get away with something like a couple of terabytes, and then default to the regular index if it isn't found in the neighborhood close to your favored sites, but that's still anything but cheap, especially since this can't be some slow-ass S3 storage; it should ideally be SSDs or a RAID/JBOD configuration of mechanical drives. That means you're also paying for a lot of I/O bandwidth and overall data logistics.
If you try to rent that sort of compute, you're probably looking at like $100-200/month.
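For anyone curious what "Personalized PageRank" means concretely: it's ordinary PageRank where the random jump returns to your preferred pages instead of being spread uniformly, so the boost naturally carries over to the sites they link to. A toy power-iteration sketch on a hypothetical link graph (illustrative only; ignores dangling-node corrections):

    def personalized_pagerank(links, preferred, damping=0.85, iters=50):
        """links: {page: [pages it links to]}; every link target must also be a key.
        preferred: the pages (e.g. your subscriptions) to bias the ranking toward."""
        pages = list(links)
        teleport = {p: (1.0 / len(preferred) if p in preferred else 0.0) for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new_rank = {p: (1 - damping) * teleport[p] for p in pages}
            for page, outgoing in links.items():
                for target in outgoing:
                    # Each page passes a damped share of its rank to its outlinks.
                    new_rank[target] += damping * rank[page] / len(outgoing)
            rank = new_rank
        return rank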
The initial commit was 11 months ago and written in C++. I haven’t done C++ since college ~7 years ago. Is it a good language for greenfield projects these days, or would something like Go or Rust (or Crystal, Nim, Zig) be better for maintainability and acquiring contributors?
I don't think the problem is that there's a shortage of developers, much less the language. There's a shortage of people with experience working with search engines and the algorithms needed to make them work reliably as intended.
I would prefer D, but I ended up using C++ for one of my projects because it has all the libraries, and these smaller languages only have bindings for some of them.
So my answer is: if you can do all you need in the base language, something more modern like D-lang is preferred, but if you need some particular library, you either have to add all the bindings yourself, or use C++.
Main thing I'd worry about with C++ is heap fragmentation over time. Something like TCMalloc or JEMalloc might help a bit but it's hard to get around doing this type of thing in C++.
The status line below the search box says that it found many results, but the results are empty. Also, when hitting F5 a couple of times, the number jumps around.
Keep up the great work. I think there is a lot of potential in Common Crawl and things built on top of it.
Agreed. When people say Google is returning poor results, I can never get an answer about what specific URL they actually wanted. Just general unhappiness and some mythical, vague, ideal result.
google search often returns wrong, unrelated, or no result. i have been in that situation many times before and moved on to change my query.
i am (and most of us are) trying to solve my own issue. not google's.
that's why you get vague answers to your questions. it's not because it doesn't happen. it's because at that moment we care much more about solving our problem. that's what brought us to google search in the first place.
thanks for engaging, I'm trying to understand. can you give a specific example of a search query, and the ideal URL you think you should get as a result (that google is missing)? what is one query<->result example of a gap these independent engines are filling?
I'll use it for general searches over the next few days, because that's the only way to do a fair evaluation.
I just searched for python3 join string and I didn't get the Python docs on the first page. Both DDG and Google have them at position 9, which is way too low. At least I got a different set of random websites and not the usual tutorialspoint, w3schools, geeksforgeeks etc that I usually see in these cases.
Really like the minimal UI and the speed! Great work.
A few of my test searches came up with very useful results. However, one disappointment was searching for a javascript function, for example "javascript array splice", and the MDN site was not in the results. Adding "MDN" or "Mozilla" to the search did not help either.
The privacy settings are defaulting to unchecked, but the description above them suggests that they default to checked. This makes me wonder how the settings are actually being interpreted (i.e., what the actual initial state is).
Interesting, searching for "Debian", the third result is the rustc package and the eighth is the GitHub repo for the Debian packaging of bino. I wonder how this search engine does its ranking.
I suggest you start by not implementing a crawler but using commoncrawl.org instead. The problem with starting a web crawler is that you will need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big issue and most of the issues are non-technical.
I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.
Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?
There seems to be a threshold where you get greylisted by cloudflare. Not sure if it's requests per day or what they're doing. But I've been able to mostly circumvent it by crawling at a modest rate.
This seems like a reasonable fallback option but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like cloudflare to get "crawl rights"?
SearX and Searxng are the most common options, but instances often get blocked by the engines they use. Users need to switch between instances quite often.
eTools.ch uses commercial APIs so it doesn't get blocked, but it might block you instead (very sensitive bot detection).
Dogpile is one of the older metasearch engines, but I think it only uses Bing- and Google-powered engines.
per other folks on the thread, nearly impossible legally due to TOU/TOS without getting around those restrictions with rotating proxies, remote browsing + proxies, etc.
also part of why we're going with the topic-filter approach at Breeze -- so if you search all the web, we'll give you the option to open others with indie indexes in a new tab -- say Mojeek or Yandex -- similar to what airline search engines do
if you switch to say code search, you can use ours or redirect to any one of say PublicWWW, Nerdy Data, or Builtwith
pretty much no other legal way to do it, in the main
i've found several search engines/services besides G and DDG, and the only thing i can't figure out is: do these search services have some SEO-style ranking, or is it just a random list of all resources? i mean, how do they order search results?
How does it work? The GitHub page is not very descriptive. I tried to search "Putin" and the first link is the NYTimes homepage. Does that mean NYTimes covers the war more than the other publications, or is it backlink-driven?