Ask HN: Can you borrow Google's search algorithm?
10 points by brittpart_ on Dec 14, 2020 | 13 comments
Or do I have to start from scratch? Maybe relating to NLP.



More prosaic, but if you have a blog or similar, you can add a Google Programmable Search Engine: https://programmablesearchengine.google.com/about/


You can also put up a DuckDuckGo search box; it works well for website search and supports DDG instead of Alphabet.

EDIT: Shameless plug, if it gets the author to go with a non-Google option: https://www.gkbrk.com/wiki/DuckDuckGoSearchBox/


Mm... PageRank is the basis, but recent publications suggest they use a multitude of factors (probably 7-10+?) to evaluate sites. PageRank is still more or less the biggest chunk, but since people learned how to game it, they've likewise had to step up their internal game on ranking. What you're requesting is probably a trade secret, but you could get reasonable results with a PageRank-inspired hybrid.
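
A minimal sketch of what such a hybrid could look like, just to make the idea concrete - the weights and the precomputed scores here are made-up placeholders, nothing Google actually publishes:

    # Toy hybrid ranking: blend a link-based score (e.g. PageRank) with a
    # text-relevance score. The weights are arbitrary placeholders.
    def hybrid_score(pagerank_score, text_relevance, w_link=0.6, w_text=0.4):
        return w_link * pagerank_score + w_text * text_relevance

    # Hypothetical precomputed scores for two candidate pages.
    candidates = {
        "example.com/a": {"pagerank": 0.012, "relevance": 0.80},
        "example.com/b": {"pagerank": 0.045, "relevance": 0.30},
    }

    ranked = sorted(
        candidates,
        key=lambda url: hybrid_score(candidates[url]["pagerank"],
                                     candidates[url]["relevance"]),
        reverse=True,
    )
    print(ranked)  # pages ordered by the blended score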


The PageRank algorithm is public and there are plenty of implementations, for instance https://networkx.org/documentation/networkx-1.10/reference/g...
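
For example, with networkx it's basically a one-liner; the little link graph below is purely illustrative:

    import networkx as nx

    # Tiny directed link graph: an edge A -> B means "A links to B".
    G = nx.DiGraph()
    G.add_edges_from([
        ("a.com", "b.com"),
        ("c.com", "b.com"),
        ("b.com", "d.com"),
        ("d.com", "a.com"),
    ])

    # networkx implements the classic PageRank algorithm directly.
    scores = nx.pagerank(G, alpha=0.85)
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")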


I’m pretty sure it was patented at the beginning, so it should be free by now.


What are you trying to achieve? Search your website's blog posts? Create an e-commerce search engine? Job search engine? Web search engine? Just learn about how search tech works?

All of these are radically different domains; some of them require intensive NLP, others require entirely different techniques...


Funny thing I was thinking about recently... why not reverse engineer it? Run the top million most common queries (or maybe the top billion), snapshot the top 100 results for each, and use that to train your model. With cheap enough compute, can Google be reverse engineered?
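
Roughly, you'd be collecting (query, result, position) triples and fitting a ranking model to them. A skeleton of just the data-collection step - fetch_top_results is a hypothetical stand-in you'd have to write yourself, and scraping Google results is against their ToS, so treat this as illustrative only:

    # Hypothetical skeleton for building a learning-to-rank dataset from
    # observed rankings. fetch_top_results() is a placeholder, not a real API.
    def fetch_top_results(query, n=100):
        """Return the top-n result URLs for a query (not implemented here)."""
        raise NotImplementedError

    def build_training_set(queries):
        rows = []
        for q in queries:
            for position, url in enumerate(fetch_top_results(q), start=1):
                # Lower position = more relevant; a ranking model would learn
                # features of (query, url) that predict this position.
                rows.append({"query": q, "url": url, "position": position})
        return rows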


You'd have to feed your model a copy of the entire internet, and if you can do that, you've already done the hard part of creating a Google clone imo.

In general, if folks want to know how Google works, just do some reading on grey hat / black hat SEO. There is an entire (somewhat) underground industry of people who have ranking in Google down to a science - put exactly this on your page, set up exactly these linking domains with exactly this type of content, satisfy all of these metrics, etc. I honestly think the reason competing search engines are so much worse is that none of them have tried very hard, or maybe they just lack funding.

AFAIK, the algorithm is still at its core what it always has been (getting PR links to your page), but Google has added a bunch of layers on top that basically check for things to disqualify you completely or make minor adjustments to your position in the rankings.



Golden, so helpful - thank you!


I would recommend starting with a mature search engine such as Lucene or Elasticsearch.
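
For example, with Elasticsearch and a recent version of its official Python client, indexing and full-text search are only a few lines (this assumes a node running locally on the default port):

    from elasticsearch import Elasticsearch

    # Assumes a local Elasticsearch node; adjust the URL for your setup.
    es = Elasticsearch("http://localhost:9200")

    # Index a couple of documents into a "posts" index.
    es.index(index="posts", id=1, document={
        "title": "Intro to search", "body": "How search engines rank documents."})
    es.index(index="posts", id=2, document={
        "title": "NLP basics", "body": "Tokenization, stemming and embeddings."})
    es.indices.refresh(index="posts")

    # Full-text query; Elasticsearch scores matches with BM25 by default.
    resp = es.search(index="posts", query={"match": {"body": "search ranking"}})
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["title"])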


Forgive me if this doesn't make sense:

If I'm implementing search in an application and want to use NLP, do I need to train the search, or are these solutions ready to go out of the box? I'm not sure how other people do it / how search works / whether you need to tell it what to do.


Well, of course the engine needs access to some corpus to search over. So the general answer to your question is yes; however, this step is typically not called "training" but "indexing".

Most engines will repeatedly index your content with crawlers or similar.
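
To make "indexing" concrete, here's a toy inverted index - the kind of structure an engine builds over your corpus - with no training involved:

    from collections import defaultdict

    # Toy inverted index: map each term to the set of documents containing it.
    docs = {
        1: "pagerank ranks pages by their incoming links",
        2: "elasticsearch indexes documents for full text search",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        """Return documents containing every query term (boolean AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = index.get(terms[0], set()).copy()
        for term in terms[1:]:
            results &= index.get(term, set())
        return results

    print(search("full text search"))  # {2}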



