
You don’t really need to store a full text crawl if you’re going to penalize or blacklist all of the ad-filled SEO junk sites. If your algorithm scores a page below a certain threshold, flag it as junk and store only a hash of the page.
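
A rough sketch of what that storage decision could look like (the threshold value, scoring function, and index layout here are all made up for illustration):

    import hashlib

    JUNK_THRESHOLD = 0.3  # hypothetical cutoff, tuned against a labeled sample

    def store_crawl_result(url, html, quality_score, index):
        # Below the threshold: flag as junk and keep only a content hash,
        # enough to recognize the page again without storing its text.
        if quality_score < JUNK_THRESHOLD:
            index[url] = {
                "junk": True,
                "hash": hashlib.sha256(html.encode("utf-8")).hexdigest(),
            }
        else:
            index[url] = {"junk": False, "score": quality_score, "text": html}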

Another potentially useful approach is to build a graph database of these sites, with pages as nodes and links as edges. If one page gets flagged as junk, you can then lower the scores of all other pages in its clique [1]. This could set off a cascade of junk-flagging, cleaning large swathes of these undesirable sites out of the index.
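
Hand-wavy sketch of that cascade with networkx, treating the link graph as undirected (an edge meaning the pages link to each other) and using made-up penalty and threshold values:

    import networkx as nx

    def propagate_junk(graph, scores, junk_urls, penalty=0.5, threshold=0.3):
        # Repeatedly penalize pages that share a maximal clique with known junk;
        # newly flagged pages can implicate further cliques on the next pass.
        junk = set(junk_urls)
        changed = True
        while changed:
            changed = False
            for clique in nx.find_cliques(graph):  # maximal cliques of an undirected graph
                if junk.isdisjoint(clique):
                    continue
                for url in clique:
                    if url not in junk:
                        scores[url] *= penalty
                        if scores[url] < threshold:
                            junk.add(url)
                            changed = True
        return junk

(Enumerating maximal cliques on every pass is far too expensive at web scale; this is only meant to show the cascade logic.)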

[1] https://en.wikipedia.org/wiki/Clique_(graph_theory)



