
This is the really hard way.

And it is going to fail A LOT.

Do this instead:

1. Contact a company that has a search engine and therefore access to all your links. Samuru ( http://samuru.com ) springs to mind.

2. Do keyword extraction on those pages. Assume that any linking page that doesn't contain any of the keywords of the page being linked to is a Bad link.

3. For the ones that remain, Google the keywords you extracted (around 10 of the words). If the linking page doesn't appear in the top 50 results, it is probably a Bad neighbor according to Google.

This method doesn't require NLTK or grammar checking. You can do it yourself, and you are using Google to tell you if the site is on the Bad Neighbor list, so you don't have to guess.
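
A rough sketch of steps 2 and 3 in plain Python, with no NLTK. It assumes keywords are simply the most frequent non-stopword terms, that step 2 means "the linking page mentions at least one keyword of the linked-to page", and that you already have the list of result URLs from your search. The function names and the tiny stopword list are made up for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "this", "that"}


def extract_keywords(text, limit=10):
    """Return up to `limit` of the most frequent non-stopword terms in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


def shares_keywords(target_text, linking_text, limit=10):
    """Step 2: True if the linking page mentions any keyword of the linked-to page."""
    target_keywords = set(extract_keywords(target_text, limit))
    linking_words = set(re.findall(r"[a-z0-9]+", linking_text.lower()))
    return bool(target_keywords & linking_words)


def in_top_results(linking_url, result_urls, cutoff=50):
    """Step 3: True if the linking page is among the first `cutoff` result URLs."""
    return linking_url in result_urls[:cutoff]
```

A link would be flagged as Bad when `shares_keywords(...)` is false, and as a probable Bad neighbor when `in_top_results(...)` is false for the keyword query.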




Your approach is going to have a lot of problems.

One of the most linked-to pages on the internet is the download page for Adobe Reader. It is definitely not spam, but millions of those links aren't going to have "the keyword" on the page, so by your logic they are bad links. This is an extreme example, but it is not an uncommon scenario.

Furthermore, if you have millions of backlinks, it becomes quite difficult to scrape Google (but you can use services like Authority Labs).


Why are you doing Bad Neighbor link checks if you make something like Adobe Reader?

You don't have to scrape Google; they have an API for Search that is about $10 per 1000 calls at volume.

I have done this for BILLIONS of links.


You think Adobe never built any bad links? I know many such large companies that are spending hundreds of thousands of dollars a year or more buying links.

Do you have a link to the API, please? Thanks!


Custom Search. Don't specify any rules that would interact with the results you are testing against (like adding a rule that favors a parked domain).
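
Something like this could build the requests for a top-50 check. It is only a sketch, assuming the Custom Search JSON API endpoint and its `key`, `cx`, `q`, `num` and `start` parameters, with at most 10 results per call (so 5 calls for 50 results). The API key and engine ID are placeholders, the helper names are made up, and actually fetching the URLs is left out.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Custom Search JSON API.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_urls(api_key, engine_id, keywords, wanted=50):
    """Build one request URL per page of 10 results, covering `wanted` results."""
    query = " ".join(keywords)
    urls = []
    for start in range(1, wanted + 1, 10):
        params = {"key": api_key, "cx": engine_id, "q": query,
                  "num": 10, "start": start}
        urls.append(ENDPOINT + "?" + urlencode(params))
    return urls


def result_links(response_body):
    """Extract the result URLs from one JSON response body."""
    data = json.loads(response_body)
    return [item["link"] for item in data.get("items", [])]
```

Concatenating `result_links(...)` over the pages gives the list of result URLs to check the linking page against.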

If they are buying those links then finding the bad ones is as easy as contacting the people they cut checks to.



