
You are looking for a distance metric on strings. If d(s1, s2) = 0, where s1 and s2 are strings, then they are equivalent.

"Łódź" may be the same as "Lodz" to you, just as "komputor" may be the same as "computer" to a Russian speaker. To someone else, the names "Anderson" and "Andersson" are equivalent. Now you see the problem: which strings count as equal depends on who is asking, so exact matching is futile. Use fuzzy matching instead, such as normalized Levenshtein distance, and rank the results by similarity.
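To make that concrete, here is a rough Python sketch of normalized Levenshtein distance with ranking. The function names (levenshtein, normalized_distance, rank) are my own, and dividing by the longer string's length is just one common normalization, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1] by the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def rank(query: str, candidates: list[str]) -> list[str]:
    """Order candidates from most to least similar to the query."""
    return sorted(candidates, key=lambda c: normalized_distance(query, c))


print(rank("Anderson", ["Andrews", "Andersson", "Anderson"]))
```

This compares code points one by one, so it knows nothing about accents or transliteration; "Anderson" and "Andersson" come out one edit apart, but whether that is close enough is still your call.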

Even that is not enough if you want to support non-Latin alphabets, because they have different ideas about what a character is, but it should get you started.
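One small illustration of the "what is a character" problem, using Python's standard unicodedata module. This only covers combining marks, which is a narrow slice of the issue, and strip_marks is a helper name I made up for the example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (accents and similar)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "\u00e9"     # "é" stored as one code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two render the same but compare unequal until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# "Łódź" written with escapes; the accents on ó and ź are stripped,
# but Ł has no canonical decomposition, so it survives unchanged.
print(strip_marks("\u0141\u00f3d\u017a"))
```

Normalizing before computing an edit distance at least stops the same visible text from counting as different strings, but folding "Ł" to "L" still needs an explicit mapping or a transliteration library.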




I shouldn't have to implement this from scratch, though. It's not like this problem is unique to my program; programming languages should have some support for solving this kind of problem in their standard libraries, or at least in a readily available library.
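For what it's worth, some standard libraries do ship something in this direction. Python's difflib has get_close_matches, which ranks candidates by a SequenceMatcher similarity ratio (not Levenshtein distance). A minimal sketch, where the suggest wrapper and the sample names are just for illustration:

```python
import difflib


def suggest(query: str, names: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` names similar to the query, best match first."""
    # cutoff=0.6 is difflib's default; matches scoring below it are dropped.
    return difflib.get_close_matches(query, names, n=limit, cutoff=0.6)


names = ["Andersson", "Andrews", "Henderson", "Smith"]
print(suggest("Anderson", names))
```

It is case-sensitive and does no Unicode folding, so you would still normalize the input yourself before calling it.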



