I think the children's puzzler "Fuzzy Wuzzy was a bear, but Fuzzy Wuzzy had no hair. So Fuzzy Wuzzy wasn't fuzzy, was he?" is better known, despite the selection-biased votes on Urban Dictionary.
Not in the UK. I don't think the tongue-twister is well known here, but the phrase is used in the other sense in the TV show Dad's Army[1], which is still shown on the BBC. However, it's only used by one character[2], who's supposed to be 70 at the time World War II starts. So, I think it's a very outdated expression, and not likely to cause offence when used in the context of a programming library. I've certainly never heard the expression used other than in the TV show (where, as far as I know, it's still not censored - it's understood that the character's views were outdated even at the time the show was set).
The term is from a Kipling poem[3]:
"A derogatory term for a black person, especially one with fuzzy hair... From... one of Rudyard Kipling's... poems, written in 1918. The poem is in the voice of an unsophisticated British soldier and expresses admiration rather than contempt, although expressed in terms that sound patronizing today."
I did something similar for matching products in a Yahoo! store against products in an Amazon merchant account.
I had a set of products from Yahoo! that needed their equivalent product in a set of products from Amazon. I indexed all the Amazon products into Xapian and let the search functionality do its magic by using the Yahoo product title as the search keyword. It also had a scoring mechanism and worked flawlessly for my needs.
While reading this article I started laughing in amazement (if that is even possible).
It is delightful to discover something you knew you wanted which is delivered to you free, courtesy of others.
I can remember the pain of doing this as a first-year intern at a sporting odds aggregation site. The biggest challenge was dealing with invalid XML and non-standardized naming schemes: Montreal Canadians, The Habs, etc.
Our eventual solution was to use a trained matcher, but obviously it was not ideal since human intervention was required :(
Yeah completely non-standard names (like nicknames, abbreviations, acronyms) are a real pain to deal with, and string matching just completely fails on them. We (seatgeek) handle it the low tech way -- a giant list of name aliases that we run through during pre-processing. Not exactly worthy of a blog post, but it does the job well enough.
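For anyone wondering what the alias-list approach above might look like, here is a minimal sketch. The alias table and the `normalize` helper are hypothetical and only for illustration; the actual list and pre-processing code are not shown in this thread.

```python
# Hypothetical alias table mapping known variants to one canonical name.
# The entries are made up for illustration only.
ALIASES = {
    "the habs": "montreal canadiens",
    "habs": "montreal canadiens",
    "montreal canadians": "montreal canadiens",
    "ny yankees": "new york yankees",
}


def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to a canonical name."""
    key = " ".join(name.lower().split())
    # Names without an alias entry pass through in their cleaned-up form.
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for raw in ["The  Habs", "Montreal Canadians", "Boston Bruins"]:
        print(raw, "->", normalize(raw))
```

Running the normalizer before any string matching means the fuzzy step only has to cope with small spelling differences, not with nicknames it could never guess.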
I did something similar, in Oracle and PostgreSQL, for a governmental entity. Its main purpose was data fusion, where a set of not-so-dissimilar records in several heterogeneous data sources represented the same person. It was fun because of the concepts involved, but not so much because of the syntactic sugar of the SQL involved.
This looks pretty awesome. I remember when I was thinking about making a quote website back in college. I had just learned about the Levenshtein distance algorithm in a class and was excited about finding a real-life (read: non-contrived) scenario to apply it to.
Anyway, this looks like a really useful library. Glad it's freely available.
Whoa, this looks really useful. Is there a gem for Ruby that does this? I've just been doing the first 'String Similarity' step using Levenshtein distance.
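For reference, the Levenshtein distance mentioned above is small enough to write out by hand. This is a sketch of the textbook dynamic-programming version in Python (the language of the library under discussion), not code taken from the library itself; the function name `levenshtein` is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

It only keeps two rows of the table at a time, so memory use grows with the shorter string rather than with the product of both lengths.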
> fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
From what I can see, this will also give 100 for 'NEW', 'KEES', 'YANK' - all of which could mean something completely different. How do they deal with this?
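To make the concern concrete, here is a rough sliding-window sketch of a partial ratio built on the standard library's `difflib`. It is an assumption about the general idea (score the shorter string against the best same-length window of the longer one), not the library's actual implementation, and `partial_ratio_sketch` is a made-up name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best similarity (0-100) of the shorter string against any
    same-length window of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    best = 0.0
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


if __name__ == "__main__":
    for query in ["YANKEES", "NEW", "KEES", "YANK", "METS"]:
        print(query, partial_ratio_sketch(query, "NEW YORK YANKEES"))
```

Under this sketch any exact substring scores 100, so 'NEW', 'KEES' and 'YANK' are indistinguishable from 'YANKEES' on the score alone, which is exactly the ambiguity being asked about.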
Context. We also know dates and times - more or less at least, there may be some conversion to UTC if necessary - as well as other information about the event - categories, locations etc.
On occasion there are false positives, in which case Our algorithm is the Borg. They will be assimilated. Their grammatical and syntactical distinctiveness will be added to our own. Resistance is futile.
Looks pretty useful! I wonder if a simple application of TF/IDF could improve the results by giving you better token weights. (Then you'd have to be comparing token sets, of course, rather than strings.)
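A rough sketch of what that could look like, assuming whitespace tokenization and a smoothed IDF of `log(N / df) + 1`. The function names and the tiny corpus are made up for illustration, and this is only the IDF half of TF/IDF, since tokens rarely repeat within a single name.

```python
import math
from collections import Counter


def idf_weights(names):
    """Smoothed inverse document frequency for every token in a list of names."""
    n = len(names)
    df = Counter(tok for name in names for tok in set(name.lower().split()))
    return {tok: math.log(n / count) + 1.0 for tok, count in df.items()}


def weighted_jaccard(a, b, idf):
    """Jaccard similarity of two token sets, with each token weighted by its IDF."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    # Assumption: tokens never seen in the corpus are treated as maximally rare.
    default = max(idf.values(), default=1.0)

    def weight(tokens):
        return sum(idf.get(tok, default) for tok in tokens)

    union = weight(ta | tb)
    return weight(ta & tb) / union if union else 0.0


if __name__ == "__main__":
    corpus = ["new york yankees", "new york mets", "new york giants", "boston red sox"]
    idf = idf_weights(corpus)
    for query in ["yankees", "new"]:
        print(query, round(weighted_jaccard(query, "new york yankees", idf), 3))
```

With unweighted token sets both queries would score 1/3 against "new york yankees". With the weights, the rare token "yankees" counts for more than the common token "new", which also helps with the substring ambiguity raised elsewhere in this thread.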
See top entry: http://www.urbandictionary.com/define.php?term=fuzzy+wuzzy
Not to take anything away from the tech - that looks awesome and I can already think of a few uses for it.