How to Write a Spelling Corrector (norvig.com)
90 points by lpsz on Dec 14, 2014 | 20 comments



This is posted here about every three months.

Norvig's approach assumes a very small fixed vocabulary (so it won't scale well to the basic multilingual plane), and can only recover errors with an edit distance of two or fewer.

If you are interested in a vastly superior yet still very intuitive approach to spelling correction, I recommend the classic paper on the noisy-channel model of spelling correction by Kernighan, Church, and Gale, which fuses a probability distribution over possible errors with a probability distribution over the words themselves.

http://www.aclweb.org/anthology/C90-2036.pdf
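
For the curious, the core scoring is tiny. Here is a minimal noisy-channel sketch (my toy example, not the paper's actual estimates): pick the candidate c maximizing P(c) * P(t | c), where P(c) comes from corpus counts and P(t | c) from an error model. The error model below is a crude placeholder; Kernighan et al. estimate it from confusion matrices.

  from collections import Counter

  WORDS = Counter("the quick brown fox the lazy dog the end".split())
  TOTAL = sum(WORDS.values())

  def p_word(c):
      """Language model: relative frequency of the candidate word."""
      return WORDS[c] / TOTAL

  def p_typo_given_word(t, c):
      """Toy channel model: a stand-in for the paper's confusion-matrix
      estimates. Penalize by a constant factor per differing character."""
      diffs = sum(a != b for a, b in zip(t, c)) + abs(len(t) - len(c))
      return 0.01 ** diffs

  def correct(t, candidates):
      return max(candidates, key=lambda c: p_word(c) * p_typo_given_word(t, c))

  print(correct("teh", ["the", "tea", "ten"]))  # -> 'the'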


I actually think Norvig is making a totally different point than you are. I also think (with respect) that it's a much more important point to learn.

Where your point is that there is a better but almost-as-intuitive approach, Norvig's point is basically that a naive application of probability theory is often dramatically easier to implement, and also "good enough" when compared to a (usually much more complicated) rule-based implementation.

That's really a great lesson to learn. It's (IMHO) much less valuable to learn that there are better approaches out there -- of course there are.


Do not write a spell checker like this! I will hunt you down and beat you with a dictionary.

I find spell checking in almost every application that I've tried to be spectacularly bad for anything other than trivial typos. Google is quite good, but it is leveraging data not always available in other applications and is a special case optimized for search applications. Spell correcting for general writing is somewhat different in requirements and optimization vectors.

Spelling errors that can be classified in terms of edit distance are far better characterized as "typos", not spelling errors. A typo is a transcription error, whereas incorrect spelling is a representation error. Both can be present in a particular word, but they are different phenomena.

People make incorrect spelling choices when they intentionally choose the wrong letters to represent phonemes, or apply spelling rules incorrectly.

A better approach in all respects is to do what's often referred to as a grapheme-to-phoneme transformation, which is basically to `compile` a word into a smaller set of characters representing the phonemes of the language. With the reduced set of symbols, which in and of itself eliminates many trivial typos, statistical models can perform better, and faster. Further, unknown words can often be corrected via the corrected spelling of their phonemes.
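
To make that concrete, here's a deliberately crude sketch. The rewrite rules are invented for illustration (real systems use something like Metaphone or a trained G2P model), but they show how distinct misspellings collapse to one phoneme-like form:

  import re

  # A handful of made-up rewrite rules; a real G2P system uses a far
  # richer rule set or a trained model.
  G2P_RULES = [
      (r"ph", "f"), (r"ck", "k"), (r"qu", "kw"),
      (r"c(?=[eiy])", "s"), (r"c", "k"), (r"[aeiou]+", "a"),
  ]

  def to_phonemes(word):
      """'Compile' a word down to a crude phoneme-like representation."""
      s = word.lower()
      for pattern, repl in G2P_RULES:
          s = re.sub(pattern, repl, s)
      return s

  # "phonetic" and "fonetik" collapse to the same symbols, so a match at
  # the phoneme level catches what letter-level edit distance misses.
  print(to_phonemes("phonetic"), to_phonemes("fonetik"))  # fanatak fanatak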

Why most spelling checkers fail to do grapheme-to-phoneme transformation, when it is easy, reduces the symbol set for statistical analysis, and is demonstrably more accurate (with the possible exception of Indian languages), is beyond me. It's like watching people try to solve dehydration with a "better(TM)" formula for Coke. You're thinking about this problem wrong, just drink some water.


I had always assumed that spellcheckers took into account likely mistypes based on keyboard layout when working on spelling correction.

For example, "RRROR" (error) could easily have been me fat fingering the keys, whilst "MRROR" seems far less likely unless I have truly gargantuan fingers or a really really small keyboard. So you might infer I'd missed the "i" instead of mistyped the "e".

Is that ever used as part of statistical analysis?


A "confusion matrix" is the statistical tool you're looking for. An entry (i, j) is a score for how often the i-th character is mistakenly typed (inputted) as the j-th character. This can easily be adapted to various input devices.
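
As a sketch, you can even bootstrap a cheap one from keyboard geometry alone. The costs below are invented for illustration; real systems estimate the entries from typo logs:

  QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

  def key_pos(ch):
      for r, row in enumerate(QWERTY_ROWS):
          if ch in row:
              return (r, row.index(ch))
      return None

  def sub_cost(a, b):
      """Cheap stand-in for a learned confusion matrix entry (i, j):
      neighboring keys are cheap to confuse, distant keys expensive."""
      if a == b:
          return 0.0
      pa, pb = key_pos(a), key_pos(b)
      if pa and pb and abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
          return 0.5   # adjacent keys: likely a fat-finger slip
      return 1.0       # distant keys: an ordinary substitution

  print(sub_cost("e", "r"), sub_cost("e", "m"))  # 0.5 1.0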


Probably, if the spell checker owns the keyboard (e.g. an Android keyboard app), and probably not in most desktop ones. And even if a desktop one does use that, it's probably assuming a layout and missing, giving worse results.


Why not make the spellchecker keyboard-aware? Just have it query 'xkb-switch' or 'xkblayout-state print "%s"' or whatever other API is available.

Doesn't any spellchecker use keyboard-layout-aware Damerau–Levenshtein distance metrics? https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di... doesn't mention keyboard-layout-aware variants...
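
I'm not aware of one either, but bolting layout awareness onto the textbook algorithm is straightforward: take the standard restricted Damerau-Levenshtein DP and make the substitution cost pluggable, e.g. fed by a keyboard-adjacency confusion matrix like the one sketched above. Roughly:

  def damerau_levenshtein(s, t, sub_cost=lambda a, b: 0.0 if a == b else 1.0):
      """Restricted Damerau-Levenshtein with a pluggable substitution cost."""
      m, n = len(s), len(t)
      d = [[0.0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          d[i][0] = float(i)
      for j in range(n + 1):
          d[0][j] = float(j)
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              d[i][j] = min(
                  d[i - 1][j] + 1,                                 # deletion
                  d[i][j - 1] + 1,                                 # insertion
                  d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
              )
              if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                  d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)      # transposition
      return d[m][n]

  print(damerau_levenshtein("teh", "the"))  # 1.0: one transposition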


It's not nearly as simple as it sounds in theory. Consider that you have typos that come from hitting the wrong key, and also typos from hitting the next letter before the previous one. And when looking at typos, you'd likely find statistically significant differences between typos that hit the letter to the left versus the right, or in the probability of a typo given the letter's general position within the layout, etc.


> Just have it query 'xkb-switch' or 'xkblayout-state

...said the hundreds of other devs who now have to handle APIs for N OSes * M versions * X custom IME input solutions out there.


This: http://blog.faroo.com/2012/06/07/improved-edit-distance-base...

is an interesting attempt at fast spellchecking, for anyone that's interested.


The tech blogger's worst sin: not dating your article. This article starts out,

  In the past week, ...
but then it leaves no clue whatsoever as to when it was written. That mistake is made by almost every blogger out there. We programmers are the worst offenders when it comes to accelerating the pace of technological change. Yet our blog articles leave no hint as to their date, with rare exceptions.

I do submit it is VERY bad form to post undated blogs to the interwebs, even more so in the realm of software technology.

As always, you are cordially invited to vote me down generously for not wearing rose-tinted glasses.


> Please don't bait other users by inviting them to downmod you.

https://news.ycombinator.com/newsguidelines.html


Thanks for the link, and the reminder. Sometimes, temptation is just too hard to resist.


I work with OCR data and rankings of likely matches ('fuzzy search'), which is closely related.

Yes, it's a neat demo of the general idea, but really, really naive. As the page states, edit distances are restricted to 1 or 2, and that thing isn't exactly fast or efficient. Look at it, think about it, don't copy it. :)


Can you provide a side by side comparison with Google's spellchecker?


This is a very old page, and it uses a very basic approach: it's highly computationally expensive to support edit distances beyond 2 (exponentially expensive in the distance), so the word length has to be kept small. This makes phraseslikethishardtobreakup.
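
To put rough numbers on that (a back-of-the-envelope count, assuming Norvig's 26-letter alphabet): his edits1 generates 54n + 25 candidate strings for a word of length n, and each additional edit multiplies the candidate set by roughly that factor again.

  def edits1_count(n, alphabet=26):
      """Candidates at edit distance 1 for a word of length n, counting
      Norvig's four operations: deletes, transposes, replaces, inserts."""
      return n + (n - 1) + alphabet * n + alphabet * (n + 1)  # = 54n + 25

  for n in (5, 10, 20):
      e1 = edits1_count(n)
      print(n, e1, e1 * e1)  # distance 2 is roughly the square of distance 1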

Today's Google spelling correction uses vastly more sophisticated approaches, with gobs of data at its disposal.


Have you or anyone else thought about, or imagined, applications for a spell checker that are not language-based spell checking?


Not quite sure if this is what you want, but the noisy channel model (call it "Bayesian" if you must) spell checker I linked to above is inspired by the classic approach to speech recognition (and codebreaking, and machine translation, and natural language parsing).

Norvig's algorithm essentially computes a censored form of Levenshtein distance. Dynamic programming algorithms for computing Levenshtein distance play a major role in gene sequencing, and also pop up in time series form as dynamic time warping.
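
For a flavor of the time-series version, here's a minimal dynamic time warping sketch: the same DP table as Levenshtein, but the cost comes from aligning numeric samples rather than matching characters.

  def dtw(a, b):
      """Dynamic time warping distance between two numeric sequences."""
      inf = float("inf")
      d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
      d[0][0] = 0.0
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              cost = abs(a[i - 1] - b[j - 1])
              d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                   d[i][j - 1],      # stretch b
                                   d[i - 1][j - 1])  # step both
      return d[len(a)][len(b)]

  print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: same shape, different speed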


Does anyone else get depressed when they see Norvig programming in Python instead of Lisp?


No. He has explicitly said that he does this because he thinks Python makes better pseudocode, and is easier for beginners to understand, as empirically measured by the questions he got from students of his AI books.

I understand your heart was probably in the right place, but honestly, your tone here is kind of hard to take.



