I'm particularly interested in this one, and I'm curious how useful something like this would be in practice.
> The second does not use a dictionary at all (Morris & Cherry 1975). Like the previous method, it divides the text into trigrams, but it creates a table of these, noting how often each one occurs in this particular piece of text. It then goes through the text again calculating an index of peculiarity for each word on the basis of the trigrams it contains. Given pkxie, for instance, it would probably find that this was the only word in the text containing pkx and kxi (and possibly xie too), so it would rate it highly peculiar. The word fairy, by contrast, would get a low rating since fai, air and iry probably all occur elsewhere, perhaps quite often, in the passage being analysed. Having completed its analysis, it draws the user's attention to any words with a high peculiarity index. Like the previous method, it would fail to spot a high proportion of ordinary spelling errors, but it is quite good at spotting typing errors, which it was designed for. An advantage that it has over all dictionary-based methods is that it is not tied to English; it will work on passages of, say, French, German or Greek.
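The quoted method is easy to sketch. Below is a rough, hypothetical version of the idea: score each word by the mean rarity of its trigrams within the document itself. The exact peculiarity index Morris & Cherry used differs; this just uses a mean negative log-frequency as a stand-in.

```python
from collections import Counter
import math

def trigrams(word):
    padded = f" {word} "          # pad so word boundaries count too
    return [padded[i:i+3] for i in range(len(padded) - 2)]

def peculiarity(text):
    """Rank the document's words from most to least peculiar,
    based only on trigram statistics of this text (no dictionary)."""
    words = text.lower().split()
    counts = Counter(t for w in words for t in trigrams(w))
    total = sum(counts.values())

    def score(word):
        # rare trigrams push the score up; common ones pull it down
        ts = trigrams(word)
        return sum(-math.log(counts[t] / total) for t in ts) / len(ts)

    return sorted(set(words), key=score, reverse=True)
```

Run on a passage containing "pkxie", that word floats to the top because every one of its trigrams occurs exactly once, while "fairy" sinks because fai and air recur elsewhere.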
Great reference. This highlights an interesting property of 'reasonably-solved' problems like spell checking and why they might be so much more straightforward now.
It's often not that we've developed groundbreaking algorithms that make solving the underlying problems intrinsically easier - the techniques we're using (like ngram modeling, in this case) may be the result of research work decades ago.
Instead, the difference is that the fruit of that prior work has been implemented and made available in more accessible forms (libraries and source code), becoming easier to reuse, reason about, and modify thanks to abstraction and languages that have evolved to express similar problems more naturally.
(upgrades in hardware and resources certainly help advancement too, but spell checking's probably a good example of a situation where an efficiently-designed implementation's likely to be noticeably more responsive, whether it's 1980 or 2020)
From what they said above, it sounds like that's exactly the kind of thing it would catch unless you consistently wrote "ture" and "flase" many times in the same document
> From what they said above, it sounds like that's exactly the kind of thing it would catch unless you consistently wrote "ture" and "flase" many times in the same document
The trigrams tur and ure are not that uncommon, nor fla or ase (though maybe they are less common in a programming context, but even then it might depend on the domain).
I have abbrev-mode in Emacs for that. No more "teh" instead of "the". I found most of them went away when I switched to Dvorak, since most of my errors like that happen when the letters are all on one hand. "teh", however, became the all-time high: 4000 corrections this year according to abbrev-mode. And English isn't even the language I write the most.
If you just want to fix typos then it seems to me that Damerau-Levenshtein distance would be the best approach - a DL distance of 1 from a dictionary word is a high likelihood of a transposition or a missing or extra character.
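For illustration, here's a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance, which counts insertions, deletions, substitutions, and adjacent transpositions:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + cost)  # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Typical keyboard slips like "teh", "ture" and "flase" all sit at distance 1 from their dictionary words, which is what makes this metric so handy for typo correction.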
I've thought of a variation on that, if you're spell checking a large text file. Create an index of all the unique words in the file, along with how many times each appears. Any words with a count of 1 are likely to be misspelled.
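That variation is a few lines of Python. A sketch (with the obvious caveat that it also flags rare-but-correct words, which is the same weakness the trigram method has):

```python
import re
from collections import Counter

def suspicious_words(text, min_count=2):
    """Flag words that occur fewer than min_count times.
    In a long document, genuinely used words tend to repeat,
    so one-off spellings are good candidates for typos."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [w for w, n in counts.items() if n < min_count]
```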
It'd probably be better to look for words with a small Levenshtein distance that look like typos, because people tend to misspell some words repeatedly, especially in larger texts.
A while ago, I was considering building a text editor that ran SQL for a very specific purpose at work. I mused about adding a syntax checker/suggester. I wonder if the above approach might be better than a lookup table.
In my first job I worked for Tasman in Leeds and produced a Word Processor for IBM PC compatibles in 8086 assembler with some help, and then a spelling checker.
For the spelling checker I did a whole load of analysis on a 70,000 word list from Collins and produced a list of tokens to represent common strings of letters. However, in the end I really had to cut the original word list down to get the whole thing onto a single 360K floppy.
After I left Tasman, I was lying in bed one night still thinking about it and realised where I had gone wrong. The tokenising thing, which someone else had put me onto, had blinded me. I had stared at word lists for months and hadn't pinned down the obvious pattern. All but 26 words in the 70,000-word list share the bulk of their characters with the word before them.
So the solution is to use 5 bits of the first byte as a count of characters shared with the previous word, while the other 3 bits indicate common endings (ship, s, ing) or that the following bytes are tokens for the rest of the word. With this I got the word list compressed to less than 2 bytes per word.
I took this back to Tasman. They put all 70,000 words and the spelling checker onto a 175K floppy for the ZX Spectrum +3.
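The core of that trick is what's often called front coding. Here's a stripped-down illustration of just the shared-prefix part (no ending codes or letter tokens, and using a tuple per word rather than packed bits):

```python
def front_encode(sorted_words):
    """Encode each word as (length of prefix shared with the
    previous word, remaining suffix)."""
    out, prev = [], ""
    for w in sorted_words:
        n = 0
        while n < min(len(w), len(prev)) and w[n] == prev[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

def front_decode(encoded):
    """Rebuild the word list by reusing each previous word's prefix."""
    words, prev = [], ""
    for n, suffix in encoded:
        word = prev[:n] + suffix
        words.append(word)
        prev = word
    return words
```

On a sorted dictionary, most suffixes are only a character or two, which is how a 70,000-word list can get down to a couple of bytes per word once the counts are packed into spare bits.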
In comparison, here's a quote from the OP's blog entry:
“Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it. Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed. The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
That's progress.”
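For what it's worth, the "3-5 lines" claim checks out; a minimal version looks something like this (assuming /usr/share/dict/words exists, as it does on most Unix systems):

```python
def load_dictionary(path="/usr/share/dict/words"):
    """Load a newline-separated word list into a set for O(1) lookups."""
    with open(path) as f:
        return {line.strip().lower() for line in f}

def check(text, dictionary):
    """Return the words in text that are not in the dictionary."""
    return [w for w in text.lower().split() if w not in dictionary]
```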
But is a simpler, less efficient method progress? Sure it allows more words to be added/removed with ease, and I don’t want to advocate over-optimization, but the solution you made for the Spectrum seems better because words don’t change much. Why don’t we use a similar specialized hash and compressed dictionary format to increase spellchecking speed and allow more words in less space? We could still produce that format using /usr/share/dict/words and similar.
Problems tend to have more than one solution. GP's solution should be documented, yes, but the alternate solution that won out was computers being capable of storing a million words or so in plaintext very easily, and doing the same using their compression scheme just isn't really worth the space saved nowadays.
Also, compressing could actually be slower for modern computers. Remember when compressing your hard disk made your PC faster, up until disks became faster, then it actually made it slower?
Today's CPUs are very fast, so the trend could have flipped again, that would be an interesting benchmark.
How many operations and objects? The method he's talking about would seem more efficient for the purpose than the plain hash version, where all of the strings are still created even if never used.
They could have been non-copyrighted at the time. Database rights didn't apply in the UK until 1998, so I think it would be fine to have an intern type in just the words into the computer and not infringe on anything.
I don't know/remember the terms under which this happened. I do remember that when I was trimming it down I found the list contained the trademarks of a competitor. These were removed.
In the case of the ZX Spectrum +3 there wasn't enough disk or memory to hold the uncompressed word list, never mind some form of tree structure.
The search method involved uncompressing each word (one at a time) until either the word was found or one that should follow it.
However, I did intend to use a binary search of the word list on the floppy. I arranged it so any zero bytes in the list indicated the start of a word from which decompression could start. Under maximum compression there would be only 26 zero bytes in the list, but by selecting short words at regular intervals I could sprinkle zeros throughout the list (approx. 1 per block). A binary search could scan for the zeros, decompress the associated word and find the section that should contain the target word.
Tasman didn't go for this. They sorted all the words to be checked in memory, then opened the file and did a complete scan.
A good spell checker is still a hard engineering problem, despite the hardware progress.
Just a hash map ain't gonna work.
The only reason spell checking is perceived as a solved problem is availability of libraries.
Here's an open source example https://github.com/hunspell/hunspell way above 10k lines of code.
I speak 4 languages, and in my experience what's in Microsoft Office is the best one I used so far.
I suppose it is worth giving a shout-out to a recent Hunspell port to Python by Zverok: https://github.com/zverok/spylls, this description from GitHub sums it up nicely:
> Hunspell is a long-living, complicated, almost undocumented piece of software, and it was our feeling that the significant part of human knowledge is somehow "locked" in a form of a large C++ project. That's how Spylls was born: as an attempt to "unlock" it, via well-structured and well-documented implementation in a high-level language.
It's incredible how much work has been done (along with documenting algorithms!) in this one-man project.
Exactly. It's a very important argument against calling it a solved problem. I believe the problem with Hunspell is that it's very hard to get into fixing it (it's good, but it's not perfect), and spylls makes that feasible again.
To be fair, Hunspell was written specifically for Hungarian because it's such a difficult language to handle efficiently in a spell checker. It just happens to also work for other languages.
If you only need English, the complexity of Hunspell is not required.
I still haven't found a decent Hungarian spell checker, they get confused by rare words that have the same letters as a very common word, but different accents.
E.g.: The word "és" means "and", which makes it much more common than "es", which is the word root for "fall" and is rarely used without a suffix like "esett" or "estek".
I have yet to see a spellchecker that's smart enough to fix "something es something" but not "le es".
First of all, it's a verb with an -ik ending, so you can't write "es"; the root lexical word is "esik".
Second, you never use anything like "le es" because generally you always have to use the verb and prefix together ("leesett", "leesik" etc.), unless you use a commanding form ("ess le") or a modal verb ("le akart esni", "le fog esni" etc.)
"le es" is strictly wrong because it's the wrong verb (it should be "esik") and you have to write them together ("leesik"), so obviously a spellchecker picks that up.
If French is one of your languages, the state of the art has been Druide's Antidote for more than 20 years. And it has had a Linux version available for almost 15 years, too.
I would argue that it's not a hard engineering problem. It's just a problem that should be solved. Linguists have formalized a lot about language. There are rules, there are known lists of exceptions, there are dictionaries. You just have to implement those things carefully with lots of edge cases, and implement them for every language, but it's just mundane engineering work, nothing that requires a breakthrough.
Currently spell checking just checks words against a dictionary. That's 1% of the work, or even less.
Microsoft Word used to have awesome spell checking, grammar checking, etc. back in 2000 for the Russian language. It seems to have degraded since then. But the fact is, this problem was solved. It checked spelling, grammar, and punctuation. I would expect that kind of functionality to be working in every OS textbox by now.
>You just have to implement those things carefully with lots of edge-cases,
Say hello to homophones, ambiguous phrases and double meaning.
In Spanish a lot of verbs share the singular first and third persons. "Estaba" can mean "I was" or "He/she was.". Or "fuimos", (we were); but also "we went".
Google's spell checker still is a major feat of software engineering ;-) It must handle every language, handle new words constantly being coined, have low enough latency and high enough throughput to run it on every web search. (Well, caching probably solves 2/3 of traffic.)
While it's not bad, it's not that great a universal tool either. My native language makes use of a lot of compound words and Google's spell checker often gets confused when I combine words according to the standard grammar.
I can see and understand the technical limitations, but tools like Microsoft Word seem to do a much better job than Google's spell checker, even in things like Google Docs.
Google search will often suggest splitting up words and sometimes even does it transparently, which can give entirely wrong results because suddenly Google matches words across a sentence instead of specific compound words. It's kind of frustrating to have to resort to quotation marks for some single-word search terms.
I get the feeling Google's spell checker doesn't check spelling, it just tweaks the input until it manages to find more results. Not quite the same, because a lot of "fixes" often have entirely different meanings in my experience.
Yeah, exactly! - this is why building the perfect spell checker is still a challenging and interesting engineering problem, in 2020. The nature of misspelling varies from language to language, especially with non-letter-based languages like CJK, or with input methods that lead to different sorts of typos than a regular keyboard.
Out of curiosity, what is your native language, Dutch? Can you give an example of something that Google’s spell checker screws up?
Not all spellcheckers get it wrong, but almost all that rely on statistics will: hoofdaannemer (or hoofd-anything) will occur far less often than hoofd or aannemer.
This, I fear, is shaping our language use. As people rely on tech, they'll follow the suggestions, and wrong use of spaces will become an actual problem.
This exactly. I've often wondered if the default spell checkers that come with phones nowadays are slowly changing the language itself. If Google doesn't fix these issues in the nearish future, an entire generation will grow up with a device that tells them the spelling they learnt in school is wrong.
And with so many people just accepting Google's anti-compound spelling as correct, many people will be mostly exposed to a version of their language with more spaces, and it'll start to "feel" right to them.
Well, yes, it has to work that way to handle searches for new companies that chose their name based on some non-word with an available domain. If someone types "nvidea" they probably mean "nvidia", and Google doesn't handle that with a dictionary. Same for the handles of social media celebrities.
I dunno, really. It is much better than Word for English, but it is laughable when I jumble several languages in a single document, and it doesn't handle some languages that rely on code-switching. At least Word knows its boundaries.
Their Google Search spellchecker is good, but it's ironic that their Android keyboard spell checker is quite literally the worst I have ever used. Years in, and it's maybe 5% as accurate as aspell.
I feel it's more statistics based than true spellchecking? At least I often feel that it wants to rewrite my sentences to something completely different because what I'm searching/writing is uncommon but similar to a more common word.
It's a nice demo and a good tutorial if you are interested in spell checkers. But IMO it did more harm than good. A lot of libraries started to implement it, but its performance is quite horrible.
...and yet the first thing I switch off in every tool is the spellchecker, because they get utterly confused with the mix of English, German and my local dialect that I'm using when communicating with different people. I'd say doing a spellchecker "right" is apparently still a major feat of software engineering ;)
I wish it just struggled with the mix of multiple languages, it can't even handle my native one. A plural compound word in translative case with a clitic? I can only dream about it, it can't even do all the cases reliably.
I guess what I'm wondering is how obscure it has to get (or perhaps how low the GDP of the people speaking a language has to be) before custom spell checking rules aren't considered worth it to bigcorps anymore. Though I'm also kinda interested in what this weird language thing is you're talking about.
Microsoft has an okay Finnish spellchecker. It's really a shame no Finnish university has developed an acceptable open source Finnish spellchecker though AFAIK.
Word-list based checkers actually work for English, which gives an impression that they would work for all types of languages. For agglutinative languages (which I think is about 50% of world languages) not so much, since the list of valid words is basically infinite.
But combined with a high-quality root word list, it's possible to write a program that determines whether a given input word is a valid form of some known root word, i.e. a spellchecker.
In English, you can think of the problem as similar to writing a spellchecker that can tell whether an input word is a valid chemical name for a molecule, like 1,3,7-trimethylxanthine (caffeine) or adenosine-5'-triphosphate (ATP). Clearly, we cannot write an exhaustive list of valid words here. But it's pretty easy to write a program that can tell whether the input word at least looks like a valid chemical name devoid of typos.
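A toy version of that "valid form of known roots" check, for compounding languages, is a recursive split against a root list. Real systems use proper morphological analyzers (finite-state transducers, affix rules); this sketch only handles plain concatenation, and the root list and minimum part length are made up for illustration:

```python
from functools import lru_cache

def make_checker(roots, min_part=3):
    """Return a function that accepts a word if it is a known root
    or can be split entirely into known roots (toy compound checker)."""
    roots = set(roots)

    @lru_cache(maxsize=None)
    def is_valid(word):
        if word in roots:
            return True
        # try every split into a known root plus a valid remainder
        return any(word[:i] in roots and is_valid(word[i:])
                   for i in range(min_part, len(word) - min_part + 1))

    return is_valid
```

With the Dutch example from upthread, hoofd + aannemer validates as a compound even though "hoofdaannemer" never appears in the word list itself.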
I often write in a mixture of Scots and English, and spellcheckers are nae guid for that use case. The problem is solved, however, on iOS and Android thanks to SwiftKey[1].
Yeah, to add to it: a lot of Germans use American English but some use British English, and my spellchecker can't handle the mixture. Off topic, but despite how difficult German has been to learn, I've found spelling in German to be far more predictable and straightforward than in my native language, English.
Right, I vaguely remember hearing about this. Another benefit is you usually have a fairly good idea how a word is pronounced, which is not the case for English or French in my experience. In some ways I guess American English attempts to standardise and fix some rough edges of English.
American English changes a few cases, but there are often exceptions (color, favor, but four, glamour; -ize, but still advertise, compromise etc).
There are too many differences for me to summarize (American and Oxford English spelling there), but I think the problems with English spelling are much deeper than American English can fix.
This is a rather bad article because it completely misses the real complexity of a spell checker.
A spell checker is not simply a list of words, it's a way to check mistakes according to a standard and to point towards ways to fix these mistakes. This not reducible to a look-up in a hashtable. It requires taking into account some complicated things about the definition of a word and the context in which it is written.
You might think that's grammar checking but the boundary is not clear and in any case, any language processing application starts with tokenizing and deciding what counts as words and on what basis.
What is a word, even? Is "CIA" a word? What about "C.I.A."? What about C (as in the language)? What about c (as in the speed of light)? 2,4-Dinitrophenylhydrazine?
How does the spellchecker handle dashes and apostrophes? What about proper nouns?
The feature creep is irrelevant to the article. In 1984, a spell checker could be just a list of words and would still be a major engineering challenge.
Arguably, detecting typographical (or transcription) errors is still non-trivial today since a) computing edit distances against a large dictionary is expensive and b) selecting the correct spelling often depends on grammar as well as semantic context.
For example, consider the erroneous phrase "he was put through the ringer." Although "ringer" matches a spelling in the dictionary, it doesn't make sense semantically (a "ringer" being a device that rings bells, a near-duplicate of something else, etc.) and the proper idiom is "put through the wringer" (since a wringer is/was a device to squeeze water out of a wet mop or wet laundry. Squeezing someone through a pair of rollers is particularly evocative.)
Although you do see "nerve-wracking" or "wracking" (i.e. wrecking) one's brain, the more traditional "racking" (literally to torture by stretching on a medieval rack) seems more appropriate (although the term "nervous wreck" is common.) Shakespeare may have exploited the pun of "wrack" vs. "rack," so perhaps we can also.
"Security breaches" and "security breeches" sound alike but have somewhat dissimilar meanings. Network and system administrators might consider donning the latter in preparation for the former.
Yes. Everything up until the last two sections is the prompt; then I let GPT-3 fill in the responses.
Non-standard English: Please provide me with a short brief of the design youre looking four and some examples or previous projects youve done would be helpful.
Standard American English: Please provide me with a short brief of the design you’re looking for and some examples or previous projects you’ve done would be helpful.
Non-standard English: If Im stressed out about something, I tent to have a problem falling asleep.
Standard American English: If I’m stressed out about something, I tend to have a problem falling asleep.
Non-standard English: There are plenty off fun things too do in the summer when you are able two go outside.
Standard American English: There are plenty of fun things to do in the summer when you are able to go outside.
Non-standard English: She didnt go to hte market.
Standard American English: She didn't go to the market.
Non-standard English: The database sever was updated.
Standard American English: The database server was updated.
Non-standard English: The next procedure is to sever the artery.
Standard American English: The next procedure is to sever the artery.
(In case you're wondering, if I provide 'server' as the input for the second case, it replaces it with 'serve'. Which is reasonable. I tried changing the wording to coax it into placing sever but didn't have much luck)
Indeed, I think most linguistic tasks could benefit from the neural (transformer) approach, even a tiny spellchecker. The combination of semi-regular rules, semantics and logic makes a highly effective "manual" solution too expensive to obtain (in terms of research and programming time).
I don't think I ever made that specific mistake, but (as a non native speaker) I spent a couple minutes staring at your example and was about to write a comment asking what mistake you were talking about before I spotted the typo.
Yes, at least for Levenshtein-style edits (substitution, insertion, deletion).
Although, I suppose a spell checker algorithm would be O(mnd) where d is the size of the dictionary, because it needs to compute O(mn) edit distance for each of the dictionary words.
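One standard way to avoid that O(mnd) scan is a BK-tree, which indexes the dictionary under the edit-distance metric and uses the triangle inequality to prune whole subtrees at query time. A rough sketch (with a throwaway Levenshtein implementation for the metric):

```python
def levenshtein(a, b):
    """Plain O(mn) Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j-1] + 1,           # insertion
                           prev[j-1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree over a metric: children are keyed by their distance to
    the parent, so a query can skip branches outside [d-k, d+k]."""
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # (word, {dist: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, max_dist):
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            for child_d, child in node[1].items():
                # triangle inequality: only these children can match
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

In practice this prunes most of the dictionary for small distance bounds, which is what you want for typo correction.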
This isn't in any way a criticism, but I found it really amusing that your post contained almost exactly the error you were describing ("hat") before correction.
Our ability to build simple spellcheckers has maybe improved, but our standards have risen as well. Back then it might have been amazing; nowadays it's boring and every program has it, even my Firefox browser.
A spellchecker is only a limited solution to the task of proofreading. I've never used the Grammarly product, but at least from the ads it seems it can catch many things that human proofreaders would catch. There might still be domain-specific things that require humans.
But the grammarly product has way more logic behind it than just using a simple table of words. No idea if it's parametric or uses an ML language model, or maybe a combination, but the feature set wouldn't be possible with just a table.
So instead of a lone employee being tasked to write a spellchecker, you suddenly have an entire industry focused on NLP. That's what the growth of the computing industry looks like :).
The article on another thread [1] brought up "Andy and Bill's Law":
"For every cycle a hardware engineer saves, a software engineer will add two instructions."
I thought of that while reading the last paragraph:
"Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being."
Spell chequer by Martha Snow
Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.
Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.
As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.
Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.
I especially like the spellchecker added to the D compiler. The neato feature is the "dictionary" is the part of the symbol table that is in scope. In my usage it guesses right about 50-75% of the time.
I've been considering adding one to my text editor. I found out I'm not as good a speller as I thought I was before spellcheckers :-/
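The "symbol table as dictionary" idea can be approximated in very little code. A hypothetical sketch using Python's stdlib difflib as the similarity measure (a real compiler like D's uses its own edit-distance pass over in-scope symbols):

```python
import difflib

def suggest_symbol(name, symbols_in_scope, cutoff=0.6):
    """Suggest the closest in-scope identifier for an unknown name,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, symbols_in_scope,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The nice property is that the "dictionary" is tiny and always relevant, which is why even a crude similarity measure guesses right so often.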
> I've been considering adding one to my text editor. I found out I'm not as good a speller as I thought I was before spellcheckers :-/
We all make mistakes and it’s okay, there is no shame in it! Let the computer fill in the gaps for you so you can use your brain for other things :) I shamelessly use a spell checker in my IDE and it has been a very positive experience, highly recommend!
My spelling has always been bad. I'm not dyslexic, but I did get extra help as a kid. In theory spellcheckers should be great for me. But in reality I seem to make spelling mistakes that leave some checkers unable to make useful suggestions. I can be just one or two vowel substitutions away from the correct spelling, but they will fail to suggest the right answer.
I should start keeping notes when that happens as I'd really like to understand what I'm doing that causes them issues, it might help me when I'm struggling to spell a word and get stuck.
Really, there are two components to a modern spell checker. First, identifying the words that are misspelled. Second, offering (good) corrections. This article is really talking about how hard even the first task was when memory was scarce. Offering good suggestions is still non-trivial.
Third: identifying words that appear in the dictionary, but are not used properly. And that's not looking at compounds and agglutinative languages. But there's no need to look that far: capitalization isn't even 100% solved. The article does software development and linguistics a bad service by suggesting spell checking is just loading a dictionary into memory.
Thank you. This is the perfect way to word this. My blood was kinda boiling reading the article because I've implemented a spell checker and it's sure as hell not trivial.
However, my experience is exactly what you described. I got it to identify / highlight misspelled words in an afternoon, but then I came to suggesting the correction and things got hairy quick.
Also, my spell checker wasn't even "smart" (aware of semantics, context, homophones, etc).
I made a toy spell checker. My way of making suggestions was to take a misspelt word (i.e. not in the dictionary), permute some letters, and look up the new words - if one was in the dictionary, offer it as a suggestion.
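That permute-and-look-up approach is essentially the one Peter Norvig popularized in his spelling corrector essay. A sketch of the candidate generation (assuming a lowercase a-z alphabet):

```python
import string

def edits1(word):
    """All strings one simple edit away from word:
    deletes, adjacent transposes, replaces, and inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R
                for c in string.ascii_lowercase]
    inserts = [L + c + R for L, R in splits
               for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, dictionary):
    """Candidates from the dictionary that are one edit away."""
    return sorted(edits1(word) & dictionary)
```

For a word of length n this generates roughly 54n + 25 candidates, which is small enough to intersect with even a large dictionary instantly.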
Interestingly enough, this video from the 1980s shows Brian Kernighan writing a one-line spellchecker in the UNIX shell. Obviously, the computer it's running on is more powerful than a 256K PC. But the point stands: some people are simply living in the future.
One of the early implementations of Unix spell used a Bloom filter to compress the word list:
“Two different hashing methods have been implemented. The first, based on a simple superimposed code scheme first proposed by Bloom [9, 10], was supplied by D. M. Ritchie and succeeded in encoding a 25,000-word list into 50,000 bytes. A more elaborate method, in which values of a conventional hash function are represented in a differential Huffman code, squeezed 30,000 words into 52,000 bytes. The stop list is handled by the same method in a different process.”
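The superimposed-coding scheme is essentially what we'd now call a Bloom filter: k hash functions set k bits per word, and membership tests can yield false positives but never false negatives. A toy version (the hash construction here is arbitrary; the real Unix spell packed its table far more tightly):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        # no false negatives; small, tunable chance of false positives
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))
```

For spell checking, a rare false positive just means one misspelling slips through, which is a fine trade for fitting 25,000 words into 50KB.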
I was recently tasked with finding an alternative to our age-old spellchecker. The main criterion was that it should be a separate API in the cloud (we are not completely in the cloud yet). Also, I couldn't use any commercial cloud spell checkers because of data privacy and security. After putting up a demo with an open source spellchecker, we ultimately decided to abandon the effort. We found out that most of our competitors were recommending their users use a browser-provided spellchecker like the one in Chrome. We followed suit because we felt that, being a relatively small development team, our time was better spent writing business logic than creating and maintaining a spellcheck service.
Another great example of this is Key Word In Context [1], which is described in this paper [2] as being possible to implement in a week or two.
My professor (Josh Bloch, in his API design class at CMU) showed us this paper, then showed us his quick implementation using standard Java APIs in less time than the rest of the lecture.
The venerable Ispell program dates back to 1971, with the most recent update in 2015 [1]. Its Contributors file traces earlier spell checkers back to 1959 [2].
I don't want to brag, but it wasn't that hard to write a spellchecker in 256KB in 1984.
I was in primary school (age 11) and we had a Microbee at home with 64KB of RAM and a copy of Turbo Pascal. My sister is/was dyslexic, so with a bit of help from my older brother (who would have been in his first year of computer science at the time) I wrote a spellchecker for her. I think it might have been a Christmas present, so I had probably been programming for less than a year at that stage, and only in the afternoons after I got back from school.
It read the whole document into memory (not the English dictionary, which was too big) in a tree structure. Pointing out that I could use a binary tree and teaching me how to use pointers was my brother's contribution. Then it read through an on-disk copy of the English dictionary. It wasn't a complete dictionary, so I didn't do any compression on the storage, and then asked the user (my sister) to review each word that wasn't in the dictionary, sending any issues out to the printer with a line number and the word in bold.
So if a kid with very little experience could do it, professionals would have had little trouble back then.
Writing a spell-checker on the computer we had before that, though (the ZX Spectrum with 16KB of RAM and the only bulk storage being a cassette tape)... that would have been hard. We've definitely progressed since then.
A spell checker isn't something in a binary state of working or not working. It's something that has the almost impossible task of working out what the user wanted, not what they asked for.
Maybe most of your problems are a single-letter mistake or a keyboard slip-up, but where good spell checkers shine is that they know what you want even when you are miles off. I find Google's spell checking to be exceptional at understanding the mapping between how a word sounds and what it actually is, even when they share very few letters.
An easy example of what I mean: if the input is "shivon" and the spell checker is able to correct this to "Siobhán", it's because it knows this is how users try to spell it when they have no idea. A simple algorithm isn't able to do this because there are no logical rules of English to follow here; you would likely need a massive amount of user data to train on to solve this test case.
Of course, I wouldn't dream of saying that spelchek.pas (as I think I called it because 11-year-old me thought that was hilarious) was state of the art at the time, nor would it be sufficient for any purpose today. But it solved the core of the problem: identifying words with a likely mis-spelling, which is all the original article was talking about.
I get the point of the article that today's improved hardware allows things that weren't possible before.
But neither spell checking nor its more advanced siblings, spelling correction and automatic spelling correction, are solved problems.
Here is an example:
"I want to by Apple"
"I want to be Apple"
"I want to buy Apple"
Because "by" is a genuine English word, a spellchecker based on a simple hash table will not detect the error, even though one obviously exists.
We need to look at the context of the whole sentence, or even the whole document. Today we can use deep learning and word embeddings to improve on that problem.
A simple spelling corrector based on edit distance can't separate "be" from "buy" here: both are a single edit away from "by", so it falls back on word frequency and always prefers "be", the more frequent of the two.
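That tie is easy to verify: under plain Levenshtein distance both candidates are one edit away from "by", so distance alone cannot choose between them. A minimal sketch:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance, one row at a time
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1,         # deletion
                                       row[j - 1] + 1,     # insertion
                                       diag + (ca != cb))  # substitution
    return row[-1]

print(edit_distance("by", "be"), edit_distance("by", "buy"))  # 1 1
```

With the distances tied, the corrector needs some other signal to break the tie: word frequency in the simplest case, sentence context in a better one.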
So the risk of all these easily accessible libraries for a given task is that you think the problem is solved, while overlooking many problems, because the topic is much deeper and more diverse than it appears at first glance.
Another point is that while hardware has improved, expectations have risen too. We expect millisecond latency instead of seconds, and we often require the software to work in a multi-tenant architecture with thousands or millions of concurrent users.
So even for seemingly simple and straightforward tasks, and despite all the exciting advances in hardware, new challenges keep presenting themselves.
That would be a grammar checker, though. And there are many solutions available today that do that, including open-source ones: Grammarly would be one example, LanguageTool another.
The ones I looked at work by encoding common spelling mistakes into a grammar (as in Chomsky grammar) and then running that over the text.
No, that's not grammar checking: those are spelling mistakes that happen to produce real words!
If you misspell "college" as "collage", or "three" as "tree", the word you typed incorrectly happens to be an actual word itself.
Correcting these types of errors is called "real-word spelling correction".
> Real-word spelling errors are words in a text that, although correctly spelled words in the dictionary, are not the words that the writer intended.
It's so weird that spellchecking isn't context-sensitive, in terms of both locality and use.
Locality: the pair "buy Apple" must be vastly more likely than "by Apple", like millions of times more.
Use: I'm writing this and my swipe keyboard offers "spellcasting" when I want "spellchecking". The page I'm on is about spellchecking, and the other word isn't one I've ever used until now. You can split these into prior-use and use-context, I guess; the former is the most annoying, always having to make the same correction.
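The locality idea can be sketched with a toy bigram model. All counts below are invented for illustration; a real model would estimate them from a large corpus:

```python
# invented bigram counts; a real model would be trained on a corpus
bigrams = {("to", "buy"): 900, ("to", "be"): 1500, ("to", "by"): 2,
           ("buy", "apple"): 400, ("be", "apple"): 3, ("by", "apple"): 1}

def best_candidate(prev_word, next_word, candidates):
    # prefer the candidate that fits both neighbours; add-one smoothing
    # keeps unseen pairs from zeroing the score out entirely
    def fit(c):
        return ((bigrams.get((prev_word, c), 0) + 1)
                * (bigrams.get((c, next_word), 0) + 1))
    return max(candidates, key=fit)

# "I want to by Apple": "by", "be" and "buy" are all one edit apart,
# but only "buy" fits the surrounding words
print(best_candidate("to", "apple", ["by", "be", "buy"]))  # buy
```

Even this crude context score flips the decision that pure frequency gets wrong, which is the point: "be" is more common overall, but "buy Apple" dominates "be Apple".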
Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. [...] That's progress.
I was curious -- This Python script consumes 9.5 MB of RAM on my Mac, which is a whole lot of 1980s-era PCs. Sure, in most cases this one decision to use a terse but unoptimized data structure won't matter much on a modern computer, but it adds up!
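The quoted loader really is only a few lines. Here's a sketch, with a tiny inline word list standing in for /usr/share/dict/words (whose location and contents vary by system):

```python
# in practice: words = {w.strip().lower() for w in open("/usr/share/dict/words")}
words = {"i", "want", "to", "be", "by", "buy", "an", "apple"}  # tiny stand-in

def misspelled(text):
    # naive whitespace tokenization; anything not in the hash table is flagged
    return [w for w in text.lower().split() if w not in words]

print(misspelled("I want to by an appel"))  # ['appel'] -- "by" sails through
```

And there in one screenful is both the progress and the limitation the thread is arguing about: trivial to write, but blind to real-word errors.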
I just want to take a moment and point out that Apple's built-in spell checker does not offer suggestions on the simplest errors.
I frequently copy and paste a word into Google just to get the correction auto-suggested and auto-searched, whereas Apple operating systems will know it's spelled wrong, as an unrecognized word, but offer no suggestions.
I wrote one for Windows back in 1990 and it was a lot of work supporting so many word processor formats. The spell checking itself wasn't too difficult.
This dictionary-listing approach might work okay for a language with little morphology, like English, but surely something smarter is required for Finnish, Turkish, etc. Sometimes you can have a million word forms for one base, so you can't really just put them all into a file.
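The combinatorics are easy to see with a toy agglutinative paradigm. The stem and suffixes below are invented for illustration, not real Finnish or Turkish morphology:

```python
import itertools

stem = "talo"  # stand-in stem; the suffix slots below are made up
slots = [
    ["", "t"],                        # number
    ["", "ni", "si", "mme"],          # possessive
    ["", "ssa", "sta", "lle", "lta"]  # case
]

# every combination of one choice per slot yields a distinct surface form
forms = {stem + "".join(parts) for parts in itertools.product(*slots)}
print(len(forms))  # 2 * 4 * 5 = 40 forms from one stem and three slots
```

Form counts multiply across slots, so a handful of slots with a handful of affixes each quickly reaches thousands of forms per stem. This is why real spellers store stems plus affix rules (hunspell's .aff files work this way) or run a morphological analyzer, rather than listing every form.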
Maybe it should've been a feat left unaccomplished. Far too often do we have software trying to carry out the impossible task of guessing what the user wants without asking. I don't want the computer to tell me what I want, I just want it to do what I tell it to do.
I'm particularly interested in this one, and I'm curious about how useful something like this would be to use.
> The second does not use a dictionary at all (Morris & Cherry 1975). Like the previous method, it divides the text into trigrams, but it creates a table of these, noting how often each one occurs in this particular piece of text. It then goes through the text again calculating an index of peculiarity for each word on the basis of the trigrams it contains. Given pkxie, for instance, it would probably find that this was the only word in the text containing pkx and kxi (and possibly xie too), so it would rate it highly peculiar. The word fairy, by contrast, would get a low rating since fai, air and iry probably all occur elsewhere, perhaps quite often, in the passage being analysed. Having completed its analysis, it draws the user's attention to any words with a high peculiarity index. Like the previous method, it would fail to spot a high proportion of ordinary spelling errors, but it is quite good at spotting typing errors, which it was designed for. An advantage that it has over all dictionary-based methods is that it is not tied to English; it will work on passages of, say, French, German or Greek.
(The Morris there is Bob Morris).
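A rough reconstruction of the idea is straightforward. The scoring formula below (mean log of inverse trigram frequency within the text itself) is my guess at a plausible peculiarity index, not the exact formula from Morris & Cherry:

```python
import math
from collections import Counter

def word_trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

def peculiarity(text):
    # first pass: count every trigram in this particular text
    words = [w.strip(".,;!?").lower() for w in text.split()]
    counts = Counter(t for w in words for t in word_trigrams(w))
    total = sum(counts.values())

    # second pass: rare trigrams -> high score (a guessed index,
    # not the paper's exact formula)
    def score(w):
        tris = word_trigrams(w)
        if not tris:
            return 0.0
        return sum(math.log(total / counts[t]) for t in tris) / len(tris)

    return sorted(set(words), key=score, reverse=True)

# "pkxie" contains trigrams seen nowhere else in the text, so it tops the list
print(peculiarity("fairy airy fair dairy pkxie")[0])  # pkxie
```

Note that, exactly as the quote says, no dictionary is involved: the text is its own reference, so the same code flags suspicious typos in French or Greek just as well, as long as the typo produces unusual letter sequences.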