I'm particularly interested in this one, and I'm curious how useful something like this would be in practice.
> The second does not use a dictionary at all (Morris & Cherry 1975). Like the previous method, it divides the text into trigrams, but it creates a table of these, noting how often each one occurs in this particular piece of text. It then goes through the text again calculating an index of peculiarity for each word on the basis of the trigrams it contains. Given pkxie, for instance, it would probably find that this was the only word in the text containing pkx and kxi (and possibly xie too), so it would rate it highly peculiar. The word fairy, by contrast, would get a low rating since fai, air and iry probably all occur elsewhere, perhaps quite often, in the passage being analysed. Having completed its analysis, it draws the user's attention to any words with a high peculiarity index. Like the previous method, it would fail to spot a high proportion of ordinary spelling errors, but it is quite good at spotting typing errors, which it was designed for. An advantage that it has over all dictionary-based methods is that it is not tied to English; it will work on passages of, say, French, German or Greek.
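The quoted method is easy to sketch. Below is a rough, hypothetical version of the idea: score each word by the mean rarity of its trigrams within the document itself. The exact peculiarity index Morris & Cherry used differs; this just uses a mean negative log-frequency as a stand-in.

```python
from collections import Counter
import math

def trigrams(word):
    padded = f" {word} "          # pad so word boundaries count too
    return [padded[i:i+3] for i in range(len(padded) - 2)]

def peculiarity(text):
    """Rank the document's words from most to least peculiar,
    based only on trigram statistics of this text (no dictionary)."""
    words = text.lower().split()
    counts = Counter(t for w in words for t in trigrams(w))
    total = sum(counts.values())

    def score(word):
        # rare trigrams push the score up; common ones pull it down
        ts = trigrams(word)
        return sum(-math.log(counts[t] / total) for t in ts) / len(ts)

    return sorted(set(words), key=score, reverse=True)
```

Run on a passage containing "pkxie", that word floats to the top because every one of its trigrams occurs exactly once, while "fairy" sinks because fai and air recur elsewhere.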
Great reference. This highlights an interesting property of 'reasonably-solved' problems like spell checking and why they might be so much more straightforward now.
It's often not that we've developed groundbreaking algorithms that make solving the underlying problems intrinsically easier - the techniques we're using (like ngram modeling, in this case) may be the result of research work decades ago.
Instead, the difference is that the fruit of that prior work has been implemented and made available in more accessible forms (libraries and source code), becoming easier to reuse, reason about, and modify thanks to abstraction and languages that have evolved to express similar problems more naturally.
(upgrades in hardware and resources certainly help advancement too, but spell checking's probably a good example of a situation where an efficiently-designed implementation's likely to be noticeably more responsive, whether it's 1980 or 2020)
From what they said above, it sounds like that's exactly the kind of thing it would catch unless you consistently wrote "ture" and "flase" many times in the same document
> From what they said above, it sounds like that's exactly the kind of thing it would catch unless you consistently wrote "ture" and "flase" many times in the same document
The trigrams tur and ure are not that uncommon, nor fla or ase (though maybe they are less common in a programming context, but even then it might depend on the domain).
I have abbrev-mode in Emacs for that. No more "teh" instead of "the". I found most of them went away when I switched to Dvorak, since most of my errors like that happen when the letters are all on one hand. "teh", however, became the all-time high: 4000 corrections this year according to abbrev-mode. And English isn't even the language I write the most.
If you just want to fix typos then it seems to me that Damerau-Levenshtein distance would be the best approach - a DL distance of 1 from a dictionary word is a high likelihood of a transposition or a missing or extra character.
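For illustration, here's a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance, which counts insertions, deletions, substitutions, and adjacent transpositions:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + cost)  # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Typical keyboard slips like "teh", "ture" and "flase" all sit at distance 1 from their dictionary words, which is what makes this metric so handy for typo correction.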
I've thought of a variation on that, if you're spell checking a large text file. Create an index of all the unique words in the file, along with how many times each appears. Any words with a count of 1 are likely to be misspelled.
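That variation is a few lines of Python. A sketch (with the obvious caveat that it also flags rare-but-correct words, which is the same weakness the trigram method has):

```python
import re
from collections import Counter

def suspicious_words(text, min_count=2):
    """Flag words that occur fewer than min_count times.
    In a long document, genuinely used words tend to repeat,
    so one-off spellings are good candidates for typos."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [w for w, n in counts.items() if n < min_count]
```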
It'd probably be better to look for words with a small Levenshtein distance that look like typos, because people tend to misspell some words repeatedly, especially in larger texts.
A while ago, I was considering building a text editor that ran SQL for a very specific purpose at work. I mused about adding a syntax checker/suggester. I wonder if the above approach might be better than a lookup table.
In my first job I worked for Tasman in Leeds and produced a Word Processor for IBM PC compatibles in 8086 assembler with some help, and then a spelling checker.
For the spelling checker I did a whole load of analysis on a 70,000 word list from Collins and produced a list of tokens to represent common strings of letters. However, in the end I really had to cut the original word list down to get the whole thing onto a single 360K floppy.
After I left Tasman, I was lying in bed one night still thinking about it and realised where I had gone wrong. The tokenising thing, which someone else had put me onto, had blinded me. I had stared at word lists for months and hadn't pinned down the obvious pattern. All but 26 words in the 70,000-word list share the bulk of their characters with the word before them.
So the solution is to use 5 bits of the first byte as a count of characters shared with the previous word, while the other 3 bits indicate common endings (ship, s, ing) or that the following bytes are tokens for the rest of the word. With this I got the word list compressed to less than 2 bytes per word.
I took this back to Tasman. They put all 70,000 words and the spelling checker onto a 175K floppy for the ZX Spectrum +3.
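The core of that trick is what's often called front coding. Here's a stripped-down illustration of just the shared-prefix part (no ending codes or letter tokens, and using a tuple per word rather than packed bits):

```python
def front_encode(sorted_words):
    """Encode each word as (length of prefix shared with the
    previous word, remaining suffix)."""
    out, prev = [], ""
    for w in sorted_words:
        n = 0
        while n < min(len(w), len(prev)) and w[n] == prev[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

def front_decode(encoded):
    """Rebuild the word list by reusing each previous word's prefix."""
    words, prev = [], ""
    for n, suffix in encoded:
        word = prev[:n] + suffix
        words.append(word)
        prev = word
    return words
```

On a sorted dictionary, most suffixes are only a character or two, which is how a 70,000-word list can get down to a couple of bytes per word once the counts are packed into spare bits.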
In comparison, here's a quote from the OP's blog entry:
“Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it. Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed. The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
That's progress.”
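For what it's worth, the "3-5 lines" claim checks out; a minimal version looks something like this (assuming /usr/share/dict/words exists, as it does on most Unix systems):

```python
def load_dictionary(path="/usr/share/dict/words"):
    """Load a newline-separated word list into a set for O(1) lookups."""
    with open(path) as f:
        return {line.strip().lower() for line in f}

def check(text, dictionary):
    """Return the words in text that are not in the dictionary."""
    return [w for w in text.lower().split() if w not in dictionary]
```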
But is a simpler, less efficient method progress? Sure it allows more words to be added/removed with ease, and I don’t want to advocate over-optimization, but the solution you made for the Spectrum seems better because words don’t change much. Why don’t we use a similar specialized hash and compressed dictionary format to increase spellchecking speed and allow more words in less space? We could still produce that format using /usr/share/dict/words and similar.
Problems tend to have more than one solution. GP's solution should be documented, yes, but the alternate solution that won out was computers being capable of storing a million words or so in plaintext very easily, and doing the same using their compression scheme just isn't really worth the space saved nowadays.
Also, compressing could actually be slower for modern computers. Remember when compressing your hard disk made your PC faster, up until disks became faster, then it actually made it slower?
Today's CPUs are very fast, so the trend could have flipped again, that would be an interesting benchmark.
How many operations and objects? The method he's talking about would seem more efficient for the purpose than the plain hash version, where all of the strings are still created even if never used.
They could have been non-copyrighted at the time. Database rights didn't apply in the UK until 1998, so I think it would be fine to have an intern type in just the words into the computer and not infringe on anything.
I don't know/remember the terms under which this happened. I do remember that when I was trimming it down I found the list contained the trademarks of a competitor. These were removed.
In the case of the ZX Spectrum +3 there wasn't enough disk or memory to hold the uncompressed word list, never mind some form of tree structure.
The search method involved uncompressing each word (one at a time) until either the word was found or one that should follow it.
However, I did intend to use a binary search of the word list on the floppy. I arranged it so any zero bytes in the list indicated the start of a word from which decompression could start. Under maximum compression there would be only 26 zero bytes in the list, but by selecting short words at regular intervals I could sprinkle zeros throughout the list (approx. 1 per block). A binary search could scan for the zeros, decompress the associated word and find the section that should contain the target word.
Tasman didn't go for this. They sorted all the words to be checked in memory, then opened the file and did a complete scan.
A good spell checker is still a hard engineering problem, despite the hardware progress.
Just a hash map ain't gonna work.
The only reason spell checking is perceived as a solved problem is availability of libraries.
Here's an open source example https://github.com/hunspell/hunspell way above 10k lines of code.
I speak 4 languages, and in my experience what's in Microsoft Office is the best one I used so far.
I suppose it is worth giving a shout-out to a recent Hunspell port to Python by Zverok: https://github.com/zverok/spylls, this description from GitHub sums it up nicely:
> Hunspell is a long-living, complicated, almost undocumented piece of software, and it was our feeling that the significant part of human knowledge is somehow "locked" in a form of a large C++ project. That's how Spylls was born: as an attempt to "unlock" it, via well-structured and well-documented implementation in a high-level language.
It's incredible how much work has been done (along with documenting algorithms!) in this one-man project.
Exactly. It's a very important argument against calling it a solved problem. I believe the problem with Hunspell is that it's very hard to get into fixing it (it's good, but it's not perfect), and spylls makes that feasible again.
To be fair, Hunspell was written specifically for Hungarian because it's such a difficult language to handle efficiently in a spell checker. It just happens to also work for other languages.
If you only need English, the complexity of Hunspell is not required.
I still haven't found a decent Hungarian spell checker, they get confused by rare words that have the same letters as a very common word, but different accents.
E.g.: The word "és" means "and", which makes it much more common than "es", which is the word root for "fall" and is rarely used without a suffix like "esett" or "estek".
I have yet to see a spellchecker that's smart enough to fix "something es something" but not "le es".
First of all, it's a verb with an -ik ending, so you can't write "es"; the root lexical word is "esik".
Second, you never use anything like "le es" because generally you always have to use the verb and prefix together ("leesett", "leesik" etc.), unless you use a commanding form ("ess le") or a modal verb ("le akart esni", "le fog esni" etc.)
"le es" is strictly wrong because it's the wrong verb (it should be "esik") and you have to write them together ("leesik"), so obviously a spellchecker picks that up.
If French is one of your languages, the state of the art has been Druide's Antidote for more than 20 years. And it has had a Linux version available for almost 15 years, too.
I would argue that it's not a hard engineering problem. It's just a problem that should be solved. Linguists have formalized a lot about language. There are rules, there are known lists of exceptions, there are dictionaries. You just have to implement those things carefully with lots of edge cases, and implement them for every language, but it's just mundane engineering work, nothing that requires a breakthrough.
Currently spell checking just checks words against a dictionary. That's 1% of the work, or even less.
Microsoft Word used to have awesome spell checking, grammar checking, etc. back in 2000 for the Russian language. It seems to have degraded since then. But the fact is, this problem was solved. It checked spelling, grammar, and punctuation. I would expect that kind of functionality to be working in every OS textbox by now.
>You just have to implement those things carefully with lots of edge-cases,
Say hello to homophones, ambiguous phrases and double meaning.
In Spanish a lot of verbs share the singular first and third persons. "Estaba" can mean "I was" or "He/she was.". Or "fuimos", (we were); but also "we went".
Google's spell checker still is a major feat of software engineering ;-) It must handle every language, handle new words constantly being coined, have low enough latency and high enough throughput to run it on every web search. (Well, caching probably solves 2/3 of traffic.)
While it's not bad, it's not that great a universal tool either. My native language makes use of a lot of compound words and Google's spell checker often gets confused when I combine words according to the standard grammar.
I can see and understand the technical limitations, but tools like Microsoft Word seem to do a much better job than Google's spell checker, even in things like Google Docs.
Google search will often suggest splitting up words and sometimes even does it transparently, which can give entirely wrong results because suddenly Google matches words across a sentence instead of specific compound words. It's kind of frustrating to have to resort to quotation marks for some single-word search terms.
I get the feeling Google's spell checker doesn't check spelling, it just tweaks the input until it manages to find more results. Not quite the same, because a lot of "fixes" often have entirely different meanings in my experience.
Yeah, exactly! - this is why building the perfect spell checker is still a challenging and interesting engineering problem, in 2020. The nature of misspelling varies from language to language, especially with non-letter-based languages like CJK, or with input methods that lead to different sorts of typos than a regular keyboard.
Out of curiosity, what is your native language, Dutch? Can you give an example of something that Google’s spell checker screws up?
Not all spellcheckers get it wrong, but almost all that rely on statistics will: hoofdaannemer (or hoofd-anything) will occur far less often than hoofd or aannemer.
This, I fear, is shaping our language use. As people rely on tech, they'll follow the suggestions, and wrong use of spaces will become an actual problem.
This exactly. I've often wondered if the default spell checkers that come with phones nowadays are slowly changing the language itself. If Google doesn't fix these issues in the nearish future, an entire generation will grow up with a device that tells them the spelling they learnt in school is wrong.
And with so many people just accepting Google's anti-compound spelling as correct, many people will be mostly exposed to a version of their language with more spaces, and it'll start to "feel" right to them.
Well, yes, it has to work that way to handle searches for new companies that chose their name based on some non-word with an available domain. If someone types "nvidea" they probably mean "nvidia", and Google doesn't handle that with a dictionary. Same for the handles of social media celebrities.
I dunno, really. It is much better than Word for English, but it is laughable when I jumble several languages in a single document, and it doesn't handle some languages that rely on code-switching. At least Word knows its boundaries.
Their Google Search spellchecker is good, but it's ironic that their Android keyboard spell checker is quite literally the worst I have ever used. Years in, and it's maybe 5% as accurate as aspell.
I feel it's more statistics based than true spellchecking? At least I often feel that it wants to rewrite my sentences to something completely different because what I'm searching/writing is uncommon but similar to a more common word.
It's a nice demo and a good tutorial if you are interested in spell checkers. But IMO it did more harm than good. A lot of libraries started to implement it, but its performance is quite horrible.
...and yet the first thing I switch off in every tool is the spellchecker, because they get utterly confused with the mix of English, German and my local dialect that I'm using when communicating with different people. I'd say doing a spellchecker "right" is apparently still a major feat of software engineering ;)
I wish it just struggled with the mix of multiple languages, it can't even handle my native one. A plural compound word in translative case with a clitic? I can only dream about it, it can't even do all the cases reliably.
I guess what I'm wondering is how obscure it has to get (or perhaps how low the GDP of the people speaking a language has to be) before custom spell checking rules aren't considered worth it to bigcorps anymore. Though I'm also kinda interested in what this weird language thing is you're talking about.
Microsoft has an okay Finnish spellchecker. It's really a shame no Finnish university has developed an acceptable open source Finnish spellchecker though AFAIK.
Word-list based checkers actually work for English, which gives an impression that they would work for all types of languages. For agglutinative languages (which I think is about 50% of world languages) not so much, since the list of valid words is basically infinite.
But combined with a high-quality root word list, it's possible to write a program that determines whether a given input word is a valid form of some known root word, i.e. a spellchecker.
In English, you can think of the problem as similar to writing a spellchecker that can tell whether an input word is a valid chemical name for a molecule, like 1,3,7-trimethylxanthine (caffeine) or adenosine-5'-triphosphate (ATP). Clearly, we cannot write an exhaustive list of valid words here. But it's pretty easy to write a program that can tell whether the input word at least looks like a valid chemical name devoid of typos.
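A toy version of that "valid form of known roots" check, for compounding languages, is a recursive split against a root list. Real systems use proper morphological analyzers (finite-state transducers, affix rules); this sketch only handles plain concatenation, and the root list and minimum part length are made up for illustration:

```python
from functools import lru_cache

def make_checker(roots, min_part=3):
    """Return a function that accepts a word if it is a known root
    or can be split entirely into known roots (toy compound checker)."""
    roots = set(roots)

    @lru_cache(maxsize=None)
    def is_valid(word):
        if word in roots:
            return True
        # try every split into a known root plus a valid remainder
        return any(word[:i] in roots and is_valid(word[i:])
                   for i in range(min_part, len(word) - min_part + 1))

    return is_valid
```

With the Dutch example from upthread, hoofd + aannemer validates as a compound even though "hoofdaannemer" never appears in the word list itself.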
I often write in a mixture of Scots and English, and spellcheckers are nae guid for that use case. The problem is solved, however, on iOS and Android thanks to SwiftKey[1].
Yeah, to add to it: a lot of Germans use American English but some use British English, and my spellchecker can't handle the mixture. Off topic, but despite how difficult German has been to learn, I've found spelling in German to be far more predictable and straightforward than in my native language, English.
Right, I vaguely remember hearing about this. Another benefit is you usually have a fairly good idea how a word is pronounced, which is not the case for English or French in my experience. In some ways I guess American English attempts to standardise and fix some rough edges of English.
American English changes a few cases, but there are often exceptions (color, favor, but four, glamour; -ize, but still advertise, compromise etc).
There are too many differences for me to summarize (American and Oxford English spelling there), but I think the problems with English spelling are much deeper than American English can fix.
This is a rather bad article because it completely misses the real complexity of a spell checker.
A spell checker is not simply a list of words, it's a way to check mistakes according to a standard and to point towards ways to fix these mistakes. This not reducible to a look-up in a hashtable. It requires taking into account some complicated things about the definition of a word and the context in which it is written.
You might think that's grammar checking but the boundary is not clear and in any case, any language processing application starts with tokenizing and deciding what counts as words and on what basis.
What is a word, even? Is "CIA" a word? What about "C.I.A."? What about C (as in the language)? What about c (as in the speed of light)? 2,4-Dinitrophenylhydrazine?
How does the spellchecker handle dashes and apostrophes? What about proper nouns?
The feature creep is irrelevant to the article. In 1984, a spell checker could be just a list of words and would still be a major engineering challenge.
Arguably, detecting typographical (or transcription) errors is still non-trivial today since a) computing edit distances against a large dictionary is expensive and b) selecting the correct spelling often depends on grammar as well as semantic context.
For example, consider the erroneous phrase "he was put through the ringer." Although "ringer" matches a spelling in the dictionary, it doesn't make sense semantically (a "ringer" being a device that rings bells, a near-duplicate of something else, etc.) and the proper idiom is "put through the wringer" (since a wringer is/was a device to squeeze water out of a wet mop or wet laundry. Squeezing someone through a pair of rollers is particularly evocative.)
Although you do see "nerve-wracking" or "wracking" (i.e. wrecking) one's brain, the more traditional "racking" (literally to torture by stretching on a medieval rack) seems more appropriate (although the term "nervous wreck" is common.) Shakespeare may have exploited the pun of "wrack" vs. "rack," so perhaps we can also.
"Security breaches" and "security breeches" sound alike but have somewhat dissimilar meanings. Network and system administrators might consider donning the latter in preparation for the former.
Yes. Everything up until the last two sections is the prompt; then I let GPT-3 fill in the responses.
Non-standard English: Please provide me with a short brief of the design youre looking four and some examples or previous projects youve done would be helpful.
Standard American English: Please provide me with a short brief of the design you’re looking for and some examples or previous projects you’ve done would be helpful.
Non-standard English: If Im stressed out about something, I tent to have a problem falling asleep.
Standard American English: If I’m stressed out about something, I tend to have a problem falling asleep.
Non-standard English: There are plenty off fun things too do in the summer when you are able two go outside.
Standard American English: There are plenty of fun things to do in the summer when you are able to go outside.
Non-standard English: She didnt go to hte market.
Standard American English: She didn't go to the market.
Non-standard English: The database sever was updated.
Standard American English: The database server was updated.
Non-standard English: The next procedure is to sever the artery.
Standard American English: The next procedure is to sever the artery.
(In case you're wondering, if I provide 'server' as the input for the second case, it replaces it with 'serve'. Which is reasonable. I tried changing the wording to coax it into placing sever but didn't have much luck)
Indeed, I think most linguistic tasks could benefit from the neural (transformer) approach, even a tiny spellchecker. The combination of semi-regular rules, semantics and logic makes a highly effective "manual" solution too expensive to obtain (in terms of research and programming time).
I don't think I ever made that specific mistake, but (as a non native speaker) I spent a couple minutes staring at your example and was about to write a comment asking what mistake you were talking about before I spotted the typo.
Yes, at least for Levenshtein-style edits (substitution, insertion, deletion).
Although, I suppose a spell checker algorithm would be O(mnd) where d is the size of the dictionary, because it needs to compute O(mn) edit distance for each of the dictionary words.
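One standard way to avoid that O(mnd) scan is a BK-tree, which indexes the dictionary under the edit-distance metric and uses the triangle inequality to prune whole subtrees at query time. A rough sketch (with a throwaway Levenshtein implementation for the metric):

```python
def levenshtein(a, b):
    """Plain O(mn) Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j-1] + 1,           # insertion
                           prev[j-1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree over a metric: children are keyed by their distance to
    the parent, so a query can skip branches outside [d-k, d+k]."""
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # (word, {dist: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, max_dist):
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            for child_d, child in node[1].items():
                # triangle inequality: only these children can match
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

In practice this prunes most of the dictionary for small distance bounds, which is what you want for typo correction.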
This isn't in any way a criticism, but I found it really amusing that your post contained almost exactly the error you were describing ("hat") before correction.
Our ability to build simple spellcheckers has maybe improved, but our standards have risen as well. Back then it might have been amazing; nowadays it's boring and every program has it, even my Firefox browser.
A spellchecker is only a limited solution to the task of proofreading. I've never used the Grammarly product, but at least from the ads it seems it can catch many things that human proofreaders would catch. There might still be domain-specific things that require humans.
But the grammarly product has way more logic behind it than just using a simple table of words. No idea if it's parametric or uses an ML language model, or maybe a combination, but the feature set wouldn't be possible with just a table.
So instead of a lone employee being tasked to write a spellchecker, you suddenly have an entire industry focused on NLP. That's what the growth of the computing industry looks like :).
The article on another thread [1] brought up "Andy and Bill's Law":
"For every cycle a hardware engineer saves, a software engineer will add two instructions."
I thought of that while reading the last paragraph:
"Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being."
Spell chequer by Martha Snow
Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.
Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.
As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.
Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.
I especially like the spellchecker added to the D compiler. The neato feature is the "dictionary" is the part of the symbol table that is in scope. In my usage it guesses right about 50-75% of the time.
I've been considering adding one to my text editor. I found out I'm not as good a speller as I thought I was before spellcheckers :-/
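The "symbol table as dictionary" idea can be approximated in very little code. A hypothetical sketch using Python's stdlib difflib as the similarity measure (a real compiler like D's uses its own edit-distance pass over in-scope symbols):

```python
import difflib

def suggest_symbol(name, symbols_in_scope, cutoff=0.6):
    """Suggest the closest in-scope identifier for an unknown name,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, symbols_in_scope,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The nice property is that the "dictionary" is tiny and always relevant, which is why even a crude similarity measure guesses right so often.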
> I've been considering adding one to my text editor. I found out I'm not as good a speller as I thought I was before spellcheckers :-/
We all make mistakes and it’s okay, there is no shame in it! Let the computer fill in the gaps for you so you can use your brain for other things :) I shamelessly use a spell checker in my IDE and it has been a very positive experience, highly recommend!
My spelling has always been bad. I'm not dyslexic, but I did get extra help as a kid. In theory spellcheckers should be great for me. But in reality I seem to make spelling mistakes that leave some checkers unable to make useful suggestions. I can be just one or two vowel substitutions away from the correct spelling, but they will fail to suggest the right answer.
I should start keeping notes when that happens as I'd really like to understand what I'm doing that causes them issues, it might help me when I'm struggling to spell a word and get stuck.
Really, there are two components to a modern spell checker. First, identifying the words that are misspelled. Second, offering (good) corrections. This article is really talking about how hard even the first task was when memory was scarce. Offering good suggestions is still non-trivial.
Third: identifying words that appear in the dictionary, but are not used properly. And that's not looking at compounds and agglutinative languages. But there's no need to look that far: capitalization isn't even 100% solved. The article does software development and linguistics a bad service by suggesting spell checking is just loading a dictionary into memory.
Thank you. This is the perfect way to word this. My blood was kinda boiling reading the article because I've implemented a spell checker and it's sure as hell not trivial.
However, my experience is exactly what you described. I got it to identify / highlight misspelled words in an afternoon, but then I came to suggesting the correction and things got hairy quick.
Also, my spell checker wasn't even "smart" (aware of semantics, context, homophones, etc).
I made a toy spell checker. My way of making suggestions was to take a misspelt word (i.e. not in the dictionary), permute some letters, and look up the new words - if one was in the dictionary, offer it as a suggestion.
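That permute-and-look-up approach is essentially the one Peter Norvig popularized in his spelling corrector essay. A sketch of the candidate generation (assuming a lowercase a-z alphabet):

```python
import string

def edits1(word):
    """All strings one simple edit away from word:
    deletes, adjacent transposes, replaces, and inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R
                for c in string.ascii_lowercase]
    inserts = [L + c + R for L, R in splits
               for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, dictionary):
    """Candidates from the dictionary that are one edit away."""
    return sorted(edits1(word) & dictionary)
```

For a word of length n this generates roughly 54n + 25 candidates, which is small enough to intersect with even a large dictionary instantly.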
Interestingly enough, this video from the 1980s shows Brian Kernighan writing a one-line spellchecker in the UNIX shell. Obviously, the computer it's running on is more powerful than a 256K PC. But the point stands: some people are simply living in the future.
One of the early implementations of Unix spell used a Bloom filter to compress the word list:
“Two different hashing methods have been implemented. The first, based on a simple superimposed code scheme first proposed by Bloom [9, 10], was supplied by D. M. Ritchie and succeeded in encoding a 25,000-word list into 50,000 bytes. A more elaborate method, in which values of a conventional hash function are represented in a differential Huffman code, squeezed 30,000 words into 52,000 bytes. The stop list is handled by the same method in a different process.”
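The superimposed-coding scheme is essentially what we'd now call a Bloom filter: k hash functions set k bits per word, and membership tests can yield false positives but never false negatives. A toy version (the hash construction here is arbitrary; the real Unix spell packed its table far more tightly):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        # no false negatives; small, tunable chance of false positives
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))
```

For spell checking, a rare false positive just means one misspelling slips through, which is a fine trade for fitting 25,000 words into 50KB.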
I was recently tasked with finding an alternative to our age-old spellchecker. The main criterion was that it should be a separate API in the cloud (we are not completely in the cloud yet). Also, I couldn't use any commercial cloud spell checkers because of data privacy and security. After putting up a demo with an open source spellchecker, we ultimately decided to abandon the effort. We found out that most of our competitors were recommending their users use a browser-provided spellchecker like the one in Chrome. We followed suit because we felt that, being a relatively small development team, our time was better spent writing business logic than creating and maintaining a spellcheck service.
Another great example of this is Key Word In Context [1], which is described in this paper [2] as being possible to implement in a week or two.
My professor (Josh Bloch, in his API design class at CMU) showed us this paper, then showed us his quick implementation using standard Java APIs in less time than the rest of the lecture.
The venerable Ispell program dates back to 1971, with the most recent update in 2015 [1]. Its Contributors file traces earlier spell checkers back to 1959 [2].
I don't want to brag, but it wasn't that hard to write a spellchecker in 256KB in 1984.
I was in primary school (age 11) and we had a Microbee at home with 64KB of RAM and a copy of Turbo Pascal. My sister is/was dyslexic, so with a bit of help from my older brother (who would have been in his first year of computer science at the time) I wrote a spellchecker for her. I think it might have been a Christmas present, so I had probably been programming for less than a year at that stage, and only in the afternoons after I got back from school.
It read the whole document into memory (not the English dictionary, which was too big) in a tree structure. Pointing out that I could use a binary tree and teaching me how to use pointers was my brother's contribution. Then it read through an on-disk copy of the English dictionary. It wasn't a complete dictionary, so I didn't do any compression on the storage, and then asked the user (my sister) to review each word that wasn't in the dictionary, sending any issues out to the printer with a line number and the word in bold.
So if a kid with very little experience could do it, professionals would have had little trouble back then.
Writing a spell-checker on the computer we had before that, though (the ZX Spectrum with 16KB of RAM and the only bulk storage being a cassette tape)... that would have been hard. We've definitely progressed since then.
A spell checker isn't something in a binary state of working or not working. It's something that has the almost impossible task of working out what the user wanted, not what they asked for.
Maybe most of your problems are a single-letter mistake or a keyboard slip-up, but where good spell checkers shine is that they know what you want even when you are miles off. I find Google's spell checking to be exceptional at understanding the mapping between how a word sounds and what it actually is, even when they share very few letters.
An easy example of what I mean: if the input is "shivon" and the spell checker is able to correct this to "Siobhán", it's because it knows this is how users try to spell it when they have no idea. A simple algorithm isn't able to do this because there are no logical rules of English to follow here; you would likely need a massive amount of user data to train on to solve this test case.
Of course, I wouldn't dream of saying that spelchek.pas (as I think I called it because 11-year-old me thought that was hilarious) was state of the art at the time, nor would it be sufficient for any purpose today. But it solved the core of the problem: identifying words with a likely mis-spelling, which is all the original article was talking about.
I get the point of the article that today's improved hardware allows things that weren't possible before.
But neither spell checking nor its more advanced siblings, spelling correction and automatic spelling correction, are solved problems.
Here is an example:
"I want to by Apple"
"I want to be Apple"
"I want to buy Apple"
Because "by" is a genuine English word, a spellchecker based on a simple hash table will not detect the error, even though one obviously exists.
We need to look at the context of the whole sentence, or even the whole document. Today we can use deep learning and word embeddings to improve on that problem.
A simple spelling corrector based on edit distance can't separate "be" from "buy" here: both are a single edit away from "by", so it falls back on word frequency and always prefers "be", the more frequent of the two.
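That tie is easy to verify: under plain Levenshtein distance both candidates are one edit away from "by", so distance alone cannot choose between them. A minimal sketch:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance, one row at a time
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1,         # deletion
                                       row[j - 1] + 1,     # insertion
                                       diag + (ca != cb))  # substitution
    return row[-1]

print(edit_distance("by", "be"), edit_distance("by", "buy"))  # 1 1
```

With the distances tied, the corrector needs some other signal to break the tie: word frequency in the simplest case, sentence context in a better one.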
So the risk of all these easily accessible libraries for a given task is that you think the problem is solved, while overlooking many problems, because the topic is much deeper and more diverse than it appears at first glance.
Another point is that while hardware has improved, expectations have risen too. We expect millisecond latency instead of seconds, and we often require the software to work in a multi-tenant architecture with thousands or millions of concurrent users.
So even for seemingly simple and straightforward tasks, and despite all the exciting advances in hardware, new challenges keep presenting themselves.
That would be a grammar checker, though. And there are many solutions available today that do that, including open-source ones: Grammarly would be one example, LanguageTool another.
The ones I looked at work by encoding common spelling mistakes into a grammar (as in Chomsky grammar) and then running that over the text.
No, that's not grammar checking: those are spelling mistakes that happen to produce real words!
If you misspell "college" as "collage", or "three" as "tree", the word you typed incorrectly happens to be an actual word itself.
Correcting these types of errors is called "real-word spelling correction".
> Real-word spelling errors are words in a text that, although correctly spelled words in the dictionary, are not the words that the writer intended.
It's so weird that spellchecking isn't context-sensitive, in terms of both locality and use.
Locality: the pair "buy Apple" must be vastly more likely than "by Apple", like millions of times more.
Use: I'm writing this and my swipe keyboard offers "spellcasting" when I want "spellchecking". The page I'm on is about spellchecking, and the other word isn't one I've ever used until now. You can split these into prior-use and use-context, I guess; the former is the most annoying, always having to make the same correction.
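The locality idea can be sketched with a toy bigram model. All counts below are invented for illustration; a real model would estimate them from a large corpus:

```python
# invented bigram counts; a real model would be trained on a corpus
bigrams = {("to", "buy"): 900, ("to", "be"): 1500, ("to", "by"): 2,
           ("buy", "apple"): 400, ("be", "apple"): 3, ("by", "apple"): 1}

def best_candidate(prev_word, next_word, candidates):
    # prefer the candidate that fits both neighbours; add-one smoothing
    # keeps unseen pairs from zeroing the score out entirely
    def fit(c):
        return ((bigrams.get((prev_word, c), 0) + 1)
                * (bigrams.get((c, next_word), 0) + 1))
    return max(candidates, key=fit)

# "I want to by Apple": "by", "be" and "buy" are all one edit apart,
# but only "buy" fits the surrounding words
print(best_candidate("to", "apple", ["by", "be", "buy"]))  # buy
```

Even this crude context score flips the decision that pure frequency gets wrong, which is the point: "be" is more common overall, but "buy Apple" dominates "be Apple".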
Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. [...] That's progress.
I was curious -- This Python script consumes 9.5 MB of RAM on my Mac, which is a whole lot of 1980s-era PCs. Sure, in most cases this one decision to use a terse but unoptimized data structure won't matter much on a modern computer, but it adds up!
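The quoted loader really is only a few lines. Here's a sketch, with a tiny inline word list standing in for /usr/share/dict/words (whose location and contents vary by system):

```python
# in practice: words = {w.strip().lower() for w in open("/usr/share/dict/words")}
words = {"i", "want", "to", "be", "by", "buy", "an", "apple"}  # tiny stand-in

def misspelled(text):
    # naive whitespace tokenization; anything not in the hash table is flagged
    return [w for w in text.lower().split() if w not in words]

print(misspelled("I want to by an appel"))  # ['appel'] -- "by" sails through
```

And there in one screenful is both the progress and the limitation the thread is arguing about: trivial to write, but blind to real-word errors.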
I just want to take a moment and point out that Apple's built-in spell checker does not offer suggestions on the simplest errors.
I frequently copy and paste a word into Google just to get the correction auto-suggested and auto-searched, whereas Apple operating systems will know it's spelled wrong, as an unrecognized word, but offer no suggestions.
I wrote one for Windows back in 1990 and it was a lot of work supporting so many word processor formats. The spell checking itself wasn't too difficult.
This dictionary-listing approach might work okay for a language with little morphology, like English, but surely something smarter is required for Finnish, Turkish, etc. Sometimes you can have a million word forms for one base, so you can't really just put them all into a file.
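The combinatorics are easy to see with a toy agglutinative paradigm. The stem and suffixes below are invented for illustration, not real Finnish or Turkish morphology:

```python
import itertools

stem = "talo"  # stand-in stem; the suffix slots below are made up
slots = [
    ["", "t"],                        # number
    ["", "ni", "si", "mme"],          # possessive
    ["", "ssa", "sta", "lle", "lta"]  # case
]

# every combination of one choice per slot yields a distinct surface form
forms = {stem + "".join(parts) for parts in itertools.product(*slots)}
print(len(forms))  # 2 * 4 * 5 = 40 forms from one stem and three slots
```

Form counts multiply across slots, so a handful of slots with a handful of affixes each quickly reaches thousands of forms per stem. This is why real spellers store stems plus affix rules (hunspell's .aff files work this way) or run a morphological analyzer, rather than listing every form.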
Maybe it should've been a feat left unaccomplished. Far too often do we have software trying to carry out the impossible task of guessing what the user wants without asking. I don't want the computer to tell me what I want, I just want it to do what I tell it to do.
I'm particularly interested in this one, and I'm curious about how useful something like this would be to use.
> The second does not use a dictionary at all (Morris & Cherry 1975). Like the previous method, it divides the text into trigrams, but it creates a table of these, noting how often each one occurs in this particular piece of text. It then goes through the text again calculating an index of peculiarity for each word on the basis of the trigrams it contains. Given pkxie, for instance, it would probably find that this was the only word in the text containing pkx and kxi (and possibly xie too), so it would rate it highly peculiar. The word fairy, by contrast, would get a low rating since fai, air and iry probably all occur elsewhere, perhaps quite often, in the passage being analysed. Having completed its analysis, it draws the user's attention to any words with a high peculiarity index. Like the previous method, it would fail to spot a high proportion of ordinary spelling errors, but it is quite good at spotting typing errors, which it was designed for. An advantage that it has over all dictionary-based methods is that it is not tied to English; it will work on passages of, say, French, German or Greek.
(The Morris there is Bob Morris).
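A rough reconstruction of the idea is straightforward. The scoring formula below (mean log of inverse trigram frequency within the text itself) is my guess at a plausible peculiarity index, not the exact formula from Morris & Cherry:

```python
import math
from collections import Counter

def word_trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

def peculiarity(text):
    # first pass: count every trigram in this particular text
    words = [w.strip(".,;!?").lower() for w in text.split()]
    counts = Counter(t for w in words for t in word_trigrams(w))
    total = sum(counts.values())

    # second pass: rare trigrams -> high score (a guessed index,
    # not the paper's exact formula)
    def score(w):
        tris = word_trigrams(w)
        if not tris:
            return 0.0
        return sum(math.log(total / counts[t]) for t in tris) / len(tris)

    return sorted(set(words), key=score, reverse=True)

# "pkxie" contains trigrams seen nowhere else in the text, so it tops the list
print(peculiarity("fairy airy fair dairy pkxie")[0])  # pkxie
```

Note that, exactly as the quote says, no dictionary is involved: the text is its own reference, so the same code flags suspicious typos in French or Greek just as well, as long as the typo produces unusual letter sequences.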